-
The references are inconsistent and, in multiple places, clearly out of scope: several entries carry implausible years, and unrelated topics are cited as core background for photo-z, covariate shift, and calibration (Sec. 1; Secs. 2.1–2.4; References). Key relevant prior work on photo-z posteriors, LSST/DESC photo-z efforts, ZBoost/FlexZBoost-like CDE approaches, density-ratio estimation/importance weighting, and calibration/temperature scaling is missing or mis-cited. This substantially undermines the paper’s scholarly grounding and the traceability of its claims.
Recommendation: Perform a systematic citation audit across Secs. 1–4: remove placeholder/unrelated entries; replace with domain-appropriate references (photo-z posteriors, LSST/DESC challenge context, conditional density estimation for redshifts, covariate shift/density-ratio estimation, posterior calibration/temperature scaling, SHAP/TreeSHAP). Ensure in-text citation indices match the corrected bibliography. Add a short Related Work subsection after Sec. 1 that explicitly positions this pipeline relative to existing reweighting (e.g., SOM/kNN/histogram), CDE photo-z methods, and calibration approaches.
-
The experimental protocol is under-specified and may permit subtle leakage. The manuscript does not clearly report dataset sizes and train/validation/test splits per Task Set and simulation/depth, nor whether the *target photometric sample used for density-ratio estimation* overlaps with the final evaluation target set (Sec. 2.1; Sec. 2.3; Sec. 3). Even if no labels are used, fitting the target feature distribution on the same sample later used for evaluation can produce optimistic results or obscure sensitivity. This is a particular concern for bin-wise temperature scaling (BTS), which uses PIT on a validation set.
Recommendation: In Sec. 2.1 (and/or an appendix), add a table listing counts for Cardinal/Flagship $\times$ 1yr/10yr $\times$ Task Set 1/2, and explicit split proportions/counts (train/val/test) plus whether splits are stratified (e.g., by $z$ or $i$-mag). In Sec. 2.3–2.4, clearly define which split is used for (i) training FlexZBoost, (ii) training the domain classifier / estimating weights (unlabeled target), (iii) fitting BTS temperatures, and (iv) final held-out evaluation. If the same target set is currently used for weight-fitting and evaluation, switch to a held-out target subset or use cross-fitting (fit weights on one target fold, evaluate on another) and report that protocol.
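To make the cross-fitting suggestion concrete, here is a minimal sketch of how disjoint target folds could be generated. The function name `cross_fit_folds` and its parameters are hypothetical and do not come from the manuscript. Weights would be fitted using only `fit_idx` objects and metrics reported only on `eval_idx` objects.

```python
import numpy as np

def cross_fit_folds(n_target, n_folds=2, seed=0):
    """Yield (fit_idx, eval_idx) index pairs over the unlabeled target sample.

    fit_idx  -> target objects used to train the domain classifier / weights
    eval_idx -> held-out target objects used only for final evaluation
    The two index sets are disjoint for every fold.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_target)          # shuffle once, reproducibly
    folds = np.array_split(perm, n_folds)     # near-equal fold sizes
    for k in range(n_folds):
        eval_idx = folds[k]
        fit_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        yield fit_idx, eval_idx
```

With `n_folds=2` this reduces to fitting weights on one half of the target sample and evaluating on the other, then swapping. Reporting the seed and fold count would be enough to make the protocol reproducible.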
-
Density-ratio (importance-weight) estimation is central to Task Set 2 gains but is not reproducible and lacks stability diagnostics (Sec. 2.3; Sec. 3.2–3.3). Missing items include: exact feature list and preprocessing order (NaN handling/flags/standardization), how the train-vs-test classifier dataset is constructed (sample sizes, balancing/priors), full XGBoost configuration/tuning/seed, classifier performance (ROC–AUC and calibration), and quantitative details/justification for 99th-percentile clipping. Eq. (1) also yields odds $P(\mathrm{test}|x)/P(\mathrm{train}|x)$ and needs an explicit class-prior factor to equal $p_{\mathrm{test}}(x)/p_{\mathrm{train}}(x)$ unless the classifier is trained with equal priors.
Recommendation: Expand Sec. 2.3 to specify: (i) features used (confirm if the same 35 standardized features from Sec. 2.1), and the exact preprocessing sequence; (ii) domain-classifier training set construction, including whether classes are balanced and what priors are implied; (iii) the full XGBoost hyperparameter set, early stopping, validation strategy, and random seeds; and (iv) the numeric clipping threshold (actual value at the 99th percentile) and rationale. Add diagnostics: weight distribution (before/after clipping), effective sample size (ESS) after weighting, and sensitivity to clipping (e.g., 95/99/99.5%). Around Eq. (1), explicitly state and handle the prior-correction relation $p_{\mathrm{test}}/p_{\mathrm{train}} = (P(\mathrm{test}|x)/P(\mathrm{train}|x))\cdot(\pi_{\mathrm{train}}/\pi_{\mathrm{test}})$, or document that $\pi_{\mathrm{train}}=\pi_{\mathrm{test}}$ by construction.
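The prior-correction relation and the requested diagnostics can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the class priors are taken to be the sample fractions in the domain-classifier training set, and the clipping percentile is a free parameter.

```python
import numpy as np

def importance_weights(p_test_given_x, n_train, n_test, clip_percentile=99.0):
    """Convert domain-classifier probabilities into density-ratio weights.

    p_test_given_x : classifier output P(test | x) for each TRAINING object
    n_train, n_test: class counts used to fit the classifier (assumed to
                     define the priors pi_train and pi_test)
    """
    # Keep probabilities away from 0 and 1 so the odds stay finite.
    p = np.clip(np.asarray(p_test_given_x, dtype=float), 1e-6, 1.0 - 1e-6)
    odds = p / (1.0 - p)                    # P(test|x) / P(train|x)
    prior_ratio = n_train / n_test          # pi_train / pi_test
    w_raw = odds * prior_ratio              # estimate of p_test(x) / p_train(x)
    cap = np.percentile(w_raw, clip_percentile)
    w_clipped = np.minimum(w_raw, cap)      # upper clipping only
    return w_raw, w_clipped, cap

def effective_sample_size(w):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    w = np.asarray(w, dtype=float)
    return w.sum() ** 2 / np.sum(w ** 2)

if __name__ == "__main__":
    # Synthetic probabilities only; real values would come from the classifier.
    rng = np.random.default_rng(0)
    p = rng.beta(2.0, 5.0, size=5000)
    for pct in (95.0, 99.0, 99.5):          # sensitivity to the clipping level
        _, w, cap = importance_weights(p, n_train=5000, n_test=5000,
                                       clip_percentile=pct)
        print(f"clip={pct}%  cap={cap:.2f}  ESS={effective_sample_size(w):.0f}")
```

When the classifier is trained on balanced classes, `prior_ratio` equals 1 and the weights reduce to the plain odds. Reporting `cap` and the ESS before and after clipping for each Task Set 2 configuration would address most of the stability concerns raised above.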
-
Bin-wise temperature scaling (BTS) is crucial to the calibration conclusions but is not defined precisely enough to assess robustness or overfitting risk (Sec. 2.4; Sec. 2.5; Sec. 3.1–3.3). The paper does not fully specify: bin-edge construction for the $5\times5$ $(z_{\rm mode}, i\text{-mag})$ grid (fixed vs quantiles), the exact scalar objective (“sum of PIT-KS and PIT-RMSE”) and its mathematical definition, the optimizer/search method and constraints ($T_b>0$; any regularization/smoothing across bins), treatment of sparse bins, and, most importantly, a clear separation of calibration data from final test evaluation.
Recommendation: In Sec. 2.4, provide an explicit algorithm: define bins and edges, provide formulas for PIT-KS and PIT-RMSE and how they combine into a single loss, specify the optimization method (grid search ranges/steps or continuous optimizer), and constraints/regularization. Report per-bin counts (typical and minimum). Clearly state the data split used to fit $T_b$ and confirm evaluation is performed on a strictly held-out test set not used for BTS. Add a small robustness check (appendix acceptable): alternative binning (e.g., $4\times4$ vs $5\times5$) and/or alternative calibration objectives (e.g., NLL/CRPS/coverage-based) to show conclusions are not an artifact of one design.
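As an example of the level of detail that would remove the ambiguity, the sketch below fits one temperature for a single bin. Every definition in it is an assumption made for illustration, since the manuscript does not state them: the power transform is taken as $p^{1/T}$ followed by renormalization on the $z$-grid, PIT-KS is the Kolmogorov-Smirnov distance of the PIT values to $U(0,1)$, PIT-RMSE is the RMS gap between sorted PIT values and uniform quantiles, and the search is a log-spaced grid over $T$. Each PDF is assumed to carry positive mass on the grid.

```python
import numpy as np

def apply_temperature(pdfs, z_grid, T):
    """Power-transform gridded PDFs as p^(1/T) and renormalize (assumed form).

    Negative values are rectified to zero first; T > 1 broadens, T < 1 sharpens.
    """
    scaled = np.clip(pdfs, 0.0, None) ** (1.0 / T)
    dz = np.diff(z_grid)
    norm = (0.5 * (scaled[:, 1:] + scaled[:, :-1]) * dz).sum(axis=1)
    return scaled / norm[:, None]

def pit_values(pdfs, z_grid, z_true):
    """PIT_i = CDF_i(z_true_i), with trapezoidal CDFs built on the grid."""
    dz = np.diff(z_grid)
    area = 0.5 * (pdfs[:, 1:] + pdfs[:, :-1]) * dz
    cdf = np.concatenate([np.zeros((len(pdfs), 1)), np.cumsum(area, axis=1)],
                         axis=1)
    cdf /= cdf[:, -1:]                       # force each CDF to end at 1
    return np.array([np.interp(z, z_grid, c) for z, c in zip(z_true, cdf)])

def pit_loss(pit):
    """Assumed scalar objective: PIT-KS + PIT-RMSE against U(0,1)."""
    u = np.sort(pit)
    n = len(u)
    ks = max(np.max(np.arange(1, n + 1) / n - u), np.max(u - np.arange(n) / n))
    rmse = np.sqrt(np.mean((u - (np.arange(n) + 0.5) / n) ** 2))
    return ks + rmse

def fit_bin_temperature(pdfs, z_grid, z_true, T_grid=None):
    """Grid-search T > 0 for one bin using calibration-split objects only."""
    if T_grid is None:
        T_grid = np.geomspace(0.25, 8.0, 41)
    losses = [pit_loss(pit_values(apply_temperature(pdfs, z_grid, T),
                                  z_grid, z_true))
              for T in T_grid]
    best = int(np.argmin(losses))
    return T_grid[best], losses[best]
```

In a full pipeline this function would be called once per $(z_{\rm mode}, i\text{-mag})$ bin on the calibration split, with the fitted $T_b$ then applied unchanged to the held-out test split. Sparse bins would need a fallback, for example a minimum object count below which a global temperature is used.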
-
FlexZBoost (core CDE model) and SHAP analysis are not sufficiently specified for reproducibility, and the validity of densities used in Eq. (2) is unclear (Sec. 2.2; Sec. 2.4; Sec. 3.4). The manuscript omits the explicit basis family, how non-negativity/normalization of $p(z|x)$ is enforced (relevant because power transforms require $p_{\rm raw}(z)\geq0$), hyperparameter tuning strategy, and whether separate models are trained per Task Set/simulation/depth. For SHAP, it is unclear which regressor(s) are explained (coefficients per basis term), which SHAP variant is used, and how SHAP values are aggregated across basis models and validated for stability.
Recommendation: In Sec. 2.2, write the explicit form of the basis expansion for $p(z|x)$, name the basis functions, and state how densities are guaranteed non-negative and normalized on the $z$-grid (or document any rectification + renormalization before Eq. (2), and how zeros are handled numerically). Provide the full XGBoost hyperparameters, seeds, and whether they are tuned or fixed; clarify model-training multiplicity across datasets (Task Set 1/2; Cardinal/Flagship; 1yr/10yr). In Sec. 3.4, document SHAP precisely: which trained configuration(s) are analyzed, TreeSHAP vs other, which data split is used, any subsampling, and how SHAP is aggregated across coefficient regressors; add a brief stability check (e.g., feature-rank variability across subsamples/seeds).
-
Baseline competitiveness and statistical uncertainty are not adequately established (Sec. 3.1–3.3; Table 3). The “Naive Baseline” is reported with approximate ($\sim$) values from preliminary runs, and tables report single numbers without uncertainty. Without exact baselines under the same split/protocol and variability estimates, it is difficult to quantify the true benefit of importance weighting and BTS or judge significance of modest differences (e.g., between simulations or depths). External baselines (common photo-z or CDE methods) are also absent.
Recommendation: Replace approximate baseline entries in Table 3 with exact results computed under the final protocol. Report uncertainty via multiple random seeds (three would already help) and/or bootstrap CIs over the test set for key metrics (bias, $\Sigma_{\rm MAD}$, outlier rate, CDE loss, PIT-KS). If feasible, add at least 1–2 external baselines under the same splits (e.g., a standard ML regressor with post-hoc density construction, a template-based code, or another CDE approach) or clearly justify why this is not possible and limit claims accordingly.
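A percentile bootstrap over the test set is inexpensive to add. The sketch below shows one way to do it. The metric definitions are assumptions, since the manuscript's exact conventions are not stated: residuals are scaled as $(z_{\rm phot}-z_{\rm true})/(1+z_{\rm true})$, $\Sigma_{\rm MAD}$ is $1.4826$ times the median absolute deviation, and outliers are objects with scaled residual above 0.15 in absolute value.

```python
import numpy as np

def scaled_residuals(z_phot, z_true):
    return (z_phot - z_true) / (1.0 + z_true)

def sigma_mad(z_phot, z_true):
    """1.4826 * median absolute deviation of the scaled residuals."""
    dz = scaled_residuals(z_phot, z_true)
    return 1.4826 * np.median(np.abs(dz - np.median(dz)))

def outlier_rate(z_phot, z_true, threshold=0.15):
    """Fraction of objects with |scaled residual| above an assumed threshold."""
    return np.mean(np.abs(scaled_residuals(z_phot, z_true)) > threshold)

def bootstrap_ci(metric, z_phot, z_true, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample test objects with replacement."""
    rng = np.random.default_rng(seed)
    n = len(z_true)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)     # one bootstrap resample
        stats[b] = metric(z_phot[idx], z_true[idx])
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return metric(z_phot, z_true), lo, hi

if __name__ == "__main__":
    # Synthetic point estimates only; real inputs would be the test-set results.
    rng = np.random.default_rng(42)
    z_true = rng.uniform(0.1, 2.0, 3000)
    z_phot = z_true + 0.03 * (1.0 + z_true) * rng.standard_normal(3000)
    for name, metric in (("sigma_MAD", sigma_mad), ("outlier rate", outlier_rate)):
        point, lo, hi = bootstrap_ci(metric, z_phot, z_true)
        print(f"{name}: {point:.4f}  [{lo:.4f}, {hi:.4f}]")
```

The bootstrap captures test-set sampling noise only. Variation from model training would still need to be measured with repeated seeds.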
-
Claims about Roman NIR “indispensability” rely primarily on SHAP feature importance, which is correlational and can be distorted by multicollinearity; there is no targeted LSST-only vs LSST+Roman performance ablation (Sec. 3.4; Sec. 4). The manuscript also uses strong causal language and quantitative contribution statements (e.g., “over 30% of predictive power”) without direct controlled comparisons.
Recommendation: Add a controlled ablation in Sec. 3.4 (or appendix): train/evaluate (i) LSST-only and (ii) LSST+Roman models under the same Task Set 1/2 protocols and compare $\Sigma_{\rm MAD}$/outliers/CDE/PIT (overall and in faint/high-$z$ slices). If this cannot be added, temper conclusions in Sec. 3.4 and Sec. 4 to clearly label Roman-band statements as suggestive, and avoid causal terms like “indispensable” absent ablation evidence.
-
Assumptions, overlap, and failure modes under severe covariate shift are not quantified (Sec. 2.1; Sec. 3.2–3.4; Sec. 4). Importance weighting assumes covariate shift with support overlap (no regions where target $x$ has little/no training support). With Task Set 2 (shallow spec vs deep photo), support mismatch is plausible; the paper does not quantify overlap or show how weights behave in low-support regions, nor discuss safeguards (e.g., reject option or uncertainty inflation).
Recommendation: Add an explicit overlap/coverage diagnostic: weight histograms + ESS (can overlap with the Sec. 2.3 additions), plus a plot of weight magnitude vs $i$-mag/$z$, or a nearest-neighbor distance/SOM occupancy measure showing target regions not covered by train. In Sec. 4, add a limitations subsection describing covariate-shift assumptions, support mismatch, sparse-bin calibration risks, and how the method should be used in practice (e.g., clipping policies, flagging/excluding extrapolative objects, or inflating uncertainties).
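One simple form the nearest-neighbor coverage diagnostic could take is sketched below. It is an illustration only: the function names, the choice of $k$, and the 99th-percentile threshold are hypothetical, features are assumed to be standardized beforehand, and the brute-force distance matrix is suitable only for subsamples of a few thousand objects.

```python
import numpy as np

def kth_nn_distance(queries, reference, k, exclude_self=False):
    """Euclidean distance from each query to its k-th nearest reference point.

    Set exclude_self=True only when queries and reference are the same array,
    so that each point is not counted as its own neighbor.
    """
    d2 = ((queries[:, None, :] - reference[None, :, :]) ** 2).sum(axis=2)
    if exclude_self:
        np.fill_diagonal(d2, np.inf)
    return np.sqrt(np.partition(d2, k - 1, axis=1)[:, k - 1])

def coverage_flags(train_x, target_x, k=5, pct=99.0):
    """Flag target objects lying farther from the training set than is typical
    for training objects themselves (a proxy for missing support)."""
    ref = kth_nn_distance(train_x, train_x, k, exclude_self=True)
    threshold = np.percentile(ref, pct)
    tgt = kth_nn_distance(target_x, train_x, k)
    return tgt > threshold, threshold

if __name__ == "__main__":
    # Synthetic 2-D features; the second half of the target is shifted
    # far outside the training support.
    rng = np.random.default_rng(0)
    train = rng.standard_normal((500, 2))
    target = np.vstack([rng.standard_normal((200, 2)),
                        rng.standard_normal((200, 2)) + 10.0])
    flags, thr = coverage_flags(train, target)
    print(f"threshold={thr:.3f}  uncovered fraction={flags.mean():.2f}")
```

Plotting the flagged fraction against $i$-mag and $z_{\rm mode}$ would show directly where Task Set 2 predictions are extrapolative, and the same flags could drive a reject option or an uncertainty-inflation rule.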