-
OoD evaluation is too narrow to substantiate claims about general “simulation mismatch” (or “new physics”) detection: experiments use a single OoD proxy (Gaussian blur), all maps come from a single simulation suite, and InD testing is performed on a held-out subset of the same dataset (Sec. 2.1, Sec. 3, Sec. 4). Moreover, the OoD samples appear to be blurred versions of the same underlying clean maps as the InD split, which makes the task closer to detecting a known corruption of familiar content than to detecting genuinely unseen simulator behavior.
Recommendation: In Sec. 3, broaden OoD tests beyond isotropic blur to cover qualitatively different structural shifts (e.g., anisotropic/PSF-like convolution, scale-dependent filtering that is not equivalent to smoothing, sharpening/artifacts, altered noise models, or baryonic-like transformations that modify profiles non-uniformly). If possible, evaluate on an independent simulation suite/code or the official challenge test protocol. If additional data are not feasible, explicitly narrow claims in Sec. 1 and Sec. 4 to “blur-like small-scale suppression within this simulation framework,” add a limitations paragraph in Sec. 4, and clearly state whether OoD maps share the same underlying realizations as InD validation maps (and whether this could inflate apparent separability).
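A minimal sketch of qualitatively different corruptions that could serve as additional OoD proxies (the function names, map shapes, and parameter values are illustrative assumptions, not the paper's setup):

```python
# Illustrative OoD corruptions beyond isotropic Gaussian blur; all parameters are placeholders.
import numpy as np
from scipy.ndimage import gaussian_filter

def anisotropic_blur(kappa_map, sigma_x=2.0, sigma_y=0.5):
    """PSF-like anisotropic smoothing: different widths along the two axes."""
    return gaussian_filter(kappa_map, sigma=(sigma_y, sigma_x))

def scale_dependent_filter(kappa_map, k_cut=0.3, slope=2.0):
    """Suppress power with a k-dependent transfer function that is not equivalent to a blur."""
    ny, nx = kappa_map.shape
    ky = np.fft.fftfreq(ny)[:, None]
    kx = np.fft.fftfreq(nx)[None, :]
    k = np.sqrt(kx**2 + ky**2)
    transfer = 1.0 / (1.0 + (k / k_cut)**slope)   # smooth, non-Gaussian roll-off
    return np.real(np.fft.ifft2(np.fft.fft2(kappa_map) * transfer))

def altered_noise(kappa_map, sigma_noise=0.02, seed=None):
    """Change the noise model rather than the signal (additive white noise here)."""
    rng = np.random.default_rng(seed)
    return kappa_map + rng.normal(0.0, sigma_noise, size=kappa_map.shape)
```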
-
The whitening step (central to the nuisance-robustness claims) is under-specified and only partially validated: the “intra-cosmology covariance” and the whitening transform are not defined mathematically, and it is unclear whether whitening is global or cosmology-dependent, which samples enter the covariance estimate, how ill-conditioning is handled, and how much dependence is actually removed for each nuisance parameter ($T_{\rm AGN}$, $f_0$, $\Delta z$) rather than just one (Sec. 2.2, Sec. 3.1, Sec. 3.4). There is also a conceptual ambiguity: whitening aims to suppress nuisance dependence, yet the flow is still conditioned on the nuisance parameters (Sec. 2.2–2.3).
Recommendation: Expand Sec. 2.2 with explicit equations and notation: define $\Sigma$ precisely (e.g., how you average/condition over cosmology vs nuisance), define $\mu$ if centering is applied, and state $z_{\rm white} = \Sigma^{-1/2}(z_{\rm PCA}-\mu)$. Specify the estimator (sample covariance vs shrinkage/diagonal loading), numerical method for $\Sigma^{-1/2}$ (SVD/eigendecomposition), and whether the transform is global or varies with $(\Omega_m,S_8)$. In Sec. 3.1/3.4 add quantitative diagnostics before/after whitening: correlations or mutual information between features/scores and each parameter ($\Omega_m$, $S_8$, $T_{\rm AGN}$, $f_0$, $\Delta z$). Include an ablation (WST$\rightarrow$PCA vs WST$\rightarrow$PCA$\rightarrow$whiten) reporting both OoD metrics and parameter-sensitivity to justify the step and clarify how it complements (rather than conflicts with) conditioning in Sec. 2.3.
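A minimal sketch of one concrete specification that would resolve these ambiguities, assuming a single global transform fit on the training split only, with diagonal loading for ill-conditioning (whether the paper instead averages per-cosmology covariances is exactly what Sec. 2.2 should state):

```python
# Assumed specification: z_white = Sigma^{-1/2} (z_pca - mu), Sigma estimated on training data only.
import numpy as np

def fit_whitening(z_pca_train, eps=1e-6):
    """Estimate mu and Sigma^{-1/2} from training features of shape [n_samples, n_dims]."""
    mu = z_pca_train.mean(axis=0)
    centered = z_pca_train - mu
    sigma = centered.T @ centered / (len(z_pca_train) - 1)   # sample covariance
    sigma += eps * np.eye(sigma.shape[0])                    # diagonal loading for ill-conditioning
    eigval, eigvec = np.linalg.eigh(sigma)                    # symmetric eigendecomposition
    sigma_inv_sqrt = eigvec @ np.diag(eigval**-0.5) @ eigvec.T
    return mu, sigma_inv_sqrt

def whiten(z_pca, mu, sigma_inv_sqrt):
    """Apply the frozen transform to any split (training, validation, OoD)."""
    return (z_pca - mu) @ sigma_inv_sqrt
```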
-
The anomaly score relies on profiling over $\theta$ via gradient-based optimization, but the optimization is not fully specified or validated: only “10 Adam steps” from an MLP initialization are described, without learning rate, $\beta$ parameters, constraints/ranges, parameterization/normalization, boundary handling, restarts, or evidence that $10$ steps approximates $\min_{\theta} \mathrm{NLL}(\theta)$ (Sec. 2.3, Sec. 3.3). If $\theta$ is unconstrained or under-optimized, scores can be biased in either direction: inflated NLL from poor minimization, or deflated NLL from drifting into unphysical regions. Fig. 2 also suggests the MLP initializer may collapse for $\Delta z$ and be noisy for the nuisance parameters, which could materially affect profiling quality.
Recommendation: In Sec. 2.3 explicitly define the feasible set $\Theta$ for Eq. (1) (e.g., box constraints matching the training priors) and describe the constrained optimization method (projection, squashing transforms, or penalty). Provide optimizer hyperparameters, step schedule, and whether the reported score is the minimum along the trajectory or the final iterate. In Sec. 3.3 add a convergence/sensitivity study: (i) number of steps ($0/5/10/20/50$), (ii) initialization (MLP vs random vs oracle/true $\theta$ if available), (iii) effect on AUC/pAUC and on score distributions. For Fig. 2, add quantitative metrics per parameter (RMSE/MAE/bias) and demonstrate that downstream profiling remains robust even when the initializer is poor (especially for $\Delta z$).
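A minimal sketch of what such a constrained profiling step could look like, assuming a conditional flow exposing a `flow_nll(features, theta)` callable and box constraints from the training priors (the bounds, learning rate, and step count are placeholders, not the paper's settings):

```python
# Box-constrained profile minimisation of the NLL over theta via a sigmoid reparameterisation.
import torch

def profile_nll(features, theta_init, flow_nll, lo, hi, n_steps=50, lr=0.05):
    """Optimise theta within [lo, hi]; report the minimum NLL seen along the trajectory."""
    lo, hi = torch.as_tensor(lo), torch.as_tensor(hi)
    # Unconstrained variable u such that theta = lo + (hi - lo) * sigmoid(u).
    frac = ((torch.as_tensor(theta_init) - lo) / (hi - lo)).clamp(1e-4, 1 - 1e-4)
    u = torch.logit(frac).clone().requires_grad_(True)
    opt = torch.optim.Adam([u], lr=lr, betas=(0.9, 0.999))
    best = float("inf")
    for _ in range(n_steps):
        opt.zero_grad()
        theta = lo + (hi - lo) * torch.sigmoid(u)   # always inside the prior box
        nll = flow_nll(features, theta)
        nll.backward()
        opt.step()
        best = min(best, nll.item())                # minimum along the trajectory, not final iterate
    return best
```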
-
Baseline comparisons are not yet quantitative or reproducible, which weakens the claims of superiority: baselines are described only qualitatively, some reported numbers do not appear to be tied to the same dataset/split/metric, and there is no table detailing implementations, hyperparameter tuning, or uncertainty estimates (Abstract, Sec. 2.4, Sec. 3.3, Sec. 4).
Recommendation: Add a dedicated baseline table (Sec. 2.4 or Sec. 3.3) with fully specified methods and results computed under the same protocol: e.g., power spectrum + Gaussian likelihood; WST ($417$) + Gaussian; WST$\rightarrow$PCA($3$) + Gaussian; unconditional flow; conditional flow without whitening; conditional flow without $\theta$-optimization (plug-in MLP $\theta$); and an oracle $\theta$ score if available. Report ROC AUC and pAUC (with bootstrap CIs) for each method on the same InD/OoD split. Clearly distinguish your own runs from numbers taken from external sources, and temper claims accordingly.
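For the uncertainty estimates, a minimal sketch of a shared evaluation routine that every baseline could be passed through (the array names and the convention that OoD is the positive class are assumptions):

```python
# ROC AUC with a bootstrap confidence interval, computed identically for all baselines.
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_with_bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """labels: 1 for OoD, 0 for InD; scores: anomaly scores (higher = more anomalous)."""
    rng = np.random.default_rng(seed)
    point = roc_auc_score(labels, scores)
    boot, n = [], len(labels)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)                 # resample with replacement
        if len(np.unique(labels[idx])) < 2:         # an AUC needs both classes present
            continue
        boot.append(roc_auc_score(labels[idx], scores[idx]))
    lo, hi = np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return point, (lo, hi)
```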
-
Key implementation details needed for reproducibility are missing or ambiguous across preprocessing and models: WST configuration (wavelet family, scales/orientations, padding, normalization, spatial averaging), PCA fitting protocol (training-only vs full data; standardization), whitening estimation protocol, and NSF/MLP architectures/training hyperparameters (Sec. 2.1–2.3, Sec. 3.1–3.3). This also connects to potential evaluation leakage if PCA/whitening are fit using validation data.
Recommendation: In Sec. 2.2–2.3 (or an Appendix), provide a complete specification: WST library and parameters; whether coefficients are averaged and how; PCA preprocessing (centering/scaling), and explicitly state PCA/whitening are fit on the training split only and then frozen. For NSF: number of coupling layers, spline bins, base distribution, conditioning network structure, hidden sizes/activations, optimizer settings, batch size, epochs, early stopping, weight decay. For MLP: architecture, loss, training settings, and validation metrics per parameter. Add a short ablation in Sec. 3.1/3.3 showing OoD performance vs number of PCs (e.g., $2/3/5/10$) to justify choosing $3$ beyond variance explained.
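A minimal sketch of the leakage-safe fitting protocol together with the suggested PC ablation, assuming a stand-in `ood_auc` function for the downstream flow-based scoring (the point is only the fit-on-train / transform-elsewhere split):

```python
# PCA is fit on the training split only, then frozen; the PC count is varied as an ablation.
from sklearn.decomposition import PCA

def pc_ablation(wst_train, wst_test, test_labels, ood_auc, n_pcs=(2, 3, 5, 10)):
    results = {}
    for n in n_pcs:
        pca = PCA(n_components=n).fit(wst_train)   # fit on the training split only
        z_train = pca.transform(wst_train)         # frozen transform applied everywhere else
        z_test = pca.transform(wst_test)
        results[n] = ood_auc(z_train, z_test, test_labels)
    return results
```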
-
Metric/reporting ambiguities affect interpretability: (i) pAUC definition/normalization is unclear and the stated random-guess baseline appears inconsistent with “mean TPR over $\mathrm{FPR}\in[0.001,0.05]$” (Sec. 2.4, Sec. 3.3); (ii) NLL values are strongly negative (Fig. 1, Sec. 3.2–3.3) but the exact NLL convention (per-sample/per-dimension, log base, reduction) is not stated, and claims such as “highly negative NLL indicates sharply peaked density” are not well-grounded without calibration; (iii) results appear to be shown for a single training run without seed variability.
Recommendation: In Sec. 2.4 write the exact pAUC formula used (including any division by interval width and any normalization) and correct/justify the random-guess baseline under that same definition. For NLL, specify the precise quantity plotted (per-sample vs averaged; per-dimension recommended), log base, and why negative values are expected for continuous densities; consider reporting bits/dim or NLL/dim and adding a simple likelihood sanity check (e.g., shuffled features or a Gaussian baseline). Report mean$\pm$std (or CI) over multiple random seeds for key metrics (AUC/pAUC) and, where feasible, show variability in training curves (Fig. 1) or summarize it in text.
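To make the first point concrete, a minimal sketch of one explicit pAUC convention, mean TPR over a fixed FPR window, together with the random-guess baseline it implies (TPR equals FPR under random guessing, so the baseline is the mean FPR over the window, about $0.0255$ for $[0.001, 0.05]$); whether the paper divides by the window width or normalizes differently is exactly what Sec. 2.4 should state:

```python
# One explicit pAUC definition: mean TPR over FPR in [0.001, 0.05], with its random baseline.
import numpy as np
from sklearn.metrics import roc_curve

def mean_tpr_pauc(labels, scores, fpr_lo=0.001, fpr_hi=0.05, n_grid=200):
    fpr, tpr, _ = roc_curve(labels, scores)
    grid = np.linspace(fpr_lo, fpr_hi, n_grid)
    tpr_on_grid = np.interp(grid, fpr, tpr)        # ROC is piecewise; interpolate onto the window
    pauc = tpr_on_grid.mean()                      # mean TPR over the FPR window
    random_baseline = grid.mean()                  # TPR = FPR under random guessing
    return pauc, random_baseline
```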