-
Empirical scope is narrow (a single dataset and essentially one OoD mechanism: Gaussian blur), and the baselines are limited and not clearly matched. The comparison to the published VCSF baseline appears not to be run under identical conditions (data split, blur strength, metric definition), making the “six-fold improvement” claim hard to interpret (Sec. 2.1, Sec. 3.1–3.2, Conclusions). This limits the evidence that the method generalizes to realistic simulation–data mismatches (noise, masking, PSF/beam effects, baryonic-modeling differences, small-scale power changes) rather than detecting one specific blur signature.
Recommendation: In Sec. 3, broaden the evaluation by (i) sweeping blur strength $\sigma$ and/or using a distribution of $\sigma$ values (report performance vs $\sigma$), and (ii) adding additional physically motivated OoD scenarios (e.g., shape-noise mismatches, masking/inpainting artifacts, correlated noise, altered small-scale power, alternative baryonic feedback prescriptions if available). Add apples-to-apples baselines under the *same* splits and OoD settings: Flow1-only, Flow2-only, a single conditional flow on concatenated features, and at least one simpler model on the same features (e.g., conditional Gaussian/GMM, one-class SVM, or logistic/regression classifier trained to separate InD vs blurred). For VCSF, either re-run it on your exact dataset/proxy/metric (preferred) or clearly label the comparison as approximate and soften the quantitative “six-fold” claim in Sec. 1 and Conclusions.
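To make the recommended $\sigma$ sweep concrete, a minimal sketch of reporting TPR at a fixed FPR as a function of blur strength (NumPy; all scores are synthetic stand-ins and `tpr_at_fpr` is a hypothetical helper, not the paper's code):

```python
import numpy as np

def tpr_at_fpr(ind_scores, ood_scores, fpr=0.01):
    """TPR at a fixed FPR: threshold at the (1 - fpr) quantile of InD scores."""
    thresh = np.quantile(ind_scores, 1.0 - fpr)
    return float(np.mean(ood_scores > thresh))

rng = np.random.default_rng(0)
ind = rng.normal(0.0, 1.0, size=5000)  # stand-in InD anomaly scores

# Report detection power as a function of blur strength sigma.
for sigma in [0.5, 1.0, 1.5, 2.0]:
    # Stand-in OoD scores: separation grows with sigma (illustrative only).
    ood = rng.normal(2.0 * sigma, 1.0, size=5000)
    print(f"sigma={sigma:.1f}  TPR@1%FPR={tpr_at_fpr(ind, ood):.3f}")
```

A table or curve of TPR vs $\sigma$ in this form would directly show whether performance degrades gracefully as the OoD signature weakens.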
-
The anomaly score is repeatedly described as a likelihood-ratio (or likelihood-ratio-type) statistic, but the presented construction is an averaged, calibration-normalized NLL and does not correspond to a Neyman–Pearson log-likelihood ratio without additional assumptions. Additionally, Eqs. (1)–(2) divide by $\sigma_k^2$ (variance), while surrounding language suggests standardization/z-scoring (division by $\sigma_k$). This ambiguity affects interpretation, units/weighting across views, and reproducibility (Sec. 1, Sec. 2.4, Sec. 3.2–3.3; Eqs. (1)–(2)).
Recommendation: In Sec. 2.4, either (a) provide a clear derivation/assumptions under which the proposed statistic approximates a likelihood ratio (explicitly defining the alternative hypothesis and how $\mu_k,\sigma_k$ relate), or (b) remove/replace “likelihood-ratio” terminology throughout (Sec. 1, Sec. 2.4, Sec. 3.2–3.3, captions, Conclusions) and describe it precisely as a calibrated/standardized (or precision-weighted) ensemble NLL score. Separately, resolve whether $\sigma_k^2$ vs $\sigma_k$ is intended: if $\sigma_k^2$ is deliberate, justify the weighting/units and update terminology accordingly; if not, correct Eqs. (1)–(2) and regenerate results if needed. Include a short comparison of alternative normalizations/fusions (raw NLL average, z-score, learned linear fusion on calibration set) in Sec. 3.3.
-
Calibration protocol may introduce evaluation leakage or optimistic reporting: Sec. 2.4 states that calibration uses 200 maps “drawn from the evaluation set” to compute $\mu_k$ and $\sigma_k^2$. Unless the calibration subset is strictly disjoint from all ROC/metric computation and its selection is fixed or repeated over multiple draws, the reported ROC/mean-TPR can be biased. It is also not explicit that calibration uses only InD maps (Sec. 2.4, Sec. 3.2–3.3).
Recommendation: Clarify in Sec. 2.4 and Sec. 3.2: (i) calibration uses *only* InD maps; (ii) the 200-map calibration subset is disjoint from both training and the held-out evaluation set used for ROC/TPR; (iii) how the subset is selected (random seed/protocol) and whether calibration is repeated across multiple draws. Report sensitivity to calibration size (e.g., 50/100/200/500) and to subset choice (multiple random draws) and show variability in mean TPR over $0.1\%\text{--}5\%$ FPR.
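A sketch of the leakage-free protocol under the assumptions above (InD-only calibration, disjoint from the evaluation remainder; `calibrate_and_eval` and all sizes are hypothetical):

```python
import numpy as np

def calibrate_and_eval(ind_scores, ood_scores, n_cal=200, n_draws=10, seed=0):
    """Repeatedly draw an InD-only calibration subset, standardize with its
    stats, and compute TPR@1%FPR on the disjoint held-out InD remainder."""
    rng = np.random.default_rng(seed)
    tprs = []
    for _ in range(n_draws):
        idx = rng.permutation(len(ind_scores))
        cal, ev = ind_scores[idx[:n_cal]], ind_scores[idx[n_cal:]]
        mu, sd = cal.mean(), cal.std()
        thresh = np.quantile((ev - mu) / sd, 0.99)  # 1% FPR on disjoint InD set
        tprs.append(np.mean((ood_scores - mu) / sd > thresh))
    return float(np.mean(tprs)), float(np.std(tprs))
```

Reporting the spread over draws (and over `n_cal` = 50/100/200/500) would make the sensitivity analysis requested above concrete.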
-
Core method claims (need for multi-view modeling, benefit of conditioning on $\theta$, and benefit of calibration/normalization) are not supported by ablation studies. Without ablations, it is unclear whether the gains come from the second view, from conditioning, from the blur being easily captured by one feature subset, or from the calibration/fusion heuristic (Sec. 2.2–2.4, Sec. 3.3).
Recommendation: Add an ablation subsection (Sec. 3.3 or new) reporting ROC/mean-TPR@$0.1\%\text{--}5\%$ FPR for: (i) Flow1 alone and Flow2 alone (each calibrated); (ii) ensemble fusion variants (raw-NLL mean, z-score mean, $\sigma^2$-weighted mean, max, learned linear fusion); (iii) conditional vs unconditional flows (remove $\theta$); and (iv) a single conditional flow on concatenated features. Use results to refine claims in Sec. 1 and Conclusions.
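An illustrative sketch of the requested fusion ablation on synthetic per-view NLLs (all data and separations are invented; here view 2 carries the OoD signal and view 1 is noisy, which is where calibration matters most):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in per-view NLLs: view 1 noisy/uninformative, view 2 carries the shift.
ind = rng.normal([0.0, 0.0], [5.0, 1.0], size=(4000, 2))
ood = rng.normal([0.0, 3.0], [5.0, 1.0], size=(4000, 2))
mu, sd = ind.mean(axis=0), ind.std(axis=0)

fusions = {
    "raw-mean": lambda x: x.mean(axis=1),
    "z-mean":   lambda x: ((x - mu) / sd).mean(axis=1),
    "z-max":    lambda x: ((x - mu) / sd).max(axis=1),
}

for name, fuse in fusions.items():
    thr = np.quantile(fuse(ind), 0.99)  # threshold at 1% FPR on InD
    print(f"{name:8s} TPR@1%FPR = {np.mean(fuse(ood) > thr):.3f}")
```

Even this toy setup shows why the ablation is informative: raw averaging lets the noisy view swamp the informative one, while calibrated fusion does not.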
-
Feature extraction is under-specified, especially for Fourier-space operations and the bispectrum proxy, limiting reproducibility and interpretability. Missing are: exact feature dimensionality per view; power-spectrum bin edges/spacing and mapping from pixels to $\ell$; treatment of windowing/apodization, complex modes, aliasing; definition of “high-frequency region”; precise construction of adjacent mode triplets $(k_1,k_2,k_1+k_2)$ and the three spectral magnitude moments; and explicit formulas/pooling axes for directional gradient statistics (Sec. 2.2.1–2.2.2).
Recommendation: Expand Sec. 2.2 (or add an Appendix) with implementation-level detail: (i) explicit formulas for gradient statistics and pooling (over pixels/orientations/scales); (ii) FFT pipeline details (windowing/apodization, $\ell$ mapping, handling of complex conjugate modes, any anti-aliasing/deconvolution); (iii) exact definitions/masks for “high-frequency region”; (iv) exact triplet selection procedure and bispectrum-proxy formula with unambiguous indices (e.g., $\phi_{k_1+k_2}$); (v) the final per-view feature vector dimension and ordering (a table is ideal).
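As one example of the level of detail needed, a toy 1-D analogue of an adjacent-triplet bispectrum proxy (the paper's actual 2-D construction, index conventions, and moment definitions must still be specified by the authors; `bispectrum_proxy_1d` is purely illustrative):

```python
import numpy as np

def bispectrum_proxy_1d(field):
    """Toy 1-D adjacent-triplet bispectrum proxy:
    B(k1, k2) = F[k1] * F[k2] * conj(F[k1 + k2]) for neighboring modes."""
    F = np.fft.rfft(field)
    n = len(F)
    vals = []
    for k1 in range(1, n // 2):
        k2 = k1 + 1                      # adjacent triplet (k1, k2, k1 + k2)
        if k1 + k2 < n:
            vals.append(F[k1] * F[k2] * np.conj(F[k1 + k2]))
    mags = np.abs(np.array(vals))
    # Three magnitude moments as the per-view summary features.
    return np.array([mags.mean(), mags.std(), mags.max()])
```

An appendix spelling out the 2-D equivalent at this level (triplet selection, conjugate handling, and exactly which moments enter the feature vector) would resolve the ambiguity.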
-
Normalizing flow architecture and training protocol are insufficiently specified to reproduce results. “Affine coupling” is broad; details on coupling design, number of layers/blocks, hidden widths, activations, normalization layers (actnorm/batchnorm), conditioning injection ($\theta$ embedding/concatenation locations), optimizer and regularization settings (e.g., AdamW hyperparameters), batch size, learning-rate schedule, early stopping/validation, and whether both flows share an architecture are unclear (Sec. 2.3).
Recommendation: In Sec. 2.3 (or Appendix), provide a complete specification: flow family (e.g., RealNVP/Glow style), number of coupling layers/blocks, permutation strategy, MLP widths/depth, activation functions and where applied, scale-parameter constraints (if any), normalization layers, $\theta$ preprocessing/embedding and injection points, optimizer hyperparameters (LR, weight decay), batch size, epochs/steps, validation split and early-stopping criterion, and random-seed handling. Include a small architecture table for both flows.
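For reference, the requested specification amounts to something like the following RealNVP-style conditional affine coupling sketch (NumPy forward pass only; weights, dimensions, and the $\theta$ injection point are placeholders, not the paper's architecture):

```python
import numpy as np

def affine_coupling_forward(x, theta, W1, b1, W2, b2):
    """One conditional affine coupling layer (RealNVP-style sketch).

    The first half x_a passes through unchanged; a small MLP conditioned on
    [x_a, theta] produces log-scale s and shift t for the second half x_b.
    """
    d = x.shape[-1] // 2
    x_a, x_b = x[..., :d], x[..., d:]
    h = np.tanh(np.concatenate([x_a, theta], axis=-1) @ W1 + b1)
    st = h @ W2 + b2
    s, t = np.tanh(st[..., :d]), st[..., d:]  # bounded log-scale for stability
    y_b = x_b * np.exp(s) + t
    log_det = s.sum(axis=-1)                  # contribution to log|det J|
    return np.concatenate([x_a, y_b], axis=-1), log_det
```

Every choice visible here (where $\theta$ enters, how the log-scale is constrained, hidden width, activation) is currently unspecified in Sec. 2.3 and would need one line each in an architecture table.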
-
Robustness and uncertainty are presented mostly qualitatively (e.g., the Sec. 3.4 score distributions across cosmologies) without quantitative per-cosmology metrics or confidence intervals, and the headline mean TPR (0.8919) is reported without uncertainty over seeds, training runs, or calibration sampling. For scientific deployment, quantifying this variability is essential (Sec. 3.2, Sec. 3.4).
Recommendation: Add uncertainty quantification: report mean$\pm$std (or CI) over multiple training seeds and multiple calibration draws for the main metrics (mean TPR over $0.1\%\text{--}5\%$ FPR; also TPR@FPR$=0.1\%,1\%,5\%$). In Sec. 3.4, report per-cosmology quantitative summaries (e.g., median InD score, median OoD score, separation, TPR@1% FPR) and summarize their distribution across the 100 cosmologies (table or compact plots).
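A sketch of one inexpensive way to report the requested uncertainty: a bootstrap confidence interval on the mean TPR over the $0.1\%$–$5\%$ FPR band (synthetic scores; this complements, not replaces, retraining over multiple seeds, which is costlier):

```python
import numpy as np

def mean_tpr_over_fpr(ind, ood, fprs=np.geomspace(0.001, 0.05, 20)):
    """Mean TPR over an FPR band, thresholding on InD score quantiles."""
    thresholds = np.quantile(ind, 1.0 - fprs)
    return float(np.mean([np.mean(ood > t) for t in thresholds]))

def bootstrap_ci(ind, ood, n_boot=200, seed=0):
    """95% bootstrap CI on the mean TPR over the 0.1%-5% FPR band."""
    rng = np.random.default_rng(seed)
    stats = [mean_tpr_over_fpr(rng.choice(ind, len(ind)),
                               rng.choice(ood, len(ood)))
             for _ in range(n_boot)]
    return np.percentile(stats, [2.5, 97.5])
```

Reporting the headline metric as point estimate plus such an interval (and as mean$\pm$std over seeds) would make 0.8919 interpretable.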
-
Inconsistency in the OoD proxy parameter $\sigma$ (blur strength): methods state $\sigma=2.0$ pixels, while Figure 1 caption (and related text) uses $\sigma=1.5$ pixels. Because difficulty depends strongly on $\sigma$, this discrepancy directly affects the validity of reported results (Sec. 2.1, Sec. 3.1; Figure 1).
Recommendation: Resolve and align $\sigma$ everywhere (Sec. 2.1, Sec. 3.1, Figure 1 caption/legend). If Figure 1 uses a different $\sigma$ only for visualization, state that explicitly. If experiments used multiple $\sigma$ values, label results per $\sigma$ and report them separately (ideally as part of a $\sigma$ sweep).