-
The manuscript is conceptually ambiguous about what is “ILC-free” versus “ILC-anchored” in STsep, and the strongest reported gains come from a hybrid pipeline (ILC initialisation + simulation-trained amplitude priors + contamination ensemble), not from the ILC-free STsep described in parts of Sec. 3.3 and highlighted in Sec. 4.4. As written, some framing in Sec. 1 and Sec. 6 can be read as demonstrating an ILC-free replacement for ILC under realistic noise, which is contradicted by the three-frequency failure and by the reliance on ILC initialisation in Sec. 4.6.
Recommendation: Make the distinction explicit and consistent throughout Sec. 3.3, Sec. 4.4, Sec. 4.6, Sec. 5, and Sec. 6: (i) a purely ILC-free, SED-difference–driven STsep variant that fails at realistic SO noise; versus (ii) the practical best-performing hybrid that is initialised from the six-frequency ILC map and stabilised by amplitude priors and a contamination ensemble. Update the Abstract/Sec. 1/Sec. 6 to describe the successful method explicitly as “ILC-initialised STsep” (or similar), and avoid calling that configuration “ILC-free.” Also state unambiguously what s0 is in the λa‖s−s0‖² term for the canonical six-frequency results (Sec. 3.3, Sec. 4.6), since different parts of the text suggest different initialisations.
-
The separate contributions of the ingredients in the best-performing STsep configuration (ST statistics, ILC anchoring, amplitude priors, contamination ensemble) are not isolated. This limits interpretability (“why does it work?”) and makes it hard to generalise the method or to compare it fairly with alternative post-processing/denoising approaches.
Recommendation: Add an ablation study (ideally a compact table in Sec. 4.6 or an Appendix) on the same 20 patches at 150 GHz, reporting r, RMS (or σe/σt), D_ST, KS distance, and tail-recovery statistics for: (a) ILC baseline; (b) STsep initialised from ILC with priors (current best); (c) STsep initialised from ILC without mean/variance priors (or with weakened priors); (d) STsep initialised from SED-difference with priors; (e) STsep initialised from noise/zero with priors. If feasible, include a run where the ST loss is removed but priors/proximity remain (to quantify what portion is due to “ST morphology” versus amplitude anchoring and proximity regularisation).
-
STsep’s performance depends critically on truth-derived amplitude priors (μ*, V*) and on a FLAMINGO-based contamination ensemble (Sec. 3.3, 4.5–4.6), but robustness to prior/ensemble mis-specification is not quantified. This is a central “bigger-picture” risk for transfer to real data, where the tSZ/foreground statistics and the noise will not exactly match the training suite.
Recommendation: Add a robustness/mismatch experiment (Sec. 4.6, Sec. 5.2, or Appendix): perturb μ* and V* by ±(10–50)% (and/or rescale contamination-ensemble amplitudes) and quantify changes in r, σe/σt, D_ST, KS, and tail recovery. If variants/sub-volumes exist, estimate priors on one subset and apply to another to emulate mismatch. Use the results to revise Sec. 6 to state clearly that current gains are conditional on reasonably well-matched simulation-assisted priors, and outline how such priors might be specified in practice (e.g., suites of imperfect foreground/tSZ simulations, cross-checks with external data).
-
Claims of “beating ILC and FoCUS on every metric” rely primarily on means over 20 patches without rigorous uncertainty quantification or hypothesis testing, despite visible patch-to-patch scatter (Sec. 4.1–4.2, 4.6–4.7). FoCUS-vs-ILC differences are explicitly within scatter; the same standard should be applied to STsep-vs-ILC comparisons, especially for modest absolute changes (e.g., r from ≈0.14 to ≈0.17).
Recommendation: Augment Sec. 4.6–4.7 and the relevant figures/tables (e.g., Figs. 4, 9, 11; Table 4 if present) with uncertainty estimates across patches: mean±1σ, standard errors, and/or bootstrap confidence intervals for each key scalar metric. For direct comparisons, report the fraction of patches where STsep improves over ILC (paired comparison) and include a paired test (paired t-test or Wilcoxon signed-rank) for the principal metrics. Revise the wording in the Abstract/Sec. 1/Sec. 6 so that each claim matches its quantified statistical strength (e.g., “significant” only where a paired test supports it, “modest” otherwise).
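To make this request concrete, the following is a minimal sketch of a paired analysis over patches, assuming only NumPy. The function name `paired_summary` and the synthetic inputs are illustrative and are not taken from the manuscript. A sign-flip permutation test is used here in place of the Wilcoxon signed-rank test purely to keep the example dependency-free; `scipy.stats.wilcoxon` or `scipy.stats.ttest_rel` would serve equally well.

```python
import numpy as np

def paired_summary(metric_a, metric_b, n_boot=10000, n_perm=10000, seed=0):
    """Paired comparison of one scalar metric measured on the same patches.

    metric_a, metric_b: per-patch values for two methods, in the same patch order.
    Returns the win fraction of b over a, the mean paired difference, a
    bootstrap 95% interval on that mean, and a two-sided permutation p-value.
    """
    rng = np.random.default_rng(seed)
    d = np.asarray(metric_b, dtype=float) - np.asarray(metric_a, dtype=float)
    n = d.size
    # Bootstrap: resample patches with replacement and recompute the mean difference.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_means = d[idx].mean(axis=1)
    ci_lo, ci_hi = np.percentile(boot_means, [2.5, 97.5])
    # Sign-flip permutation test: under the null hypothesis of no difference,
    # the sign of each paired difference is equally likely to be + or -.
    signs = rng.choice([-1.0, 1.0], size=(n_perm, n))
    perm_means = (signs * d).mean(axis=1)
    p_value = (np.sum(np.abs(perm_means) >= abs(d.mean())) + 1) / (n_perm + 1)
    return {
        "win_fraction": float(np.mean(d > 0)),
        "mean_diff": float(d.mean()),
        "ci95": (float(ci_lo), float(ci_hi)),
        "p_value": float(p_value),
    }

# Synthetic stand-in for 20 per-patch correlation coefficients (NOT manuscript data).
rng = np.random.default_rng(1)
r_ilc = 0.14 + 0.03 * rng.standard_normal(20)
r_stsep = r_ilc + 0.03 + 0.02 * rng.standard_normal(20)
print(paired_summary(r_ilc, r_stsep))
```

With only 20 patches, the bootstrap interval and the permutation p-value are themselves noisy, so reporting the win fraction alongside them is a useful sanity check.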
-
The power-spectrum discussion indicates that the split-cross power remains above the beam-matched truth (by factors of ~2–5) because of residual contaminants common to the splits, whereas the manuscript’s headline metrics are all map-space quantities (r, RMS, KS, tails). Without harmonic-space correlation or transfer-function diagnostics, it is hard to reconcile “better maps” with “excess power,” or to interpret the implications for typical tSZ science analyses (Sec. 4.1, Fig. 4, Sec. 5.1).
Recommendation: Add harmonic-space recovery diagnostics alongside split-cross power: (i) a binned multipole-dependent correlation coefficient, e.g. ρℓ = Cℓ^{m×t}/sqrt(Cℓ^{m×m} Cℓ^{t×t}), where m is the recovered map and t the beam-matched truth; and/or (ii) a transfer function estimate Tℓ = Cℓ^{m×t}/Cℓ^{t×t}. These separate “extra residual power” from “true tSZ recovery.” Also quantify split-to-split determinism/noise sensitivity by reporting Var(m_splitA − m_splitB) (or an equivalent) for each method (ILC, FoCUS, STsep), to support statements that STsep outputs are more deterministic across splits.
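As an illustration of the two diagnostics, the sketch below computes binned ρℓ and Tℓ for a pair of flat-sky patches with NumPy FFTs. It assumes square-pixel, periodic patches and applies no apodisation, pixel-window, or beam correction, so it is a starting point rather than a substitute for the manuscript's own spectrum pipeline. The function name, patch size, and pixel scale are hypothetical.

```python
import numpy as np

def binned_rho_and_transfer(m, t, pix_rad, bin_edges):
    """Binned correlation coefficient and transfer function between two flat-sky maps.

    m: recovered map, t: truth map (same shape), pix_rad: pixel size in radians,
    bin_edges: increasing multipole bin edges. Empty bins are returned as NaN.
    """
    ny, nx = m.shape
    fm, ft = np.fft.fft2(m), np.fft.fft2(t)
    # Flat-sky multipole of each Fourier mode: ell = 2*pi*|k|, k in cycles per radian.
    lx = 2.0 * np.pi * np.fft.fftfreq(nx, d=pix_rad)
    ly = 2.0 * np.pi * np.fft.fftfreq(ny, d=pix_rad)
    ell = np.hypot(lx[None, :], ly[:, None]).ravel()
    # Per-mode (un-normalised) spectra; the normalisation cancels in both ratios.
    p_mt = (fm * np.conj(ft)).real.ravel()
    p_mm = (np.abs(fm) ** 2).ravel()
    p_tt = (np.abs(ft) ** 2).ravel()
    which = np.digitize(ell, bin_edges) - 1
    n_bins = len(bin_edges) - 1
    rho = np.full(n_bins, np.nan)
    transfer = np.full(n_bins, np.nan)
    for b in range(n_bins):
        sel = which == b
        if sel.any():
            c_mt, c_mm, c_tt = p_mt[sel].mean(), p_mm[sel].mean(), p_tt[sel].mean()
            rho[b] = c_mt / np.sqrt(c_mm * c_tt)
            transfer[b] = c_mt / c_tt
    return rho, transfer

# Synthetic example (NOT manuscript data): truth plus a scaled, noisy recovery.
rng = np.random.default_rng(0)
truth = rng.standard_normal((128, 128))
recovered = 0.5 * truth + 0.5 * rng.standard_normal((128, 128))
edges = np.geomspace(500.0, 6000.0, 25)
rho, transfer = binned_rho_and_transfer(recovered, truth, np.deg2rad(1.0 / 60.0), edges)
print(np.round(rho, 2))
print(np.round(transfer, 2))
```

A map with excess residual power but unbiased signal recovery would show Tℓ near 1 with ρℓ below 1, whereas a suppressed signal would show Tℓ below 1; this is the distinction the map-space metrics cannot make.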
-
Beam handling and units are not fully specified in key multi-frequency equations and constraints. As written, Eq. (12) and the SED-difference initialisations imply direct subtraction across bands without explicit beam matching, despite significantly different beams (Table 1). Similarly, constraints (9)–(10) compare aν ŷ + cν to dν without explicit beam operators, and the unit/meaning of the optimisation variable s is ambiguous (Compton-y vs µK_CMB at 150 GHz), which also interacts with the ILC SED constraint normalisation ambiguity in Eq. (6) (Sec. 2.3, Sec. 3.1–3.4).
Recommendation: Make the forward model explicit with beam operators (e.g., dν = Bν⋆(aν y + …)+nν), then state clearly which objects are beam-equalised and at what stage (pre-smoothing all channels vs embedding Bν in the constraints). Resolve the ILC SED normalisation ambiguity by explicitly defining the SED vector used in Eq. (6) (e.g., a := a_tSZ/a_tSZ(150) so w^T a = 1, or else use w^T a = a150 and adjust the closed form accordingly). Finally, state explicitly whether s denotes (a) ŷ (dimensionless), (b) the 150 GHz tSZ temperature in µK_CMB, or (c) a 150-normalised amplitude, and adjust Eq. (12) and the aν notation to be dimensionally consistent. Include an explicit statement of beam convention for STsep comparisons (why truth is smoothed to B_eff rather than the 150 GHz beam), and provide the distribution of B_eff (e.g., FWHM across patches) in an Appendix to interpret high-ℓ behaviour.
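The practical consequence of the SED normalisation can be shown in a few lines. The sketch below assumes that Eq. (6) is the standard constrained minimum-variance form w = C⁻¹a / (aᵀC⁻¹a), which enforces wᵀa = 1, and uses the non-relativistic tSZ spectral function f(x) = x coth(x/2) − 4. The three frequencies and the random covariance are illustrative placeholders only.

```python
import numpy as np

T_CMB_K = 2.7255
H_OVER_K_K_PER_GHZ = 0.0479924  # Planck constant over Boltzmann constant, in K/GHz

def tsz_sed(freq_ghz):
    """Non-relativistic tSZ spectral function f(x) = x*coth(x/2) - 4 (Delta T / T_CMB per unit y)."""
    x = H_OVER_K_K_PER_GHZ * np.asarray(freq_ghz, dtype=float) / T_CMB_K
    return x / np.tanh(x / 2.0) - 4.0

def ilc_weights(cov, a):
    """Minimum-variance weights with unit response to the SED vector a (w @ a == 1)."""
    cinv_a = np.linalg.solve(cov, a)
    return cinv_a / (a @ cinv_a)

# Illustrative channels and a random symmetric positive-definite covariance (assumptions).
freqs_ghz = np.array([90.0, 150.0, 217.0])
rng = np.random.default_rng(0)
mix = rng.standard_normal((3, 3))
cov = mix @ mix.T + 3.0 * np.eye(3)

a_y = T_CMB_K * 1e6 * tsz_sed(freqs_ghz)  # response in uK_CMB per unit Compton-y
a_150 = a_y / a_y[1]                      # same SED, normalised to 1 at 150 GHz

w_y = ilc_weights(cov, a_y)      # output map is an estimate of y (dimensionless)
w_150 = ilc_weights(cov, a_150)  # output map is the 150 GHz tSZ temperature in uK_CMB

print(w_y @ a_y, w_150 @ a_150)
print(np.allclose(w_150, a_y[1] * w_y))
```

The two weight vectors differ by exactly the factor a_y(150), so the two conventions produce maps in different units from the same data. This is why the manuscript needs to state which one defines s.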
-
STsep optimisation stability and hyperparameter dependence are acknowledged (including “catastrophic divergence” without priors) but not systematically quantified, limiting reproducibility and transfer to other noise regimes (Sec. 3.3, Sec. 4.6).
Recommendation: Provide a compact sensitivity analysis (Sec. 4.6 or Appendix): vary λc, λa, learning rate, number of steps, and (optionally) N_ens batching, and report how RMS ratio, D_ST, KS, and tail recovery respond. Highlight stable ranges and failure modes. If full scans are too costly, include a small set of representative alternate configurations (e.g., ×0.5 and ×2 for key weights, shorter/longer runs) to demonstrate that conclusions do not hinge on a narrowly tuned setting.
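A reduced scan of the kind suggested above can be generated mechanically. The sketch below builds a one-at-a-time set of configurations around a baseline; the hyperparameter names and baseline values are hypothetical placeholders, not the manuscript's settings, and `run_stsep` in the closing comment stands in for whatever entry point the authors' code exposes.

```python
def one_at_a_time_configs(base, factors=(0.5, 2.0)):
    """Return (label, config) pairs that vary one hyperparameter at a time."""
    configs = [("baseline", dict(base))]
    for name, value in base.items():
        for factor in factors:
            cfg = dict(base)
            # Preserve the original type so integer step counts stay integers.
            cfg[name] = type(value)(value * factor)
            configs.append((f"{name} x{factor:g}", cfg))
    return configs

# Placeholder baseline values (assumptions for illustration only).
baseline = {"lambda_c": 1.0, "lambda_a": 1.0, "learning_rate": 1.0, "n_steps": 100}

for label, cfg in one_at_a_time_configs(baseline):
    print(label, cfg)
    # metrics = run_stsep(**cfg)  # hypothetical call; record RMS ratio, D_ST, KS, tails
```

For four hyperparameters and two factors this gives nine runs per patch, which should be affordable even if a full grid is not.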
-
FoCUS is positioned as a methodological contribution (Sec. 3.4) but its negative/marginal result is under-diagnosed: there is limited exploration of λ, ST statistic choices, or frequency-pair selection, and it is unclear whether FoCUS is inherently weak under realistic noise or simply under-tuned/under-specified.
Recommendation: Either (a) add a compact FoCUS characterisation (Sec. 4.7): scan λ over ~10⁻⁴–1, report the update size ‖s_FoCUS−s_ILC‖/‖s_ILC‖ and changes in r/RMS/D_ST/KS/tails, and try at least one alternative residual choice (e.g., combining Δ90,217 and Δ150,217 or another pair motivated by CIB/tSZ contrast); or (b) explicitly reframe FoCUS as an exploratory proof-of-concept/negative result, shorten Sec. 3.4 accordingly, and tone down claims in Sec. 1 and Sec. 6 to match the demonstrated utility.
-
The ST anomaly diagnostic is highlighted as a key contribution (Sec. 3.5, Fig. 3, Sec. 6), but it is currently qualitative and partially supervised (learned direction d̂ from training tiles). Its practical meaning, calibration, and robustness to noise/beam changes are not quantified, making it hard to evaluate beyond visualisation.
Recommendation: Add a quantitative evaluation in Sec. 5.3 or Appendix: define ground-truth “cluster” masks (halo catalogue or truth |y| thresholds) and compute ROC/PR curves and AUC for anomaly-score detection of cluster regions, comparing anomaly maps derived from ILC vs STsep outputs. Clarify whether the diagnostic uses the same training/priors as STsep and discuss any circularity. If such analysis is infeasible, explicitly downgrade claims in Sec. 3.5 and Sec. 6 to describe the diagnostic as exploratory/qualitative and defer calibration to future work.
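The ROC/AUC evaluation requires little machinery. The sketch below computes an ROC curve and its area from a per-pixel (or per-tile) anomaly score and a binary truth mask using NumPy only; the function name and the synthetic inputs are illustrative, tied scores are not treated specially, and `sklearn.metrics.roc_auc_score` would be a reasonable drop-in if scikit-learn is available.

```python
import numpy as np

def roc_auc(score, mask):
    """ROC curve and AUC for detecting mask==True from a continuous score.

    Returns (fpr, tpr, auc). The mask must contain both True and False entries.
    """
    s = np.asarray(score, dtype=float).ravel()
    y = np.asarray(mask, dtype=bool).ravel()
    order = np.argsort(-s, kind="stable")  # sweep the threshold from high to low
    y_sorted = y[order]
    tpr = np.concatenate([[0.0], np.cumsum(y_sorted) / y_sorted.sum()])
    fpr = np.concatenate([[0.0], np.cumsum(~y_sorted) / (~y_sorted).sum()])
    # Trapezoidal area under the (fpr, tpr) curve.
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
    return fpr, tpr, auc

# Synthetic example (NOT manuscript data): 5% "cluster" pixels with a shifted score.
rng = np.random.default_rng(0)
truth_mask = rng.random((64, 64)) < 0.05
anomaly = rng.standard_normal((64, 64)) + 1.5 * truth_mask
print(roc_auc(anomaly, truth_mask)[2])
```

Running the same function on anomaly maps derived from the ILC and STsep outputs, with an identical truth mask, would give the direct comparison requested.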
-
Several implementation and reproducibility details remain implicit (ST configuration, contamination ensemble construction/batching, spectra details), which may prevent independent re-implementation without access to code (Sec. 2.2–2.3, 3.1–3.3, 4.1).
Recommendation: Add an “implementation checklist” (Appendix is fine) specifying: the full ST/ScatCov configuration (orders S1–S4 used, scales/orientations, any orientation averaging, normalisation/self-normalisation); which coefficients enter Φ in Eq. (11) vs FoCUS Eq. (13) vs D_ST; contamination ensemble selection and normalisation across frequencies (including whether noise is added and how); batching over N_ens during optimisation (batch size, whether ensemble moments are recomputed per step); details of map apodisation/pixel window in power spectra; and explicit ℓ-bin edges/centres for “24 log-spaced bins over 500≤ℓ≤6000.” If code will be released, provide a repository URL and version/commit; otherwise provide enough detail to replicate results.
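As an example of why the binning needs to be tabulated: one natural reading of “24 log-spaced bins over 500≤ℓ≤6000” is the following, but other conventions (integer-rounded edges, arithmetic rather than geometric centres) would give different numbers. The geometric-centre choice here is an assumption made for illustration.

```python
import numpy as np

# 24 logarithmically spaced bins need 25 edges between ell = 500 and ell = 6000.
ell_edges = np.geomspace(500.0, 6000.0, 25)
# Geometric bin centres (an assumed convention; arithmetic centres are also common).
ell_centres = np.sqrt(ell_edges[:-1] * ell_edges[1:])

for lo, hi, centre in zip(ell_edges[:-1], ell_edges[1:], ell_centres):
    print(f"{lo:8.1f} {hi:8.1f} {centre:8.1f}")
```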