-
Simulation suite, merger-tree construction, and cosmology sampling are under-specified, preventing assessment of realism/generalization and blocking reproducibility. Sec. 2.1 and Sec. 3.1 state “40 cosmologies, 25 trees each” across $\Omega_m$ and $\sigma_8$ ranges, but do not provide the simulation code/suite, box size, particle number, mass/force resolution, snapshot cadence/redshifts, halo finder, tree-building algorithm, halo selection criteria (mass thresholds and whether final-halo masses are fixed/narrow-binned), or how the 40 cosmologies were sampled (grid vs Latin hypercube vs random). It is also unclear whether the 25 trees per cosmology are statistically independent or share volume-specific artifacts.
Recommendation: Expand Sec. 2.1 (and cross-reference in Sec. 3.1) with a compact but complete dataset table: (i) simulation suite name and citation (or code + IC generator), (ii) box size, $N_{\rm particles}$, particle mass, force softening, (iii) snapshot list/redshift range used for trees, (iv) halo finder and merger-tree builder (with versions/settings), (v) halo selection: final redshift, mass definition (e.g., $M_{200c}$), mass bin/range, and how trees are chosen (random? top-N by mass? per-environment?), and (vi) cosmology design: how 40 $(\Omega_m, \sigma_8)$ pairs are sampled and what prior bounds are. If trees come from shared volumes, describe how you ensure evaluation is not driven by volume/simulation-ID artifacts.
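To illustrate item (vi), the sketch below draws a 40-point Latin hypercube design over $(\Omega_m, \sigma_8)$ using only the Python standard library. The prior bounds, the seed, and the function name `latin_hypercube` are placeholders chosen for illustration; they are not values taken from the manuscript.

```python
import random

def latin_hypercube(n_samples, bounds, seed=0):
    """Draw n_samples points with exactly one point per stratum on each axis."""
    rng = random.Random(seed)
    columns = []
    for low, high in bounds:
        # One uniform draw inside each of n_samples equal-width strata on [0, 1).
        strata = [(i + rng.random()) / n_samples for i in range(n_samples)]
        rng.shuffle(strata)  # decouple the stratum ordering between axes
        columns.append([low + u * (high - low) for u in strata])
    return list(zip(*columns))

# Hypothetical prior bounds (assumption, for illustration only).
BOUNDS = [(0.1, 0.5),   # Omega_m
          (0.6, 1.0)]   # sigma_8

design = latin_hypercube(40, BOUNDS, seed=42)
```

Reporting the design in this form (sampler, bounds, seed) would let readers judge how evenly the 40 cosmologies cover the prior volume.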
-
The concentration-resampling augmentation is not convincingly motivated as “assembly-bias mitigation” and is not empirically validated; it may also create physically inconsistent/out-of-distribution trees. Sec. 2.3 and Sec. 3.3 rely on a mass–concentration (M–C) fit with extremely low $R^2$ ($\approx 0.0036$) and large scatter, yet the method is described as an assembly-bias correction. Resampling concentration independently at each node, conditional on mass (and, ambiguously, on scale factor), risks destroying temporal coherence along the main progenitor branch (MPB) and correlations with other node properties (e.g., $V_{\max}$), potentially altering the cosmological signal and/or introducing artifacts. The paper does not show a with/without augmentation comparison for (i) inference metrics, (ii) sensitivity to assembly-bias proxies, or (iii) physical consistency checks.
Recommendation: Reframe the augmentation as an exploratory robustness test unless you can demonstrate a measurable benefit. Concretely: (i) clarify whether the M–C relation is fit globally or per scale-factor bin (resolve Sec. 2.3 vs Sec. 3.3 inconsistency), and report fit parameters + scatter per bin if applicable; (ii) add ablations: train/evaluate the full pipeline with and without augmentation, reporting $\Omega_m/\sigma_8$ RMSE and coverage (Sec. 3.5–3.6); (iii) add robustness stratification: evaluate errors/coverage across quantiles of assembly-bias proxies identified in Sec. 3.2.3 (e.g., conc_scatter_final_snapshot, $\delta\log_{10}c_{\rm sat-cen}$, nummajor_mergers_mpb), comparing augmented vs non-augmented; (iv) add physical consistency checks pre/post augmentation (distributions of $C$, $C$–$V_{\max}$ relation, and correlation of concentration along MPB). If gains are not clear, tone down claims in the Abstract, Sec. 1, Sec. 3.6, and Sec. 4.
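As a sketch of the MPB coherence check in item (iv), the code below compares the lag-1 autocorrelation of concentration along a branch before and after independent resampling. The data are entirely synthetic: an AR(1) series stands in for log-concentration along the MPB, and shuffling stands in for per-node resampling (it keeps the marginal distribution but discards the time ordering). Neither is the authors' actual data or augmentation operator.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def lag1_autocorr(series):
    """Correlation between consecutive snapshots along one branch."""
    return pearson(series[:-1], series[1:])

rng = random.Random(0)

# Synthetic stand-in for log-concentration along a main progenitor branch:
# an AR(1) process with strong snapshot-to-snapshot coherence (rho = 0.9).
# The series is deliberately long so the estimate is stable.
rho, n_steps = 0.9, 2000
log_c = [rng.gauss(0.0, 1.0)]
for _ in range(n_steps - 1):
    log_c.append(rho * log_c[-1] + math.sqrt(1 - rho**2) * rng.gauss(0.0, 1.0))

# Stand-in for independent per-node resampling: same values, random order.
resampled = log_c[:]
rng.shuffle(resampled)

coherence_before = lag1_autocorr(log_c)
coherence_after = lag1_autocorr(resampled)
```

On real trees, a comparable drop in `coherence_after` relative to `coherence_before` would indicate that the augmentation removes temporal structure the GNN might otherwise use.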
-
Key baselines and ablations are missing, so the added value of (a) contrastive learning, (b) the GNN, and (c) augmentation cannot be isolated. Sec. 3.2 shows that engineered global features correlate strongly with $\Omega_m$ (and more weakly with $\sigma_8$), but there is no inference baseline using these features. There is also no comparison to a supervised GNN trained to regress $(\Omega_m, \sigma_8)$ directly, nor to a contrastive variant better aligned with continuous labels. Without these baselines, it is unclear whether the proposed method materially improves over simpler summaries or whether the weak $\sigma_8$ constraints are a representation-learning artifact.
Recommendation: Add a baseline/ablation block in Sec. 3.5 (or a dedicated subsection): (i) engineered-features $\rightarrow$ SNPE baseline (all 35 features as summaries; Sec. 3.2); (ii) a reduced physically motivated feature subset baseline (e.g., early-time MPB concentration features plus formation time and merger counts; Sec. 3.2.3); (iii) supervised GNN regression baseline (predict $\Omega_m$, $\sigma_8$ with MSE; use penultimate embeddings as SNPE summaries, or directly compare regression performance); and (iv) augmentation ablation (on/off). Report the same RMSE and coverage metrics used for the main model, emphasizing $\sigma_8$.
-
The contrastive setup uses discrete cosmology IDs as positives/negatives (Sec. 2.4) despite continuous targets $(\Omega_m, \sigma_8)$, and it may encourage shortcut learning via non-cosmological cues. Defining positives as “any other tree from the same cosmology” can let the model exploit simulation- or selection-specific artifacts constant within cosmology (e.g., tree size distributions, mass-range differences, resolution effects, volume/seed effects) rather than cosmology-dependent physics. The current formulation also does not enforce smoothness or metric structure with respect to continuous $(\Omega_m, \sigma_8)$, which may contribute to poor $\sigma_8$ performance (Sec. 3.5–3.6).
Recommendation: First, fully specify the positive/negative construction (Sec. 2.4): are positives only two augmentations of the same tree, or also different trees sharing the same cosmology? Then add diagnostics to detect shortcuts: (i) correlate embeddings with simple structural properties (node count, max depth, final mass) and report whether these predict cosmology; (ii) provide kNN cosmology classification accuracy in embedding space on held-out cosmologies to test generalization; (iii) if final masses vary, explicitly control for mass (e.g., narrow mass bin, reweighting, or conditioning on mass). Finally, consider a continuous-label-aware objective: supervised contrastive loss weighted by distance in $(\Omega_m, \sigma_8)$, and/or an auxiliary regression head jointly trained with contrastive loss. Reassess $\sigma_8$ constraints after this change (Sec. 3.5).
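One possible form of the continuous-label-aware objective suggested above is sketched below: each anchor's contrastive targets are weighted by a Gaussian kernel in $(\Omega_m, \sigma_8)$, so that nearby cosmologies act as soft positives. The function name, the kernel choice, the bandwidth, and the toy batch are all assumptions made for illustration; this is not the loss used in the manuscript.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def soft_contrastive_loss(embeddings, params, tau=0.1, bandwidth=0.05):
    """Contrastive loss with targets weighted by distance in parameter space."""
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        logits = [cosine(embeddings[i], embeddings[j]) / tau for j in others]
        # Numerically stable log-normaliser over all other batch members.
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        # Gaussian kernel weights in (Omega_m, sigma_8): close cosmologies
        # are treated as soft positives, distant ones as negatives.
        weights = [
            math.exp(-sum((a - b) ** 2 for a, b in zip(params[i], params[j]))
                     / (2.0 * bandwidth ** 2))
            for j in others
        ]
        w_sum = sum(weights)
        if w_sum == 0.0:  # all weights underflowed; anchor has no soft positives
            continue
        total -= sum(w / w_sum * (l - log_norm) for w, l in zip(weights, logits))
    return total / n

# Toy batch: two pairs of nearby cosmologies (hypothetical values).
params = [(0.30, 0.80), (0.31, 0.80), (0.40, 0.90), (0.41, 0.90)]
aligned = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
misaligned = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]

loss_aligned = soft_contrastive_loss(aligned, params)
loss_misaligned = soft_contrastive_loss(misaligned, params)
```

In this toy batch the loss is lower when embedding similarity follows closeness in $(\Omega_m, \sigma_8)$, which is the metric structure the discrete cosmology-ID formulation does not enforce.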
-
Inference calibration assessment is incomplete without simulation-based calibration (SBC; Sec. 2.6), and current claims rely on limited coverage estimates that are insufficient to establish posterior validity, especially given $\sigma_8$ undercoverage and $\Omega_m$ overcoverage. Sec. 3.5–3.6 report coverage on a small test set (trees from 6 cosmologies) and at a single credible level (90%), but do not provide multi-level coverage curves, probability integral transform (PIT) diagnostics, or SBC rank histograms. It is also unclear how intervals are computed (central vs HPD) and how aggregation across trees/cosmologies is done.
Recommendation: Either complete SBC as proposed in Sec. 2.6 or substantially qualify calibration statements throughout (Sec. 3.5–3.6, Sec. 4.3–4.4). At minimum: (i) report coverage across multiple nominal levels (e.g., 50/68/90/95%) for each parameter; (ii) clarify interval type (HPD vs central) and aggregation (per-tree vs per-cosmology); (iii) add PIT histograms (or equivalent) for $\Omega_m$ and $\sigma_8$; (iv) if SBC remains blocked, explicitly describe the technical failure mode and what part of the pipeline prevents SBC.
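Items (i) and (iii) can be computed directly from posterior samples. The sketch below assumes central (equal-tailed) intervals and per-case aggregation; both are choices the authors would need to state, and the toy posterior is a placeholder rather than output from the manuscript's pipeline.

```python
def quantile(sorted_samples, q):
    """Linearly interpolated quantile of an already sorted list."""
    pos = q * (len(sorted_samples) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_samples) - 1)
    return sorted_samples[lo] + (pos - lo) * (sorted_samples[hi] - sorted_samples[lo])

def central_interval(samples, level):
    """Equal-tailed credible interval at the given nominal level."""
    s = sorted(samples)
    alpha = (1.0 - level) / 2.0
    return quantile(s, alpha), quantile(s, 1.0 - alpha)

def empirical_coverage(cases, levels):
    """cases: list of (posterior_samples, true_value) pairs for one parameter."""
    coverage = {}
    for level in levels:
        hits = 0
        for samples, truth in cases:
            low, high = central_interval(samples, level)
            hits += low <= truth <= high
        coverage[level] = hits / len(cases)
    return coverage

def pit_value(samples, truth):
    """Fraction of posterior samples below the truth (uniform if calibrated)."""
    return sum(s < truth for s in samples) / len(samples)

# Toy posterior and two hypothetical true values, for illustration only.
toy_posterior = [float(i) for i in range(101)]
cases = [(toy_posterior, 50.0), (toy_posterior, 97.0)]
coverage = empirical_coverage(cases, levels=(0.50, 0.68, 0.90, 0.95))
pit = [pit_value(samples, truth) for samples, truth in cases]
```

Plotting `coverage` against the nominal level for each parameter, together with a histogram of `pit` over all test cases, would make over- and undercoverage visible at a glance.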
-
Core methodological details are missing for both the contrastive GNN and SNPE, preventing reproduction and making it hard to diagnose $\sigma_8$ underperformance. For the GNN/contrastive training (Sec. 2.4, Sec. 3.4), the manuscript lacks the explicit NT-Xent formula and key hyperparameters (temperature $\tau$, similarity function, embedding normalization), batch composition (cosmologies per batch, samples per cosmology), optimizer and LR schedule, dropout/norm layers, epochs/early stopping, and the exact augmentation multiplicity $K$ actually used. For SNPE (Sec. 2.5, Sec. 3.5), prior specification, flow architecture details, rounds/simulations per round, training epochs, and evaluation protocol are insufficiently documented.
Recommendation: Add a concise hyperparameter/configuration table (main text or appendix) covering: (i) explicit NT-Xent loss definition, similarity metric, $\tau$, embedding normalization, batch construction, and how many positives per anchor; (ii) GNN architecture details (layer types, activations, normalization, dropout), optimizer/LR schedule, batch size, epochs, early stopping; (iii) the exact augmentation operator and $K$ used in the reported results; (iv) SNPE specifics: priors on $\Omega_m$ and $\sigma_8$, flow type and architecture (layers/hidden sizes/spline bins), number of rounds, training steps, and number of posterior samples used for metrics. Include random seeds and library versions if possible.
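For reference, the standard NT-Xent loss for a positive pair $(i, j)$ in a batch of $2N$ views, with cosine similarity, is given below. Whether the manuscript uses exactly this form, this similarity function, or more than one positive per anchor is what item (i) asks the authors to state.

$$
\ell_{i,j} = -\log \frac{\exp\!\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\!\left(\mathrm{sim}(z_i, z_k)/\tau\right)}, \qquad \mathrm{sim}(u, v) = \frac{u^{\top} v}{\lVert u \rVert \, \lVert v \rVert}
$$

Here $z_i$ is the embedding of view $i$ and $\tau$ is the temperature. With several positives per anchor (e.g., all trees from the same cosmology), the loss is typically averaged over the positive set, and that averaging convention should be reported as well.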
-
Normalization leakage: figures and/or preprocessing appear to use normalization statistics computed over the full dataset (including validation/test), which constitutes information leakage and can bias both representation learning and reported analyses (Figures 1–3 and related text).
Recommendation: Recompute all scalers (means/stds) using training data only, apply the same transform to validation/test, and regenerate affected figures/metrics. State this explicitly in Sec. 2.1/Sec. 3.1. If results change, report the updated performance numbers in Sec. 3.5–3.6.
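A minimal sketch of the requested train-only standardization follows. The function names and toy values are illustrative assumptions; the point is only that the mean and standard deviation are estimated from the training split and then reused unchanged for validation/test.

```python
import math

def fit_scaler(train_values):
    """Return (mean, std) estimated from training values only."""
    mean = sum(train_values) / len(train_values)
    var = sum((v - mean) ** 2 for v in train_values) / len(train_values)
    return mean, math.sqrt(var)

def transform(values, mean, std):
    """Standardize values with previously fitted statistics."""
    return [(v - mean) / std for v in values]

# Toy feature values (hypothetical), split before any statistics are computed.
train_feature = [1.0, 2.0, 3.0]
test_feature = [4.0]

mean, std = fit_scaler(train_feature)            # fitted on training data only
train_scaled = transform(train_feature, mean, std)
test_scaled = transform(test_feature, mean, std)  # reuses the training statistics
```

The same pattern applies per feature and per node attribute; the held-out cosmologies never contribute to `mean` or `std`.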