-
The contrastive training objective is internally inconsistent and is likely mis-specified, mis-logged, or both. In Sec. 2.5.3, the denominator of Eq. (1) contains a term written as $\operatorname{sim}(z_k)$, which takes a single argument and therefore omits the anchor–negative similarity $\operatorname{sim}(z_i, z_{n,k})$. Sec. 3.2 also reports negative validation “InfoNCE losses” (e.g., $-3.8$), even though the defined loss $L = -\log\left( \frac{\exp(\text{pos})}{\exp(\text{pos}) + \sum_k \exp(\text{neg}_k)} \right)$ is non-negative by construction, because the argument of the logarithm lies in $(0, 1]$. This is a critical correctness and reproducibility problem that undermines confidence in the learned embeddings and all downstream claims (Sec. 2.5.3–2.6, 3.2).
Recommendation: Audit the implementation and reporting of the contrastive objective end-to-end. In Sec. 2.5.3–2.5.4, rewrite Eq. (1) with correct anchor–positive and anchor–negative similarities and explicitly state: (i) whether the positive is included in the denominator; (ii) whether a symmetric loss is used; (iii) reduction/averaging conventions over anchors and (if applicable) multiple positives. In Sec. 3.2, clearly define what scalar is plotted/reported as “validation loss” (loss vs log-likelihood vs negative loss). After correction, rerun training and provide train/validation curves that behave consistently with the stated definition. Update all quantitative results and figures that depend on these embeddings (Sec. 3.2–3.5) and revise the Discussion (Sec. 3.6–3.7) accordingly.
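The non-negativity argument can be checked numerically. The following is a minimal sketch of the loss as written above, not the authors' implementation. It assumes cosine similarity, a temperature `tau` (a hypothetical value; the manuscript's setting is not restated here), and the positive included in the denominator.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, tau=0.1):
    """-log( exp(s_pos/tau) / (exp(s_pos/tau) + sum_k exp(s_neg_k/tau)) ).

    Assumption: the positive appears in the denominator, so the ratio
    is at most 1 and the loss cannot be negative.
    """
    logits = [cosine(anchor, positive) / tau]
    logits += [cosine(anchor, n) / tau for n in negatives]
    m = max(logits)  # shift for a numerically stable log-sum-exp
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]

anchor = [1.0, 0.0]
easy = info_nce(anchor, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
hard = info_nce(anchor, [-1.0, 0.0], [[1.0, 0.0]])
print(easy, hard)
```

Under these assumptions the value is small when the positive dominates and large when a negative dominates, but it never drops below zero. A logged value such as $-3.8$ therefore points to a different quantity being reported (for example, a log-likelihood or a loss with the positive excluded from the denominator).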
-
The encoder is trained for only two epochs and explicitly not converged (Sec. 2.6, 3.2, 3.6–3.7), yet all main results ($R^2$, similarity distributions, feature importance, PCA/t-SNE) are drawn from this undertrained state. This makes outcomes potentially unstable with respect to initialization, batching, and early-training dynamics, and makes it difficult to interpret the reported strong regression performance as a property of the intended method rather than an artifact.
Recommendation: Train the contrastive encoder to a clear convergence criterion (e.g., stabilization of validation contrastive loss and/or a proxy validation metric such as linear/MLP probe performance on validation embeddings). Report full learning curves (train/val) over substantially more epochs and across multiple random seeds. Recompute and report all downstream metrics (Sec. 3.3), similarity-distribution diagnostics (Sec. 3.2), feature-importance analyses (Sec. 3.4), and embedding visualizations (Sec. 3.5) using the converged encoder; optionally include 2-epoch results only as an explicit ablation/reference point.
-
The baselines are insufficient to demonstrate that hierarchical contrastive learning (and topology-aware message passing) is responsible for the performance. Given the surprisingly strong regression after minimal training, it is plausible that simple correlates (e.g., aggregated node-feature statistics) or even random/frozen encoders could produce comparable results, or that topology is not contributing substantially (Sec. 3.3, 3.6).
Recommendation: Add a compact but decisive baseline suite evaluated on the same simulation-level splits (Sec. 2.3.2, 3.1): (a) random-initialized GraphSAGE encoder (frozen) + same pooling + same regressor; (b) feature-only baselines using hand-crafted summaries (mean/variance/max of node features; tree size; depth; mass-function-like summaries) with MLP/RF; (c) a supervised end-to-end GNN regressor trained directly on ($\Omega_m$, $\sigma_8$) without contrastive pretraining; and (d) a topology ablation such as shuffled edges (or an MLP applied per node with pooling, no message passing) to quantify how much graph structure matters. Report test $R^2$/MAE (Sec. 3.3), plus computational cost. Use these results to substantiate (or appropriately temper) the claimed benefits of the proposed approach (Sec. 1, 3.6, 4).
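As an illustration of baseline (b), a hand-crafted summary vector per tree could be as simple as the sketch below. The function name `tree_summary` and the input layout (one list of feature values per node) are assumptions made for this example; depth and mass-function-like summaries would need the edge list and are omitted.

```python
import statistics

def tree_summary(node_features):
    """Return [n_nodes] followed by (mean, population variance, max) per feature.

    node_features: list of per-node feature lists, all of equal length.
    No graph topology is used, which is the point of this baseline.
    """
    columns = list(zip(*node_features))
    summary = [float(len(node_features))]
    for column in columns:
        summary += [
            statistics.fmean(column),
            statistics.pvariance(column),
            max(column),
        ]
    return summary

# Toy tree with two nodes and two features per node.
print(tree_summary([[1.0, 10.0], [3.0, 14.0]]))
```

Feeding such vectors to the same MLP or random-forest regressor, on the same simulation-level splits, would show how much of the reported performance is recoverable without any learned encoder.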
-
The current positive/negative construction is label-based (identical cosmology parameters define positives) and is thus closer to supervised contrastive learning than to self-supervised learning (Sec. 2.5.1). In addition, because positives are defined by exact equality of ($\Omega_m$, $\sigma_8$), the method may primarily learn to cluster by simulation/cosmology class rather than learn a representation that supports smooth interpolation in continuous parameter space, even though the downstream task is continuous regression (Sec. 2.7, 3.3, 3.5).
Recommendation: Clarify terminology and intent: explicitly describe the approach as label-/group-supervised contrastive learning (or justify “self-supervised” usage). Then evaluate continuous generalization more directly: (i) structured splits in $\Omega_m$–$\sigma_8$ space (e.g., leave-out a region/corner; train on low $\Omega_m$ and test on high $\Omega_m$) rather than only a random holdout of 5 simulations (Sec. 3.1, 3.3); (ii) error/residual analysis as a function of parameter-space location (Sec. 3.3–3.6); (iii) optionally, embedding-space smoothness diagnostics such as correlation between embedding distances and parameter distances or kNN regression in embedding space. State clearly whether extrapolation beyond the sampled parameter range is supported (Sec. 3.6–3.7).
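One possible form of the smoothness diagnostic in (iii) is a rank correlation between pairwise embedding distances and pairwise parameter distances. The sketch below is illustrative only: it assumes Euclidean distance in both spaces and unscaled parameters, and all function names are hypothetical.

```python
import math
from itertools import combinations

def euclid(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def ranks(xs):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = average
        i = j + 1
    return r

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def distance_rank_correlation(embeddings, params):
    """Spearman correlation between pairwise distances in the two spaces."""
    pairs = list(combinations(range(len(embeddings)), 2))
    d_emb = [euclid(embeddings[i], embeddings[j]) for i, j in pairs]
    d_par = [euclid(params[i], params[j]) for i, j in pairs]
    return pearson(ranks(d_emb), ranks(d_par))
```

A value near 1 would indicate that embedding geometry tracks parameter-space geometry; a value near 0, combined with tight per-cosmology clusters, would support the concern that the encoder learns class identity rather than a smooth map. Because pairwise distances are not independent samples, significance should be assessed with a permutation scheme rather than a standard p-value.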
-
The hierarchical/node-level contrastive component appears ineffective and under-justified. Node-level positives are formed by randomly sampling nodes from different trees with the same cosmology (Sec. 2.5.1), but Sec. 3.2 shows strong overlap between node-level positive and negative similarity distributions even after training. This raises the possibility that the node-level loss contributes little or adds noise, and the paper currently does not quantify its value relative to a graph-only objective.
Recommendation: Quantify the contribution of node-level contrastive learning via ablations: graph-only ($\alpha=0$), node-only ($\alpha=1$), and several intermediate $\alpha$ values in the combined loss (Sec. 2.5.4), all trained to convergence. For each, report: node- and graph-level similarity separation metrics (not only histograms), downstream regression performance (Sec. 3.3), and qualitative embedding plots (Sec. 3.5). If the node-level contrastive term is retained, revise the node-positive definition to enforce semantic correspondence (e.g., matching by scale-factor bin, mass percentile, depth-from-root, main-branch nodes) and document the exact sampling algorithm (Sec. 2.5.1). If it does not help, simplify the method and reframe the contribution accordingly (Sec. 3.6, 4).
-
Adaptive hard-negative sampling is introduced as a key design choice (Sec. 2.5.2) but is not convincingly justified or validated: the similarity window $[0.2, 0.6]$ and $K_{\rm neg}$ appear ad hoc, and it is unclear how many eligible negatives exist during training and whether the scheme remains stable as embeddings evolve.
Recommendation: Provide a focused ablation where only the negative sampling strategy changes: (i) adaptive windowed hard negatives (current method), (ii) random in-batch negatives, (iii) all valid in-batch negatives, and (optionally) a true “hardest negatives” variant. Report contrastive learning curves, fraction of eligible negatives over training, and downstream regression metrics (Sec. 3.2–3.3). Explain how the window and $K_{\rm neg}$ were chosen (validation sweep or heuristic), and include these settings in a consolidated hyperparameter/config table (Sec. 2.9 or Appendix).
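The "fraction of eligible negatives" diagnostic could be logged per batch along the following lines. This is a sketch under stated assumptions: the window is treated as closed at both ends, negatives are all in-batch samples with a different cosmology label, and the default `k_neg=4` is a placeholder, not the manuscript's value.

```python
def window_diagnostics(sim, labels, lo=0.2, hi=0.6, k_neg=4):
    """Summarize how well the similarity window is populated in one batch.

    sim:    square matrix (list of lists) of pairwise similarities.
    labels: cosmology label per sample; different label = valid negative.
    Returns (mean fraction of valid negatives inside [lo, hi],
             fraction of anchors with fewer than k_neg eligible negatives).
    """
    n = len(labels)
    fractions = []
    starved = 0
    for i in range(n):
        negatives = [sim[i][j] for j in range(n) if labels[j] != labels[i]]
        eligible = [s for s in negatives if lo <= s <= hi]
        if negatives:
            fractions.append(len(eligible) / len(negatives))
        if len(eligible) < k_neg:
            starved += 1
    mean_fraction = sum(fractions) / len(fractions) if fractions else float("nan")
    return mean_fraction, starved / n
```

Plotting both returned quantities against training step would show whether the window empties out as embeddings separate, in which case the sampler silently degrades to a fallback that should be documented.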
-
Dataset provenance and simulation/merger-tree construction details are insufficient to assess physical representativeness and external validity. Sec. 2.1 and 3.1 cite Parkinson et al. (2007) and Jiang & van den Bosch (2013), but do not clearly state which simulation suite produced the trees, how halos were identified, snapshot spacing, mass resolution, box size, whether graphs are directed/undirected in the GNN, and how the 25 trees per cosmology were selected (random halos? fixed mass bin?). This makes it hard to interpret what regimes the reported results apply to and whether there are hidden confounds.
Recommendation: Expand Sec. 2.1, 2.2.2, and 3.1 with a concrete, reproducible dataset description: simulation suite(s), box size, particle mass, halo finder, tree builder, snapshot/redshift grid, selection criteria for the 25 trees per cosmology, and cosmology sampling scheme/ranges for $\Omega_m$ and $\sigma_8$ (grid vs Latin hypercube, etc.). Include a plot of $\Omega_m$–$\sigma_8$ coverage and a short statement about how this range relates to observational constraints (and that extrapolation is not validated). Also specify whether edges are treated as directed or converted to undirected for GraphSAGE (Sec. 2.4).
-
Regression evaluation is currently limited to aggregate test metrics and a small number of held-out cosmologies (5 simulations). For cosmological inference, it is important to quantify uncertainty, variability across splits/seeds, and performance heterogeneity across parameter space (Sec. 3.3, 3.6).
Recommendation: In Sec. 3.3, add: (i) per-test-cosmology metrics (one point per held-out simulation) and residual/bias plots; (ii) binned error analysis across $\Omega_m$ and $\sigma_8$ (1D bins or a 2D grid); (iii) uncertainty estimates via multiple random seeds and preferably multiple simulation-level splits (cross-validation across the 40 cosmologies). If some systematic biases appear (e.g., at high $\sigma_8$), quantify them and incorporate into the limitations (Sec. 3.6–3.7).
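Items (i) and (iii) can be organized as in the sketch below, which builds simulation-level folds and per-simulation bias and MAE. The helper names are hypothetical, and the choice of 8 folds (5 held-out cosmologies each, given 40 cosmologies) is an assumption that mirrors the current holdout size.

```python
import random
from collections import defaultdict

def simulation_level_folds(sim_ids, n_splits=8, seed=0):
    """Yield (train_idx, test_idx); each simulation falls entirely on one side."""
    unique = sorted(set(sim_ids))
    random.Random(seed).shuffle(unique)
    for k in range(n_splits):
        held_out = set(unique[k::n_splits])
        test = [i for i, s in enumerate(sim_ids) if s in held_out]
        train = [i for i, s in enumerate(sim_ids) if s not in held_out]
        yield train, test

def per_simulation_errors(sim_ids, y_true, y_pred):
    """Return {sim_id: (bias, MAE)} for a single target parameter."""
    residuals = defaultdict(list)
    for s, t, p in zip(sim_ids, y_true, y_pred):
        residuals[s].append(p - t)
    return {
        s: (sum(r) / len(r), sum(abs(x) for x in r) / len(r))
        for s, r in residuals.items()
    }

# 40 cosmologies with 25 trees each, as described in the manuscript.
sim_ids = [s for s in range(40) for _ in range(25)]
for train_idx, test_idx in simulation_level_folds(sim_ids):
    pass  # fit the encoder and regressor on train_idx, evaluate on test_idx
```

Reporting the per-simulation bias against the true $\Omega_m$ and $\sigma_8$ of each held-out cosmology, aggregated over folds and seeds, would give one point per cosmology and make systematic trends visible.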
-
Presentation/consistency issues significantly impede verification and actionability: several figures/tables use placeholders (“illustrative values”, “Figures ??”), figure numbering/cross-references appear broken, and some plots lack normalization/sample sizes and quantitative separation measures (Sec. 2.2.1–2.2.3, 3.2–3.7).
Recommendation: Remove or relocate “illustrative” tables (Tables 1–3 in Sec. 2.2.1–2.2.3) unless replaced with real dataset statistics. Fix all figure labels/references using consistent \label/\ref (Sec. 3.2–3.7). For similarity-distribution plots (Sec. 3.2), normalize histograms (probability density), report $N$ for each class, specify binning, and add quantitative separation metrics (e.g., AUROC, KS distance, overlap coefficient). Ensure colorblind-safe palettes and readable typography throughout.
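The three separation metrics named above can be computed directly from the positive-pair and negative-pair similarity samples. The following is a minimal reference sketch, assuming cosine similarities in $[-1, 1]$ and a shared set of equal-width bins for the overlap coefficient; the bin count is an arbitrary choice for illustration.

```python
def auroc(pos, neg):
    """P(pos > neg) + 0.5 * P(pos == neg), computed over all pairs."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ks_distance(pos, neg):
    """Maximum absolute difference between the two empirical CDFs."""
    def ecdf(xs, t):
        return sum(x <= t for x in xs) / len(xs)
    return max(abs(ecdf(pos, t) - ecdf(neg, t)) for t in set(pos) | set(neg))

def overlap_coefficient(pos, neg, bins=20, lo=-1.0, hi=1.0):
    """Sum over shared bins of the smaller normalized histogram mass."""
    def hist(xs):
        counts = [0] * bins
        for x in xs:
            k = int((x - lo) / (hi - lo) * bins)
            counts[min(max(k, 0), bins - 1)] += 1  # clamp edge values
        return [c / len(xs) for c in counts]
    return sum(min(a, b) for a, b in zip(hist(pos), hist(neg)))
```

The pairwise AUROC loop is quadratic in sample size, so a subsample or a rank-based implementation would be needed for the full node-level distributions. Perfect separation gives AUROC 1, KS distance 1, and overlap 0; identical distributions give 0.5, 0, and 1 respectively.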