-
Conceptual mismatch and insufficient definition of the target: the paper frames the prediction target as an “HMF proxy”, but it appears to be a within-tree halo/progenitor mass histogram (often described as normalized) rather than a standard halo mass function (number density per comoving volume at a given redshift) (Sec. 1, Sec. 2.5, Sec. 3.3, Sec. 3.5, Sec. 4). Relatedly, it is unclear at what times/redshifts the masses entering the histogram are taken (all nodes across all snapshots? only a specific scale factor? only progenitors above a threshold?). Each choice fundamentally changes the physical meaning of the distribution.
Recommendation: In Sec. 1 and Sec. 2.5, (i) formally define the conventional HMF (e.g., $\mathrm{d}n/\mathrm{d}\log M$ per comoving volume at fixed redshift) and (ii) precisely define the paper’s target with an explicit formula. Specify exactly which nodes contribute to the histogram (all halos across the full tree vs. only at a chosen snapshot; inclusion/exclusion of subhalos; any mass thresholds) and the redshift/scale-factor convention. To avoid overclaiming, rename the target throughout to something like “per-tree progenitor mass distribution / within-tree mass spectrum” unless you provide a quantitative mapping to an HMF. If the goal is ultimately HMF emulation, add a concrete demonstration in Sec. 3.3 or Sec. 3.5 that aggregates predictions across many trees (with correct weighting/normalization, and specifying the effective volume and selection function) and compares to an HMF measured directly from the simulation.
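
To make the distinction concrete, the following is a minimal sketch of how a conventional $\mathrm{d}n/\mathrm{d}\log_{10}M$ estimate is built from halo masses at a single snapshot. The function name `hmf_from_masses`, the box size, and the toy masses are illustrative assumptions, not values taken from the manuscript.

```python
import numpy as np

def hmf_from_masses(masses, box_size, log10_edges):
    """Estimate dn/dlog10(M) from halo masses at ONE redshift.

    masses      : halo masses selected at a single snapshot
    box_size    : comoving box side length (assumed value, same length
                  units as the desired number density)
    log10_edges : bin edges in log10(M)
    """
    counts, _ = np.histogram(np.log10(masses), bins=log10_edges)
    volume = box_size ** 3               # comoving volume of the box
    dlog10m = np.diff(log10_edges)       # bin widths in dex
    return counts / (volume * dlog10m)   # number density per dex

# Toy data only: masses drawn log-uniformly, hypothetical 100-unit box.
rng = np.random.default_rng(0)
toy_masses = 10.0 ** rng.uniform(10.0, 14.0, size=5000)
edges = np.linspace(10.0, 14.0, 21)      # 21 edges -> 20 bins
hmf = hmf_from_masses(toy_masses, box_size=100.0, log10_edges=edges)
```

A per-tree normalized histogram drops both the volume factor and the single-redshift selection, so it cannot be compared with this quantity unless the manuscript supplies an explicit mapping.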
-
Potential leakage / ill-posedness of the main task: node features include (log) mass, and the target is a histogram of masses from the same tree. As written, the pipeline may succeed primarily by compressing and reconstructing the mass distribution (a near-autoencoding problem) rather than learning meaningful formation-history/topology-to-population relations (Sec. 2.2–2.5, Sec. 3.3). Without strong baselines/ablations, it is difficult to assess what is learned beyond “mass-in, mass-out.”
Recommendation: Add explicit baselines and ablations in Sec. 3.3: (i) a non-graph baseline that predicts the histogram from simple mass-only summaries (moments/quantiles, counts above thresholds, or even the raw list of masses pooled via a DeepSets-style model); (ii) an ablation removing the mass feature from node inputs (using only concentration, $v_{\mathrm{max}}$, scale factor, and structure) to test whether non-mass information contributes; and (iii) an ablation using only mass (and optionally scale factor) to quantify how much topology/other features help. Interpret the results accordingly in Sec. 3.5/4: if mass-only is near-optimal, the claim should shift from “formation-history inference” to “learned compression of mass spectra.”
-
Inconsistent and under-specified mass preprocessing and binning, including a contradiction between the single-log transform in Sec. 2.1 (Eq. (1)) and the double-log statement in Sec. 3.4 (“$\log_{10}(\log_{10}(\mathrm{Mass}))$”), as well as ambiguity about log base, offsets, and whether standardized values are being conflated with “denormalized” ones (Sec. 2.1, Sec. 2.5, Sec. 3.4; Fig. 5 as referenced). This undermines interpretability of the reported mass ranges and the predicted histograms.
Recommendation: Unify the preprocessing description across Sec. 2.1, Sec. 2.5, and Sec. 3.4 by giving one explicit, end-to-end definition of the mass variable used for (a) node features and (b) histogram binning/targets. State: (i) whether you use $\ln$ or $\log_{10}$; (ii) whether there is one log or two; (iii) any dimensionless scaling (e.g., $\log_{10}(M/M_0 + \epsilon)$ with $M_0$ and $\epsilon$ specified in the same units as $M$); and (iv) where standardization is applied and how it is inverted for plotting. Provide a table of the 20 bin edges both in transformed space and in physical mass units, and ensure figure axis labels/captions match the actual transform. Correct Sec. 3.4 if the double-log is a typo/artifact; if a double-log is genuinely used, justify it and discuss its implications for physical interpretation.
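
A minimal sketch of the requested end-to-end definition, assuming a single $\log_{10}$ followed by standardization (the manuscript may use something different). The helper names `forward` and `inverse` and the toy masses are hypothetical.

```python
import numpy as np

def forward(mass, mu, sigma):
    """Physical mass -> standardized log10 mass (single log, assumed)."""
    return (np.log10(mass) - mu) / sigma

def inverse(z, mu, sigma):
    """Standardized log10 mass -> physical mass (exact inverse of forward)."""
    return 10.0 ** (z * sigma + mu)

# Toy masses; mu and sigma would come from the training set only.
masses = np.array([1e10, 3e11, 5e12, 2e14])
log_m = np.log10(masses)
mu, sigma = log_m.mean(), log_m.std()

z = forward(masses, mu, sigma)
z_edges = np.linspace(z.min(), z.max(), 21)   # 21 edges -> 20 bins
mass_edges = inverse(z_edges, mu, sigma)      # same edges in physical units

# Two-column table: transformed edge, physical-mass edge.
edge_table = np.column_stack([z_edges, mass_edges])
```

Reporting `edge_table` alongside the figures would let a reader check that axis labels correspond to the transform actually applied.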
-
Training protocol and the role of the GMM in the claimed “joint” GNN–GMM–FFNN optimization are unclear and may be technically inconsistent (Sec. 2.4–2.6). The forward pass uses GMM posterior responsibilities, but scikit-learn's `GaussianMixture` is fit by expectation-maximization outside PyTorch's autograd graph. It is therefore unclear whether gradients can propagate through the GMM to the GNN, whether the GMM is frozen, and whether embeddings drift away from the fitted mixture during training.
Recommendation: In Sec. 2.4 and Sec. 2.6, provide an explicit step-by-step training schedule (preferably pseudocode/Algorithm box): (1) whether/how the GNN is pre-trained (e.g., on the cosmology task in Sec. 3.1) and whether it is frozen; (2) when the GMM is fit (once on training embeddings vs. refit periodically/alternating optimization); (3) whether GMM parameters are held fixed; and (4) whether responsibilities are treated as constants for backprop. If the GMM is fit once and frozen, rephrase “joint training” accordingly and discuss embedding drift; if you refit GMM during training, describe the alternation frequency and convergence criteria. If you implemented a differentiable mixture model in torch, describe it and cite/justify it.
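
For reference, the sketch below shows what a "fit once, then freeze" GMM reduces to at inference time: the responsibilities are a closed-form softmax over per-component log-densities. It assumes diagonal covariances for brevity, and all parameter values are toy numbers. The same expression written with torch operations would be differentiable with respect to the embeddings, which is a separate question from differentiating through the EM fit itself.

```python
import numpy as np

def responsibilities(x, weights, means, variances):
    """Posterior p(k | x) for a diagonal-covariance GMM with FIXED parameters.

    x: (n, d) embeddings; weights: (K,); means, variances: (K, d).
    """
    diff = x[:, None, :] - means[None, :, :]                 # (n, K, d)
    log_pdf = -0.5 * np.sum(
        diff ** 2 / variances + np.log(2.0 * np.pi * variances), axis=2
    )                                                        # (n, K)
    log_joint = np.log(weights) + log_pdf
    log_joint -= log_joint.max(axis=1, keepdims=True)        # numerical stability
    r = np.exp(log_joint)
    return r / r.sum(axis=1, keepdims=True)                  # rows sum to 1

# Toy two-component mixture in 2-D (illustrative values only).
weights = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [4.0, 4.0]])
variances = np.ones((2, 2))
x = np.array([[0.0, 0.0], [4.0, 4.0], [2.0, 2.0]])
r = responsibilities(x, weights, means, variances)
```

If the GNN continues to train after the mixture is fit, the embeddings `x` move while `weights`, `means`, and `variances` do not, which is the drift concern raised above.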
-
Insufficient description of simulations, cosmology coverage, and dataset construction; possible train/test leakage due to non-independence of trees (Sec. 2.1–2.2, Sec. 3.1, Sec. 3.4). The manuscript does not specify the simulation code, box size/volume, mass resolution, halo finder and mass definition (e.g., $M_{200c}$), tree-builder, redshift range/snapshots, or how the 1000 trees are selected (by root mass? random halos? hosts/subhalos?). It is also unclear whether trees span multiple cosmologies and how $(\Omega_m,\sigma_8)$ labels are assigned per tree, and whether the split avoids leakage across cosmologies or correlated halos in the same volume.
Recommendation: Add a dedicated dataset/simulation subsection (expand Sec. 2.1) containing: simulation name/code, box size (volume), particle mass and force resolution, halo finder, halo mass definition, tree-building algorithm, snapshot/redshift coverage, and units. Describe tree selection criteria (root halo mass cuts, redshift of root, host/subhalo handling) and provide summary statistics of graph sizes (#nodes/#edges) and root masses. Clarify whether the dataset includes multiple cosmologies; if so, state how many and the parameter ranges, and ensure splits are performed by cosmology (or by simulation realization) to test generalization rather than memorization. If all trees come from one volume, discuss correlation risks and consider split strategies that reduce shared-mode leakage (e.g., by spatial region, by halo mass bins, or by simulation realizations if available).
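
A split by cosmology can be sketched as follows, assuming each tree carries an integer identifier for the simulation or cosmology it came from. The function name `split_by_group` and the toy layout of 25 cosmologies with 40 trees each are hypothetical.

```python
import numpy as np

def split_by_group(group_ids, test_fraction=0.2, seed=0):
    """Return boolean (train, test) masks so that no group spans both sets."""
    rng = np.random.default_rng(seed)
    groups = np.unique(group_ids)
    rng.shuffle(groups)                                   # random group order
    n_test = max(1, int(round(test_fraction * len(groups))))
    test_mask = np.isin(group_ids, groups[:n_test])       # whole groups held out
    return ~test_mask, test_mask

# Toy layout: 1000 trees, 25 hypothetical cosmologies, 40 trees each.
cosmology_id = np.repeat(np.arange(25), 40)
train_mask, test_mask = split_by_group(cosmology_id)
```

The same helper applies if the grouping variable is a spatial region or a simulation realization rather than a cosmology.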
-
Target construction, output constraints, and loss/metrics are not mathematically aligned: the target is described as a normalized histogram/probabilities (Sec. 2.5, Sec. 3.3), but the FFNN uses independent sigmoids per bin, which does not enforce sum-to-one, and MSE alone is hard to interpret for histogram comparisons (Sec. 2.5–2.7).
Recommendation: First, state explicitly in Sec. 2.5 whether the target is (a) counts per bin or (b) a normalized probability mass function. If (b), enforce normalization (softmax over 20 bins or explicit L1 normalization) and use a distribution-aware loss (cross-entropy/KL/JS; optionally add EMD/Wasserstein as an evaluation metric). If (a), remove probability language, justify sigmoid bounds (or use nonnegative outputs such as softplus), and evaluate with metrics appropriate for counts (possibly Poisson/negative-binomial likelihood). In Sec. 3.3, supplement global MSE with per-bin error plots (especially high-mass tail), and at least one distributional metric (KL/JS or EMD) to demonstrate shape fidelity beyond average squared error.
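
For option (b), a minimal sketch of a normalized output layer and the distributional metrics mentioned above. It assumes equal-width bins, so the earth mover's distance is expressed in units of bin index; the function names and toy numbers are illustrative.

```python
import numpy as np

def softmax(logits):
    """Map raw outputs to a probability vector that sums to one."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; inputs are clipped to avoid log(0)."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return np.sum(p * np.log(p / q), axis=-1)

def js_divergence(p, q):
    """Symmetric Jensen-Shannon divergence (bounded by ln 2)."""
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def emd_1d(p, q):
    """1-D earth mover's distance for equal-width bins, in bin-index units."""
    return np.sum(np.abs(np.cumsum(p - q, axis=-1)), axis=-1)

# Toy comparison: a uniform prediction against a skewed target.
target = np.array([0.5, 0.3, 0.2, 0.0])
pred = softmax(np.zeros(4))
scores = (kl_divergence(target, pred), js_divergence(target, pred), emd_1d(target, pred))
```

Unlike MSE, `emd_1d` penalizes mass placed in a distant bin more than mass placed in a neighbouring one, which matters for the high-mass tail.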
-
Insufficient evidence that the specific hybrid design (GNN $\rightarrow$ GMM responsibilities $\rightarrow$ FFNN) is necessary: the evaluation reports a single headline test MSE on 100 trees with limited diagnostics, no uncertainty across random seeds/splits, and no ablation demonstrating the value of the GMM step versus directly using embeddings (Sec. 3.3, Sec. 3.5).
Recommendation: In Sec. 3.3 (or a new ablation subsection), add: (i) direct GNN-embedding $\rightarrow$ FFNN prediction (no GMM) and compare performance; (ii) alternative clustering/bottlenecks (e.g., k-means responsibilities, vector quantization) to test whether the benefit is “mixture modeling” specifically; (iii) multiple runs with different seeds and/or multiple data splits, reporting mean $\pm$ std of key metrics; and (iv) stratified performance by root mass, tree size, and (if kept) GMM component. This will quantify what the GMM contributes and reveal systematic failure modes.
-
GMM clustering analysis is currently qualitative and not tied to physical/structural differences among merger trees (Sec. 3.2). BIC selection is presented on a very sparse grid in Fig. 2 and cluster meaning is inferred primarily from low-dimensional projections, which can be misleading without quantitative characterization.
Recommendation: Strengthen Sec. 3.2 by: (i) scanning a contiguous range of component counts (e.g., $K=1$–$20$) with multiple initializations and reporting variability; (ii) specifying which split (train only) was used to fit/score the GMM; (iii) reporting mixture weights and quantitative cluster separation (e.g., silhouette on embeddings; or within-/between-component covariance diagnostics); and (iv) summarizing each component with physically interpretable tree statistics (root mass, node count, formation time proxies like last major merger scale factor, concentration/$v_{\mathrm{max}}$ distributions, etc.). If components do not map cleanly to physical categories, state that explicitly as a limitation.
-
The cosmology pretext task is under-specified and its scientific meaning is unclear without target ranges/normalization, per-parameter errors, and proper baselines; it is also unclear whether the evaluation tests generalization across cosmologies or merely across trees within the same cosmology (Sec. 3.1, Sec. 4).
Recommendation: In Sec. 3.1, specify: the number of cosmologies, parameter ranges for $(\Omega_m,\sigma_8)$, whether targets are standardized, the regression head architecture, and the exact metric definition. Report per-parameter RMSE/MAE in physical units (not only combined MSE), include predicted-vs-true scatter plots, and compare against simple baselines (predict mean; linear regression on hand-crafted tree summaries). If multiple cosmologies exist, evaluate with a split by cosmology (hold out cosmologies) to support the claim that embeddings capture cosmology information rather than overfitting to particular simulations/realizations.
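
The per-parameter reporting and the "predict mean" baseline could look like the sketch below. All arrays are synthetic placeholders: the parameter ranges are assumed, and `y_model` is random noise added to the truth as a stand-in for real model output, not a result from the manuscript.

```python
import numpy as np

def per_parameter_rmse(y_true, y_pred):
    """RMSE per column, e.g. one value each for Omega_m and sigma_8."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2, axis=0))

rng = np.random.default_rng(0)

# Synthetic labels in hypothetical ranges: columns are (Omega_m, sigma_8).
y_train = np.column_stack([rng.uniform(0.1, 0.5, 800), rng.uniform(0.6, 1.0, 800)])
y_test = np.column_stack([rng.uniform(0.1, 0.5, 200), rng.uniform(0.6, 1.0, 200)])

# Placeholder for model predictions (NOT real results).
y_model = y_test + rng.normal(0.0, 0.02, size=y_test.shape)

# Baseline: always predict the training-set mean of each parameter.
y_baseline = np.tile(y_train.mean(axis=0), (len(y_test), 1))

rmse_model = per_parameter_rmse(y_test, y_model)
rmse_baseline = per_parameter_rmse(y_test, y_baseline)
skill = 1.0 - rmse_model / rmse_baseline   # 0 = no better than predicting the mean
```

Reporting `skill` per parameter makes it clear whether a small combined MSE reflects real predictive power or simply a narrow target range.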