-
Scientific framing: the paper repeatedly presents the task as “predicting halo assembly bias,” but the model actually predicts an internal formation-time proxy. Assembly bias is fundamentally a statement about how halo clustering (large-scale bias) depends on secondary properties such as formation time at fixed mass, and that dependence is not evaluated here (Sec. 1, Sec. 3.5, Sec. 4).
Recommendation: Either (a) reframe throughout (Abstract, Sec. 1, Sec. 3.5, Sec. 4, and figure titles/captions) as “predicting a formation-time proxy from merger trees,” explicitly treating assembly bias only as motivation; or (b) add a direct assembly-bias validation: at fixed mass, split halos by true and by predicted formation-time proxy and measure a standard assembly-bias observable (e.g., large-scale bias from the two-point correlation function or bias estimator), reporting the strength of the signal and how well predictions reproduce it (Sec. 3.5). Ensure title and conclusions match the final scope.
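To make option (b) concrete, the fixed-mass split could be sketched as below. This covers only the splitting step, not the bias measurement itself. The function names, the 0.25 dex bin width, and the convention that a smaller proxy value means earlier formation are assumptions for illustration, not taken from the manuscript.

```python
import math
from collections import defaultdict
from statistics import median


def split_at_fixed_mass(mass, proxy, bin_width_dex=0.25):
    """Return one boolean per halo: True if its proxy lies below the
    median proxy of its log-mass bin ("early-forming", by assumption)."""
    bins = defaultdict(list)
    for i, m in enumerate(mass):
        bins[math.floor(math.log10(m) / bin_width_dex)].append(i)
    early = [False] * len(mass)
    for members in bins.values():
        med = median(proxy[i] for i in members)
        for i in members:
            early[i] = proxy[i] < med
    return early


def split_agreement(mass, true_proxy, pred_proxy, bin_width_dex=0.25):
    """Fraction of halos assigned to the same half by true and predicted proxy."""
    a = split_at_fixed_mass(mass, true_proxy, bin_width_dex)
    b = split_at_fixed_mass(mass, pred_proxy, bin_width_dex)
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

The large-scale bias of the early and late subsamples would then be measured separately for the true-proxy and predicted-proxy splits and compared.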
-
Target definition is inconsistent and insufficiently motivated. Sec. 2.1.2 defines the target both via the “earliest” main-branch node and as the “median scale factor” of main-branch nodes, whereas the rest of the manuscript refers only to the median (Sec. 2.1.2, Sec. 3.1, Sec. 3.3). This undermines reproducibility and comparability to standard formation-time definitions in the literature.
Recommendation: Unify the definition in Sec. 2.1.2 with a precise formula/pseudocode (using `mask_main` and scale factor), and ensure consistent terminology in Sec. 3.1, Sec. 3.3, Sec. 3.5, and all figure labels. Add brief motivation and comparison to standard definitions (e.g., $a_{1/2}$ from main-branch mass growth, or formation redshift when a given mass fraction is assembled). If feasible, compute at least one conventional formation-time metric on the same halos and report its correlation with the chosen proxy (Sec. 3.5).
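As an illustration of the level of precision requested, the two candidate definitions and a conventional $a_{1/2}$ could be written as follows. This is a sketch under assumptions: per-node lists `scale_factor`, `mass`, and a boolean `mask_main` are assumed to be available, the final halo is taken to be the main-branch node with the largest scale factor, and linear interpolation in mass is one choice among several.

```python
from statistics import median


def _main_branch(values, mask_main):
    """Select the entries of `values` that lie on the main branch."""
    selected = [v for v, m in zip(values, mask_main) if m]
    if not selected:
        raise ValueError("tree has no main-branch nodes")
    return selected


def median_scale_factor(scale_factor, mask_main):
    """Candidate proxy 1: median scale factor over main-branch nodes."""
    return median(_main_branch(scale_factor, mask_main))


def earliest_scale_factor(scale_factor, mask_main):
    """Candidate proxy 2: scale factor of the earliest main-branch node."""
    return min(_main_branch(scale_factor, mask_main))


def half_mass_scale_factor(scale_factor, mass, mask_main):
    """Conventional a_1/2: first time the main-branch mass reaches half
    of its final value, linearly interpolated between snapshots."""
    pairs = sorted(_main_branch(list(zip(scale_factor, mass)), mask_main))
    threshold = 0.5 * pairs[-1][1]  # half of the final main-branch mass
    for k, (a, m) in enumerate(pairs):
        if m >= threshold:
            if k == 0:
                return a
            a_prev, m_prev = pairs[k - 1]
            return a_prev + (threshold - m_prev) * (a - a_prev) / (m - m_prev)
```

Reporting the correlation between `median_scale_factor` and `half_mass_scale_factor` on the same halos would address the comparability concern directly.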
-
Potential information leakage / tautology: the target is computed from scale factor values in the tree, while scale factor is also an input node feature and the node-regularizer explicitly forces embeddings to be maximally correlated with scale factor (Sec. 2.1.2, Sec. 2.3.2, Sec. 3.4). This raises the possibility that the network is largely learning to read out a statistic of an already-provided feature rather than learning assembly history in a physically meaningful sense.
Recommendation: Add targeted baselines and ablations in Sec. 3.3–3.4: (i) non-graph baselines using only summary statistics of node scale factors (and other node features) without message passing; (ii) remove scale factor from node inputs and retrain; (iii) remove the node-regularizer (MSE-only) and compare; (iv) optionally, remove `mask_main`-related information if it is provided to/used by the model at inference (clarify in Methods). Report MSE/$R^2$/MAE for each setting. This will demonstrate whether graph structure and non-time features contribute beyond directly encoding/regularizing time.
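Baseline (i) can be very simple. The sketch below computes per-tree summary statistics of the node scale factors and the $R^2$ of a one-feature least-squares fit to the target. The feature set and function names are illustrative assumptions. If a single summary statistic already reaches an $R^2$ comparable to the GNN, the tautology concern would be confirmed.

```python
from statistics import mean, median, pstdev


def summary_features(node_scale_factors):
    """Order-independent summaries of one tree's node scale factors
    (no graph structure, no message passing)."""
    return {
        "mean": mean(node_scale_factors),
        "median": median(node_scale_factors),
        "min": min(node_scale_factors),
        "max": max(node_scale_factors),
        "std": pstdev(node_scale_factors),
    }


def univariate_r2(x, y):
    """R^2 of the ordinary least-squares fit y ~ slope * x + intercept.
    Returns 0.0 if x or y is constant."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    return 1.0 - ss_res / syy
```

In practice the fit should be made on the training split and scored on the test split. The in-sample version above is only the quickest first check.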
-
Missing baselines and loss/architecture ablations make it hard to assess the added value of (a) the GNN versus simpler models and (b) the custom regularizers. With test $R^2 \approx 0.48$, it is unclear how much the approach improves over straightforward regressors on hand-crafted features or pooled node features (Sec. 2.2–2.3, Sec. 3.3–3.5).
Recommendation: In Sec. 3.3, include at least: (1) mean-predictor baseline; (2) linear regression and/or random forest on hand-crafted tree summaries (e.g., main-branch mass history summaries, node-feature moments, accretion-rate summaries); (3) an MLP on globally pooled node features; (4) GNN with MSE-only; (5) GNN with MSE+node-reg only; (6) GNN with MSE+edge-reg only (if meaningful). Summarize results in a table and discuss what each comparison implies about the contribution of graph structure and regularization (Sec. 3.5).
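For the results table, the three metrics and the mean-predictor reference in item (1) could be computed as sketched below. These are standard definitions written in plain Python, and the function names are hypothetical. By construction the mean predictor has $R^2 = 0$ when evaluated on the data that defined the mean, which makes it the natural floor for the reported $R^2 \approx 0.48$.

```python
from statistics import mean


def mse(y_true, y_pred):
    """Mean squared error."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def r2(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    my = mean(y_true)
    ss_tot = sum((t - my) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined for a constant target")
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot


def mean_predictor(y_train, n_test):
    """Baseline (1): predict the training-set mean for every test halo."""
    return [mean(y_train)] * n_test
```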
-
Core methodological details are underspecified or internally inconsistent, limiting reproducibility and diagnosis of the edge-regularization failure. Notably: (i) the exact GCN layer/variant is not stated; (ii) how edge attributes enter the convolution is described inconsistently (per-message concatenation vs. pre-aggregation of incoming edges before the first layer); (iii) Pearson-correlation regularizers lack precise definitions (tensor shapes, scalarization, masking, per-graph vs batch computation) (Sec. 2.2, Sec. 2.3.2–2.3.4, Sec. 3.2–3.4).
Recommendation: Expand Sec. 2.2 and Sec. 2.3 with implementable specificity: name the library layer (e.g., PyG `GCNConv`) and its options (self-loops, normalization), clarify directed vs undirected handling for merger-tree edges, and provide explicit formulas for the node/edge regularizers: define the scalar node score used (e.g., mean over embedding dims or a learned projection), the index set over which correlation is computed (all nodes vs main-branch nodes), and how correlations are aggregated across graphs/batches (with variance safeguards). Make Sec. 2.2 and Sec. 3.2 fully consistent and include pseudocode in an appendix if needed.
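One possible explicit definition of the node regularizer is sketched below, to show the choices that need to be stated. It is an assumption, not the authors' implementation: the scalar node score is the mean over embedding dimensions, the correlation is computed per graph over an optional node mask, the loss is the mean of $1 - r$ over graphs, and a small `eps` guards against zero variance.

```python
import math


def pearson(x, y, eps=1e-12):
    """Pearson r with a variance safeguard: a (near-)constant input
    gives r close to 0 instead of a division by zero."""
    n = len(x)
    if n < 2:
        return 0.0
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt((vx + eps) * (vy + eps))


def node_regularizer(embeddings, scale_factors, masks=None):
    """Mean over graphs of 1 - r(node score, scale factor).

    embeddings[g][i] is the embedding vector of node i in graph g;
    scale_factors[g][i] is that node's scale factor;
    masks[g][i] (optional) selects the nodes entering the correlation.
    """
    losses = []
    for g, (emb, a) in enumerate(zip(embeddings, scale_factors)):
        keep = masks[g] if masks is not None else [True] * len(a)
        scores = [sum(h) / len(h) for h, k in zip(emb, keep) if k]
        times = [t for t, k in zip(a, keep) if k]
        losses.append(1.0 - pearson(scores, times))
    return sum(losses) / len(losses)
```

Computing the correlation per graph and then averaging differs from computing a single correlation over all nodes in a batch. The manuscript should state which is used.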
-
Edge-level regularization appears inoperative ($\approx 0$ throughout training), raising the possibility of a bug (e.g., detached tensors), a constant/zero-variance statistic, or a definition that yields vanishing gradients. Additionally, it is unclear how “edge embeddings” are obtained, since a vanilla GCN produces node embeddings (Sec. 2.3.3–2.3.4, Sec. 3.4).
Recommendation: In Sec. 3.4 (and Methods), precisely define how edge embeddings are constructed (e.g., MLP on $[h_u, h_v, e_{uv}]$ or a separate edge network), and verify gradient flow from the edge term (e.g., report nonzero gradient norms attributable to that term on a batch). Add a diagnostic: compute the relevant correlations outside training using the current definitions to check variance and expected magnitude; include a zoomed/inset plot for the edge-regularizer curve and report representative values with adequate precision. If the term is fundamentally ill-posed for the current architecture, remove it from claims and present it as future work, or replace it with a better-defined edge-aware auxiliary task (Sec. 3.5).
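A framework-agnostic way to run the gradient-flow check is a central finite-difference estimate of the edge term's sensitivity to the parameters, sketched below. This stands in for an autograd-based check and is only practical for a small parameter subset. Here `loss_fn` is a hypothetical callable that maps a flat parameter list to the edge-regularizer value on one fixed batch.

```python
import math


def finite_difference_grad_norm(loss_fn, params, h=1e-6):
    """L2 norm of the central finite-difference gradient of loss_fn at params.

    A norm of (numerically) zero means the term cannot influence these
    parameters, whether through a detached tensor, a constant statistic,
    or a definition that is flat in the parameters.
    """
    grads = []
    for i in range(len(params)):
        up = list(params)
        down = list(params)
        up[i] += h
        down[i] -= h
        grads.append((loss_fn(up) - loss_fn(down)) / (2.0 * h))
    return math.sqrt(sum(g * g for g in grads))


# Toy stand-ins for the edge term (illustrative only):
def inoperative_term(params):
    """Ignores the parameters, as a detached or constant term would."""
    return 3.0


def operative_term(params):
    """Depends on the parameters, so its gradient norm is nonzero."""
    return params[0] ** 2 + 2.0 * params[1]
```

Reporting this norm (or its autograd equivalent) for the edge term alone, on at least one batch, would distinguish a bug from an ill-posed definition.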
-
Robustness/uncertainty is not quantified: results are reported for a single split and (apparently) a single seed on only $1000$ graphs, limiting confidence in the stated performance and conclusions (Sec. 2.1.4, Sec. 2.4–2.5, Sec. 3.3–3.4).
Recommendation: Run multiple seeds and/or repeated random splits (or $k$-fold CV) and report mean $\pm$ std for MSE/$R^2$/MAE on the test set (Sec. 3.3). Mark the model-selection criterion (best validation epoch) and specify whether selection is by validation MSE or total loss (Sec. 2.4–2.5, Sec. 3.2). If compute is limited, at minimum add a sensitivity check over a few seeds and explicitly caveat conclusions in Sec. 3.5 and Sec. 4.
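A minimal sketch of the requested protocol follows, assuming a shuffled $k$-fold split over tree indices and a per-fold (or per-seed) list of test metrics collected by the authors' own training loop. The function names are hypothetical.

```python
import random
from statistics import mean, stdev


def kfold_indices(n, k, seed=0):
    """Return k (train_indices, test_indices) pairs over range(n).
    Indices are shuffled once with a fixed seed, so splits are reproducible."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [
        ([j for f, fold in enumerate(folds) if f != i for j in fold], folds[i])
        for i in range(k)
    ]


def mean_and_std(values):
    """Mean and sample standard deviation of a metric across folds or seeds."""
    return mean(values), (stdev(values) if len(values) > 1 else 0.0)
```

With $1000$ graphs and $k = 5$, each test fold holds $200$ trees. The per-fold test $R^2$ values would be passed to `mean_and_std` and reported as mean $\pm$ std.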
-
Dataset provenance and representativeness are inadequately described. The manuscript does not clearly state the simulation name/code, cosmology, resolution, halo finder, merger-tree builder, selection cuts, mass/redshift range, or tree-size distributions, all of which are key context for assessing generality and potential leakage (e.g., correlated objects across splits) (Sec. 2.1, Sec. 3.1, Sec. 3.5, Sec. 4).
Recommendation: In Sec. 2.1, add a concise dataset paragraph: simulation and cosmology, box size and mass resolution, halo finder and tree builder, snapshots/redshift range, halo selection (mass cuts, central/satellite, etc.), and how $1000$ trees were sampled. In Sec. 3.1, report distributions of halo mass, proxy values, and graph sizes (nodes/edges/depth). Briefly discuss possible split leakage risks (e.g., if trees share ancestry or are environmentally correlated) and how splits were constructed to mitigate this.
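One way to mitigate split leakage is to assign whole groups of related trees to a single split, as sketched below. The grouping key is an assumption: it could be a spatial sub-volume index or a shared host identifier, whichever the authors judge captures the correlation. The 80/10/10 fractions are placeholders.

```python
import random
from collections import defaultdict


def group_split(group_ids, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split tree indices into len(fractions) subsets so that every group
    lands in exactly one subset. group_ids[i] is the group of tree i."""
    groups = defaultdict(list)
    for i, g in enumerate(group_ids):
        groups[g].append(i)
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    targets = [f * len(group_ids) for f in fractions]
    splits = [[] for _ in fractions]
    s = 0
    for key in keys:
        # Advance to the next subset once the current one has reached its target size.
        while s < len(splits) - 1 and len(splits[s]) >= targets[s]:
            s += 1
        splits[s].extend(groups[key])
    return splits
```

Subset sizes match the requested fractions only approximately when group sizes are uneven, so the realized sizes should be reported alongside the results.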