-
The central methodological claim is not supported by the current GNN experiments. The GNN is trained on only $\sim 10$–$50$ trees for $\sim 2$ epochs (a tiny fraction of the $\sim 15\,\mathrm{k}$ available trees) and evaluated on an extremely small reduced test subset; the resulting models are clearly underfit and strongly biased toward class 1 (Sec. 2.5.1–2.5.2, Sec. 3.3–3.5, Conclusion). With such limited training, it is not possible to assess whether the architecture/contrastive objective is competitive or whether the observed ablation effects are robust.
Recommendation: Either (i) substantially scale up GNN training (more trees and epochs, to near-convergence) and report learning curves (train/val loss, macro/weighted F1 vs epoch) plus final results on the official held-out test set (e.g., CS_tree_test.pt), or (ii) explicitly reframe the paper as a compute-limited pilot/proof-of-concept and avoid comparative performance conclusions. In both cases, quantify for every GNN run: $\#$trees, $\#$transitions, class counts, $\#$epochs, hardware/time budget, and provide variability across multiple random seeds/subsamples (mean$\pm$std), since small-subset metrics are very noisy.
-
Evaluation protocol and data partitioning are not unified across RF and GNN, undermining comparability and risking leakage/confounding from correlated transition samples within trees (Sec. 2.1, Sec. 2.5.1, Sec. 3.1–3.3). The RF is tuned/evaluated on larger validation subsets, while the GNN reports results on a reduced test subset; it is unclear whether both models are ever evaluated on the same trees/transitions or on the full provided test split.
Recommendation: In Sec. 2.1 and Sec. 2.5, provide a single explicit splitting description at the *tree level* (no tree in multiple splits) and report, for each split used in each experiment: $\#$trees, $\#$nodes, $\#$edges/transitions, and per-class supports. Then enforce a common evaluation: (a) evaluate both RF and GNN on the same held-out test set (preferably the full CS_tree_test.pt), or (b) if compute forces reduced subsets, use identical reduced subsets (same tree IDs) for both models. Add uncertainty estimates (multiple seeds/subsamples) for all headline metrics in Sec. 3.2–3.5.
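A minimal sketch of what a tree-level split with per-class supports could look like is given here for concreteness. The function names, the `(tree_id, label)` layout, and the 80/10/10 fractions are illustrative assumptions, not taken from the manuscript; if the official split files are used, only the support-counting step applies.

```python
import random

def split_tree_ids(tree_ids, fractions=(0.8, 0.1, 0.1), seed=0):
    """Partition tree IDs into train/val/test so that no tree spans two splits.

    The fractions are hypothetical defaults chosen for illustration.
    """
    ids = sorted(set(tree_ids))          # de-duplicate and fix order before shuffling
    random.Random(seed).shuffle(ids)     # seeded shuffle for reproducibility
    n_train = int(fractions[0] * len(ids))
    n_val = int(fractions[1] * len(ids))
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],   # remainder goes to test
    }

def class_supports(transitions, split_ids):
    """Count transitions per class for the trees belonging to one split.

    `transitions` is assumed to be a list of (tree_id, label) pairs.
    """
    keep = set(split_ids)
    counts = {}
    for tree_id, label in transitions:
        if tree_id in keep:
            counts[label] = counts.get(label, 0) + 1
    return counts
```

Applying the same `split_tree_ids` output (same seed, same tree IDs) to both the RF and the GNN pipelines would make the two evaluations directly comparable.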
-
Task definition and label construction are under-specified in a way that affects the physical meaning of samples and the learning problem (Sec. 2.1.3, Sec. 3.1). In particular: (i) merger trees can have multiple progenitors per descendant, and it is unclear whether transitions are drawn from all progenitor$\to$descendant edges or only from the main progenitor branch; (ii) the grouping of “decrease” and “no change” into class 0 is not quantified (exact ties are likely rare for floating-point values), and threshold sensitivity is not explored; (iii) concentration comparisons appear to use $x[u,1]$, where $x$ is $\log_{10}$(concentration) and possibly standardized. These transforms are monotonic and so preserve the sign of the change, but the choice of representation matters if any tolerance $\epsilon$ or rounding is applied.
Recommendation: In Sec. 2.1.3, specify precisely which edges become samples (all progenitor edges vs main branch only) and justify the physical interpretation for minor-progenitor edges if included. Report the fraction of exact ties and consider a tolerance $\delta$ (increase if $\Delta c>\delta$) or a three-way split (decrease / $\sim$constant / increase) at least as a sensitivity check. Explicitly state whether $Y_{\rm transition}$ uses linear concentration $c$, $\log_{10}(c)$, or standardized $\log_{10}(c)$, and keep notation consistent.
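The tolerance-based labeling suggested above could be sketched as follows. This assumes $\Delta c = c_v - c_u$ (descendant minus progenitor) and that both values are in the same representation; the function names and string labels are hypothetical.

```python
def label_transition(c_u, c_v, delta=0.0):
    """Three-way label for a progenitor u -> descendant v transition.

    Assumes delta_c = c_v - c_u, with both values in the same representation
    (linear, log10, or standardized log10).
    """
    diff = c_v - c_u
    if diff > delta:
        return "increase"
    if diff < -delta:
        return "decrease"
    return "constant"    # |delta_c| <= delta, includes exact ties

def tie_fraction(pairs, delta=0.0):
    """Fraction of (c_u, c_v) pairs labeled "constant" at tolerance delta."""
    labels = [label_transition(c_u, c_v, delta) for c_u, c_v in pairs]
    return labels.count("constant") / len(labels)
```

Calling `tie_fraction` with `delta=0.0` gives the fraction of exact ties; sweeping `delta` over a few values gives the requested sensitivity check.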
-
Potential information leakage / unclear forecasting setting: the RF feature set appears to include descendant and time-of-descendant information (e.g., mass$_v$, sf$_v$, $\Delta$sf), and the GNN message passing may aggregate from “future” (descendant) nodes depending on edge direction and whether reverse edges/undirected conversion is used (Sec. 2.2, Sec. 2.3.1, Sec. 2.5.1, Sec. 3.2–3.4). Without an explicit statement of whether the task is (a) causal forecasting using information available at time $\mathrm{sf}_u$ only or (b) retrospective classification using both endpoints, the scientific interpretation and fairness of comparisons are unclear.
Recommendation: Define the intended prediction setting in Sec. 1/Sec. 2.1.3: forecasting (use only information available at $\mathrm{sf}_u$) vs retrospective (allow using $\mathrm{sf}_v$ and/or node $v$ features). If forecasting, remove descendant features from the RF (or clearly separate results), and for the GNN restrict message passing and subgraphs so embeddings for $u$ cannot aggregate from nodes with $\mathrm{sf}>\mathrm{sf}_u$ (and explicitly document whether reverse edges are added and which PyG operators are used). If retrospective, make that explicit and ensure both RF and GNN have access to comparable endpoint information.
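For the forecasting setting, one simple way to enforce the constraint is to filter the message-passing edge list before it reaches the GNN. The sketch below assumes that $\mathrm{sf}$ increases with cosmic time and that an edge `(src, dst)` means `dst` aggregates from `src`; both the function name and the data layout are hypothetical.

```python
def causal_edges(edges, sf):
    """Keep only message-passing edges whose source is not later than its target.

    `edges` is a list of (src, dst) pairs, where dst aggregates from src.
    `sf` maps node id -> scale factor, assumed to increase with time.
    """
    return [(src, dst) for src, dst in edges if sf[src] <= sf[dst]]
```

Because every retained edge runs from an earlier (or equal) time to a later one, any multi-hop path into node $u$ starts at a node with $\mathrm{sf}\le\mathrm{sf}_u$, so stacking layers does not reintroduce future information.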
-
Mismatch between the edge-level target and the stated GNN prediction head: the label is per transition ($u\to v$), but the classifier is described as using only the progenitor embedding $z_u$ (Sec. 2.3.3, Sec. 3.3). This discards explicit information about $v$ and about the time interval $\Delta \mathrm{sf}$, and makes it harder to interpret what the model is learning, especially in trees where nodes can participate in multiple edges.
Recommendation: Reformulate the GNN as an explicit edge/transition classifier: predict using a function of both endpoints and interval, e.g., $h = \mathrm{MLP}([z_u, z_v, z_u-z_v, z_u\odot z_v, \Delta \mathrm{sf}])$ or a dedicated edge network. Clearly state the transition-to-embedding mapping under PyG batching (Sec. 2.5.1). If you keep a node-only head, justify why $z_u$ alone should determine the sign of concentration change and clarify how multi-edge cases are handled.
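To make the suggested edge input concrete, the concatenation $[z_u, z_v, z_u-z_v, z_u\odot z_v, \Delta \mathrm{sf}]$ can be sketched in plain Python as below. The function name is hypothetical, and embeddings are represented as lists of floats purely for illustration; in practice this would be a tensor operation feeding the MLP.

```python
def edge_features(z_u, z_v, delta_sf):
    """Build the per-transition input [z_u, z_v, z_u - z_v, z_u * z_v, delta_sf].

    z_u and z_v are equal-length embedding vectors; the result has
    4 * d + 1 entries for embeddings of dimension d.
    """
    if len(z_u) != len(z_v):
        raise ValueError("z_u and z_v must have the same dimension")
    diff = [a - b for a, b in zip(z_u, z_v)]   # antisymmetric in (u, v)
    prod = [a * b for a, b in zip(z_u, z_v)]   # element-wise (Hadamard) product
    return list(z_u) + list(z_v) + diff + prod + [delta_sf]
```

Since each transition gets its own feature vector, a node that participates in several edges contributes to several distinct inputs, which removes the ambiguity of a node-only head.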
-
GNN architecture/training and contrastive objective are underspecified, limiting reproducibility and making it difficult to diagnose class bias and underfitting (Sec. 2.3.1–2.3.3, Sec. 2.5.1–2.5.2). There are inconsistencies in wording (GraphConv vs GraphSAGE), missing hyperparameters (hidden sizes, $\#$layers, aggregation type, dropout/residuals, normalization), and incomplete contrastive details (multi-positive handling, what happens with zero positives in a batch, batch size in transitions, label balance). The loss is effectively supervised contrastive learning, but is not labeled as such.
Recommendation: Expand Sec. 2.3 and Sec. 2.5 to fully specify: exact PyG conv operator(s) and settings, $\#$layers, hidden dims, activations, normalization, dropout/residuals, optimizer/scheduler, learning rate/weight decay, batch construction (trees per batch; transitions per batch), $\alpha$ and $\tau$ values, and projection-head architecture. Clarify the contrastive formulation as supervised contrastive learning; specify whether positives are all same-label samples in-batch, how multiple positives are aggregated, and how anchors with no positives are treated. Consider class-balanced batching and/or class-weighted BCE/focal loss to address the observed degeneracy toward class 1 (Sec. 3.4).
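The points about multiple positives and anchors without positives can be made concrete with a small reference implementation. The sketch below follows one common formulation of a supervised contrastive loss (average of the log-probabilities over all in-batch positives, anchors with no positives skipped); it is not claimed to match the manuscript's implementation, and the function name and default $\tau$ are assumptions.

```python
import math

def supcon_loss(embeddings, labels, tau=0.1):
    """Supervised contrastive loss over one batch (reference sketch).

    Positives for anchor i are all other in-batch samples with the same label.
    Anchors with no positives are skipped. Embeddings must be nonzero vectors.
    """
    normed = []
    for z in embeddings:                       # L2-normalize each embedding
        norm = math.sqrt(sum(x * x for x in z))
        normed.append([x / norm for x in z])
    n = len(normed)
    per_anchor = []
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue                           # no positive in batch: skip anchor
        sims = {j: sum(a * b for a, b in zip(normed[i], normed[j])) / tau
                for j in range(n) if j != i}
        m = max(sims.values())                 # log-sum-exp for numerical stability
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        log_probs = [sims[p] - log_denom for p in positives]
        per_anchor.append(-sum(log_probs) / len(positives))
    # Returning 0.0 when no anchor has a positive is a convention assumed here.
    return sum(per_anchor) / len(per_anchor) if per_anchor else 0.0
```

With strongly imbalanced labels and small batches, many anchors of the minority class may be skipped, which is one reason to report the batch size in transitions and the per-batch label balance.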
-
Baseline feature specification and “fairness” of the RF vs GNN comparison are unclear: the RF uses hand-engineered environment/merger features (counts/mass of partners, major-merger flags), while the GNN’s ability to recover comparable information via message passing is not analyzed, and the RF feature dimensionality is inconsistent across sections (Sec. 2.2, Sec. 2.4, Sec. 3.1–3.2; Table 1; Fig. 6).
Recommendation: Provide a definitive RF feature list (ordered vector, final dimensionality, units/transforms) and reconcile all counts across Sec. 2.2/Sec. 2.4/Sec. 3 and Table 1. Add an ablation for parity: RF with only intrinsic+cosmology ($+\Delta \mathrm{sf}$) but without engineered environment features, and compare it to the GNN on the same split (Sec. 3.5). Separately, clarify the major-merger definition and justify/cite a conventional threshold. This will make claims about “graph-based learning vs feature engineering” more defensible.
-
Internal inconsistencies in reported metrics and model specification reduce confidence in the results: (i) the weighted-average recall in Table 2 appears inconsistent with the displayed per-class recalls and supports; (ii) the manuscript describes a 2-logit output head but applies BCEWithLogits against a single binary target, which is a mismatch between head and loss (Table 2; Sec. 2.3.3; Sec. 3.4).
Recommendation: Audit the evaluation pipeline and correct Table 2 (or explain any non-standard averaging). Make the classification head/loss mathematically consistent: either use a single logit with BCEWithLogits for $Y\in\{0,1\}$, or use 2 logits with softmax cross-entropy for a class index. State the exact target encoding and loss implementation details.
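The Table 2 check is simple arithmetic: the support-weighted average recall is $\sum_k n_k r_k / \sum_k n_k$, which also equals overall accuracy. A minimal sketch is given below; the function name is hypothetical and the numbers in the usage note are made up for illustration, not taken from the manuscript.

```python
def weighted_recall(per_class_recall, supports):
    """Support-weighted average of per-class recalls.

    Both arguments are dicts keyed by class label. The result equals overall
    accuracy, since recall_k * support_k is the number of correctly
    classified samples of class k.
    """
    total = sum(supports.values())
    return sum(per_class_recall[k] * supports[k] for k in supports) / total
```

For example, recalls of 0.2 and 0.9 with supports of 100 and 300 give $(20 + 270)/400 = 0.725$; a reported weighted recall that differs from this value for the tabulated per-class numbers would indicate an error or a non-standard averaging.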