-
Dataset provenance, cosmological coverage, and target definition are insufficiently specified, which prevents scientific interpretation of the reported $R^2$ and MSE (Sec. 2.1.1, Sec. 3.1, Sec. 4.1). The manuscript does not clearly state the simulation suite(s) (e.g., AbacusSummit or otherwise), box size/resolution, halo finder and tree builder, whether baryonic physics is present, the selection of root halos (mass/redshift), node pruning/cuts, or snapshot/redshift sampling. Most critically, it does not give the ranges/priors and sampling strategy for $\Omega_m$ and $\sigma_8$, or how many distinct cosmologies are represented. Without these, the difficulty and generality of the task (broad regression vs. narrow interpolation) cannot be assessed, and the reported metrics cannot be placed in context.
Recommendation: Add a dedicated dataset subsection in Sec. 2.1 that includes: (i) simulation suite name(s), box size, mass resolution, number of boxes, and physics (DMO vs hydro); (ii) halo finder + merger-tree algorithm and any key settings; (iii) explicit ranges/priors for $\Omega_m$ and $\sigma_8$, number of unique cosmologies, and trees per cosmology; (iv) how root halos are selected (e.g., $z=0$ roots, root-mass range), node-level cuts (minimum progenitor mass, treatment of disrupted halos), and snapshot times/redshifts used (linking “scale factor” to node time); and (v) summary statistics/histograms of target distributions and tree sizes across train/val/test.
-
Potential information leakage due to splitting “by tree” rather than “by cosmology/simulation” is not addressed, and could substantially inflate test performance (Sec. 2.3.2, Sec. 3.1). If multiple trees share the same underlying cosmology (common in simulation suites), then random tree-level splits allow the model to implicitly learn cosmology-specific artifacts and generalize only within-cosmology rather than to unseen cosmologies.
Recommendation: Define and implement leakage-safe splits. At minimum, report results for one of the following, and ideally both: (i) a split by cosmology (all trees from a given cosmology assigned to a single split); (ii) a split by simulation box/realization, if multiple realizations per cosmology exist. In Sec. 2.3.2 describe the split unit explicitly (tree vs cosmology vs box), and in Sec. 3.1 report performance for both the original and leakage-safe splits (with identical metrics), discussing any gap.
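To make the requested split concrete, a cosmology-level assignment could look like the minimal sketch below. It assumes each tree carries a cosmology identifier; the names `tree_cosmo_ids` and `split_by_cosmology`, the 70/15/15 fractions, and the toy layout of 20 cosmologies with 5 trees each are hypothetical and not taken from the manuscript.

```python
import random
from collections import defaultdict

def split_by_cosmology(tree_cosmo_ids, fractions=(0.7, 0.15, 0.15), seed=0):
    """Assign whole cosmologies to train/val/test and return tree indices per split."""
    # Group tree indices by their (assumed) cosmology identifier.
    groups = defaultdict(list)
    for idx, cid in enumerate(tree_cosmo_ids):
        groups[cid].append(idx)

    # Shuffle the cosmologies, not the trees, so no cosmology straddles two splits.
    cosmo_ids = sorted(groups)
    random.Random(seed).shuffle(cosmo_ids)

    n = len(cosmo_ids)
    n_train = int(round(fractions[0] * n))
    n_val = int(round(fractions[1] * n))
    parts = {
        "train": cosmo_ids[:n_train],
        "val": cosmo_ids[n_train:n_train + n_val],
        "test": cosmo_ids[n_train + n_val:],
    }
    return {name: sorted(i for c in cids for i in groups[c])
            for name, cids in parts.items()}

# Toy layout (assumption): 20 cosmologies with 5 trees each.
tree_cosmo_ids = [i // 5 for i in range(100)]
splits = split_by_cosmology(tree_cosmo_ids)
```

The same function would serve for a split by box/realization if the identifier passed in labels the box rather than the cosmology.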
-
Claims of improvement over “traditional summary-statistic methods” and the necessity of the TTN inductive bias are not supported because no baselines are evaluated (Abstract, Sec. 1, Sec. 3.1, Sec. 4.2–4.4). Without comparisons, it is unclear whether the TTN is outperforming simpler alternatives (e.g., root-only features, summary-statistics regressors, DeepSets, or standard GNNs).
Recommendation: Add baseline experiments trained/evaluated on identical splits (especially the leakage-safe split) and report results in a table in Sec. 3.1: (i) root-only regressor (MLP on root node features); (ii) summary-statistics regressor (linear/RandomForest/MLP) using interpretable tree summaries (main-branch mass assembly, formation time, major/minor merger counts, progenitor-mass moments vs scale factor, node count); (iii) topology-aware baseline such as a message-passing GNN (GraphSAGE/GIN) using the same node features and edges; and optionally (iv) a DeepSets “bag-of-nodes” model to test whether topology matters. Update Abstract/Sec. 4 claims accordingly (quantify gains or soften claims if comparable).
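As an illustration of the inputs a summary-statistics baseline might use, the sketch below extracts a few interpretable quantities from a single tree. Everything here is an assumption made for illustration: the dictionary layout (`progenitors`, `mass`, `scale`), the choice of the most massive progenitor as the main branch, the half-mass definition of formation time, and the 1:3 mass-ratio threshold for major mergers. Only mergers onto the main branch are counted.

```python
def tree_summaries(root, progenitors, mass, scale, major_ratio=1.0 / 3.0):
    """Compute simple per-tree summaries by walking the main branch from the root."""
    branch = [root]
    n_major = 0
    n_minor = 0
    node = root
    while progenitors.get(node):
        # Main progenitor = most massive progenitor (assumed convention).
        kids = sorted(progenitors[node], key=lambda k: mass[k], reverse=True)
        main = kids[0]
        for other in kids[1:]:
            if mass[other] / mass[main] >= major_ratio:
                n_major += 1
            else:
                n_minor += 1
        branch.append(main)
        node = main

    # Formation time: earliest scale factor at which the main branch
    # holds at least half of the root mass.
    half = 0.5 * mass[root]
    a_form = min(scale[n] for n in branch if mass[n] >= half)

    return {
        "n_nodes": len(mass),
        "main_branch_len": len(branch),
        "a_form": a_form,
        "n_major": n_major,
        "n_minor": n_minor,
    }

# Toy tree (assumption): node 0 is the root; keys map a halo to its progenitors.
progenitors = {0: [1, 2], 1: [3, 4]}
mass = {0: 100.0, 1: 60.0, 2: 30.0, 3: 40.0, 4: 5.0}
scale = {0: 1.0, 1: 0.8, 2: 0.8, 3: 0.5, 4: 0.5}
summary = tree_summaries(0, progenitors, mass, scale)
```

A feature vector of this kind, one row per tree, could then be passed to any of the linear, RandomForest, or MLP regressors named above.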
-
Method and implementation details are insufficient for reproducibility and for assessing capacity/overfitting risk (Sec. 2.2–2.3). Key missing items include: exact embedding MLP and output head architecture (layer widths, activations, normalizations), final hyperparameters ($d_{\rm embed}$, $d_{\rm bond}$, max_children), optimizer settings, LR schedule, batch size, number of epochs, early stopping criteria, regularization, initialization of basis tensors, parameter count, and how batching works with variable tree shapes.
Recommendation: Expand Sec. 2.2–2.3 with a reproducibility checklist: (i) explicit layer-by-layer definitions for NN_embed and the output head; (ii) final hyperparameter values used for Sec. 3.1 results (and what was tuned); (iii) optimizer (e.g., Adam/optax) with full hyperparameters, LR schedule, batch size, epochs, early stopping; (iv) initialization for each $T_k$ and network weights; (v) total trainable parameter count (broken down into embedding, tensors, head) and training/validation curves; and (vi) a public code link or at least pseudocode for the full training loop and TTN construction.
-
Tree definition and TTN construction contain ambiguities (rooting, edge direction, “children” meaning, ordering/permutation invariance, max_children handling) that are central to correctness (Sec. 2.1.1, Sec. 2.1.4, Sec. 2.2.2–2.2.3). Merger trees are time-directed DAGs; depending on convention, “parent/child” can denote either progenitor/descendant or the reverse. Also, if child ordering is deterministic (e.g., sorted by mass or time), the ordering itself encodes information the model can exploit; if it is arbitrary, predictions may depend on that arbitrary order unless the model is permutation invariant.
Recommendation: In Sec. 2.1.4 and Sec. 2.2.2–2.2.3: (i) define edge_index direction explicitly (progenitor$\rightarrow$descendant or reverse) and map it to TTN parent/child roles; (ii) define the unique root (e.g., $z=0$ descendant) and confirm all graphs are connected acyclic trees after preprocessing; (iii) specify how children are ordered and whether the model is intended to be permutation invariant—if invariance is desired, enforce it (e.g., symmetric tensor constraints, commutative pooling, or randomized child order during training); (iv) report how nodes with arity $> {\rm max\_children}$ are handled (cap/merge/prune) and how arities $< {\rm max\_children}$ select tensors (one tensor per exact arity vs masking/shared parameters); and (v) add a schematic figure showing a small merger tree mapped to tensors and contraction order, with index labels.
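The ordering concern in item (iii) can be checked with a test of the following kind. This is a toy sketch, not the authors' TTN: `ordered_combine` is a hypothetical stand-in for a node tensor with distinguishable child slots, `symmetric_combine` stands in for commutative pooling, and the `children` map is assumed to point from a descendant to its progenitors.

```python
import random

def shuffle_children(children, seed=0):
    """Return a copy of the tree with each node's child list randomly reordered."""
    rng = random.Random(seed)
    out = {}
    for node, kids in children.items():
        kids = list(kids)
        rng.shuffle(kids)
        out[node] = kids
    return out

def contract(node, children, feat, combine):
    """Bottom-up pass: combine a node's feature with the messages from its children."""
    msgs = [contract(k, children, feat, combine) for k in children.get(node, [])]
    return combine(feat[node], msgs)

def ordered_combine(x, msgs):
    # Slot k is weighted by (k + 1), so the result depends on child order.
    return x + sum((k + 1) * m for k, m in enumerate(msgs))

def symmetric_combine(x, msgs):
    # Plain sum over children: invariant under any reordering.
    return x + sum(msgs)

# Toy tree (assumption): node 0 is the root.
children = {0: [1, 2], 1: [3, 4]}
feat = {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}
reversed_children = {n: k[::-1] for n, k in children.items()}

ordered_a = contract(0, children, feat, ordered_combine)
ordered_b = contract(0, reversed_children, feat, ordered_combine)
symmetric_a = contract(0, children, feat, symmetric_combine)
symmetric_b = contract(0, reversed_children, feat, symmetric_combine)
```

Here `ordered_a` and `ordered_b` differ while `symmetric_a` and `symmetric_b` agree. Running the analogous comparison on the trained model, or applying `shuffle_children` as a training-time augmentation, would show whether the reported predictions depend on child order.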
-
Evaluation lacks robustness, uncertainty quantification, and cosmology-specific diagnostics; point-estimate MSE alone is not enough for cosmological inference, especially given known $\Omega_m$–$\sigma_8$ degeneracies (Sec. 3.1, Sec. 4.2). Only a single split/seed result is shown; residual structure (bias vs parameter value, heteroskedasticity vs tree size/root mass) and prediction covariance are not characterized.
Recommendation: Augment Sec. 3.1 with: (i) multiple random seeds and (if feasible) multiple splits, reporting mean$\pm$std for MSE, MAE, and $R^2$; (ii) bootstrap confidence intervals on metrics; (iii) residual plots binned by true $\Omega_m$ and $\sigma_8$ (quantify regression-to-the-mean) and by tree size/root mass; (iv) report the 2D error covariance (or correlation) of prediction errors to assess degeneracy directions; and (v) consider a simple uncertainty-aware head (e.g., heteroscedastic Gaussian regression) or ensembling to provide predictive uncertainties and basic calibration checks.
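Items (ii) and (iv) need little code. The sketch below computes a percentile bootstrap interval for $R^2$ and the correlation between the $\Omega_m$ and $\sigma_8$ prediction errors. The data are synthetic stand-ins generated inside the block, and the function names are hypothetical. If the test set contains several trees per cosmology, resampling should be done over cosmologies rather than over individual trees, which this simplified version does not do.

```python
import random

def r2_score(y_true, y_pred):
    """Coefficient of determination."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a metric, resampling test examples."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

# Synthetic stand-in data (assumption): anti-correlated errors in the two parameters.
rng = random.Random(1)
om_true = [rng.uniform(0.2, 0.4) for _ in range(200)]
err_om = [rng.gauss(0.0, 0.01) for _ in range(200)]
err_s8 = [-0.5 * e + rng.gauss(0.0, 0.002) for e in err_om]
om_pred = [t + e for t, e in zip(om_true, err_om)]

lo, hi = bootstrap_ci(om_true, om_pred, r2_score)
rho = pearson(err_om, err_s8)
```

A strongly non-zero `rho` on the real test set would indicate that the model's errors lie along a degeneracy direction rather than being independent in the two parameters.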
-
Interpretability claims are currently qualitative and in places methodologically weak: embedding weight magnitudes are not reliable feature-importance measures for MLPs, and tensor norms can scale with tensor order/initialization rather than learned physical meaning (Sec. 2.4.2, Sec. 3.2.1–3.2.3, Sec. 4.3). The analysis also does not clearly state sample sizes or selection criteria for case studies.
Recommendation: Strengthen Sec. 3.2 with quantitative, dataset-level attribution: (i) permutation importance and/or integrated gradients across the full model for the four node features; (ii) leave-one-feature-out or retrain ablations (drop scale factor, drop mass, etc.) reporting performance deltas; (iii) topology vs features tests (randomize child order; shuffle topology while keeping node features; swap subtrees between trees) to isolate what information the TTN uses; (iv) controlled merger masking by mass ratio to quantify the impact of major vs minor mergers on predictions (report distributions of $|\Delta\Omega_m|,\ |\Delta\sigma_8|$); and (v) clearly state how many trees are analyzed in each interpretability plot and how they are selected (random vs high-error vs representative).
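For item (i), permutation importance can be computed along the lines of the sketch below. It operates on a flat feature table with a toy `predict` function as a stand-in; for the TTN, the permutation would instead shuffle one node feature across nodes (or across trees) before the forward pass. All names and the toy data are hypothetical.

```python
import random

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in MSE when one feature column is shuffled across examples."""
    rng = random.Random(seed)
    base = mse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        deltas = []
        for _ in range(n_repeats):
            # Shuffle column j only, leaving every other column untouched.
            col = [row[j] for row in X]
            rng.shuffle(col)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
            deltas.append(mse(y, [predict(r) for r in X_perm]) - base)
        importances.append(sum(deltas) / n_repeats)
    return importances

# Toy stand-in (assumption): the target depends on feature 0 only.
data_rng = random.Random(2)
X = [[data_rng.random(), data_rng.random()] for _ in range(50)]
y = [2.0 * row[0] for row in X]

def predict(row):
    return 2.0 * row[0]

importances = permutation_importance(predict, X, y)
```

In this toy case the unused feature receives zero importance and the used one a positive value. Unlike embedding weight magnitudes, the score is measured on the model's output, so it reflects what the full model actually uses.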