-
The TT/QITT tensorization and decomposition description is internally inconsistent between Methods and Results, making the core contribution difficult to reproduce and interpret (Sec. 2.5.1 vs. Sec. 3.2). Sec. 2.5.1 describes flattening $60 \times 74 = 4440$ then reshaping into a 6-mode tensor $(2,2,2,3,5,37)$ (i.e., quantized-style tensorization), while Sec. 3.2 describes reshaping $(60,74)$ to a 3rd-order tensor $(60,2,37)$ and applying a 3-core TT with ranks $(1,2,2,1)$ to obtain 202 features. These are not equivalent constructions and generally yield different cores/features; the reported 202D arithmetic matches the $(60,2,37)$ pathway, not the 6-mode one. It is also unclear which TT algorithm is used (TT-SVD vs. ALS/optimization), whether decomposition is performed independently per tree, and what is meant by “QITT” vs. standard TT in this implementation.
Recommendation: Unify Sec. 2.5.1–2.5.2 and Sec. 3.2 around one definitive procedure that matches the reported experiments. Explicitly specify: (i) the exact tensor shape used in all results (e.g., $(60,2,37)$ or $(2,2,2,3,5,37)$), (ii) the number of TT cores, (iii) the TT algorithm/implementation (e.g., TT-SVD with a fixed rank tuple; library and version), (iv) whether TT is computed per-sample independently, and (v) why this is termed QITT (if it is not quantized TT in the standard sense, consider renaming to TT). Add a short pseudocode/algorithm box and ensure the stated 202D feature construction follows from the finalized pipeline.
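To illustrate why the two tensorizations are not interchangeable, the following is a minimal sketch of a fixed-rank TT-SVD in NumPy. It is not the authors' implementation: the helper names `tt_svd` and `n_features`, the random placeholder data, and the all-2 internal ranks used for the 6-mode case are assumptions made only for this comparison.

```python
import numpy as np

def tt_svd(tensor, ranks):
    """Sequential truncated-SVD tensor-train decomposition (TT-SVD).

    `ranks` is the full tuple (r_0, ..., r_d) with r_0 = r_d = 1.
    Returns a list of cores with shapes (r_{k-1}, n_k, r_k).
    """
    shape = tensor.shape
    d = len(shape)
    cores, r_prev = [], 1
    rest = np.asarray(tensor, dtype=float).reshape(shape[0], -1)
    for k in range(d - 1):
        u, s, vt = np.linalg.svd(rest, full_matrices=False)
        r = min(ranks[k + 1], s.size)  # truncate to the requested rank
        cores.append(u[:, :r].reshape(r_prev, shape[k], r))
        # Carry the remaining factor forward and unfold along the next mode.
        rest = (s[:r, None] * vt[:r]).reshape(r * shape[k + 1], -1)
        r_prev = r
    cores.append(rest.reshape(r_prev, shape[-1], 1))
    return cores

def n_features(cores):
    """Number of entries when all cores are flattened and concatenated."""
    return sum(core.size for core in cores)

rng = np.random.default_rng(0)
x = rng.standard_normal((60, 74))  # placeholder for one tree's (60, 74) matrix

# Pathway described in Sec. 3.2: 3rd-order tensor, ranks (1, 2, 2, 1).
cores_3 = tt_svd(x.reshape(60, 2, 37), ranks=(1, 2, 2, 1))
# Pathway described in Sec. 2.5.1: 6-mode tensor; internal ranks of 2 are assumed.
cores_6 = tt_svd(x.reshape(2, 2, 2, 3, 5, 37), ranks=(1, 2, 2, 2, 2, 2, 1))

print([c.shape for c in cores_3], n_features(cores_3))
print([c.shape for c in cores_6], n_features(cores_6))
```

Under these assumptions the 3-mode pathway yields $120 + 8 + 74 = 202$ entries, whereas the 6-mode pathway with all internal ranks equal to 2 yields $4 + 8 + 8 + 12 + 20 + 74 = 126$, so the reported 202D feature vector cannot come from the 6-mode construction at that rank.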
-
Substructure identification (“significant substructures”) is under-specified and inconsistent across sections, and truncation to $\max_{N_{\rm sub}} = 60$ may dominate what information is retained (Sec. 2.2.1 vs. Sec. 3.1; Sec. 2.4.2). Sec. 2.2.1 refers to selecting the top 10% of mass ratios (and mentions concentration/$V_{\rm max}$ deviations), while Sec. 3.1 reports an adaptive 20th-percentile threshold on $\log_{10}(M_{\rm sub}/M_{\rm main})$. The treatment of overlapping/nested branches, multiple triggers, and the exact ordering of substructures before truncation are unclear. Since $N_{\rm sub}$ ranges from 2 to 563 (Sec. 3.1), truncation without a clearly defined ordering/selection rule could discard most substructures in complex trees and bias the representation toward specific epochs or merger ratios.
Recommendation: In Sec. 2.2.1 provide a single, precise algorithmic definition used for all experiments: define the mass-ratio statistic and its percentile direction/value (10% vs 20th percentile), specify numeric thresholds and computation for concentration/$V_{\rm max}$ deviations (or remove if unused), define how branches are segmented, and how overlaps/nesting are resolved. In Sec. 2.4.2 explicitly define the truncation policy (which 60 are kept and in what order—e.g., highest mass ratio, earliest, latest, longest-lived). Add a sensitivity study varying (a) the percentile threshold (e.g., 10/20/30%) and (b) $\max_{N_{\rm sub}}$ (e.g., 20/40/60/100), reporting both substructure-count statistics and $\Omega_{\rm m}/\sigma_8$ performance.
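The ambiguity in percentile direction and truncation order is easy to see numerically. The sketch below uses synthetic $\log_{10}(M_{\rm sub}/M_{\rm main})$ values; the distribution, the sample size of 500, and the helper `truncate_by_key` are hypothetical and serve only to show how different readings of the text retain different substructures.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic log10(M_sub / M_main) values for 500 candidate branches (placeholder).
log_ratio = rng.normal(loc=-2.0, scale=0.7, size=500)

# Reading A (Sec. 2.2.1): keep the top 10% of mass ratios.
top_10 = log_ratio >= np.percentile(log_ratio, 90)
# Reading B (Sec. 3.1): keep everything above the 20th percentile.
above_20 = log_ratio >= np.percentile(log_ratio, 20)
# Reading C (Sec. 3.1, other direction): keep everything below the 20th percentile.
below_20 = log_ratio <= np.percentile(log_ratio, 20)

print(top_10.sum(), above_20.sum(), below_20.sum())

def truncate_by_key(indices, key, max_sub=60):
    """Keep at most `max_sub` indices, ordered by descending `key`."""
    order = np.argsort(-key[indices], kind="stable")
    return indices[order][:max_sub]

# One possible truncation policy: keep the 60 highest mass ratios.
kept = truncate_by_key(np.flatnonzero(above_20), log_ratio, max_sub=60)
```

On this synthetic sample the three readings retain 50, 400, and 100 branches respectively, and only the first fits within 60 slots without truncation, which is why the selection rule and the truncation order need to be stated together.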
-
Padding with a “null substructure” produced by running the pretrained GraphSAGE encoder on a single-node graph with average features may introduce a non-neutral, systematic signature correlated with the amount of padding (Sec. 2.4.2, Sec. 3.1). This can inadvertently encode (via the fraction of padded slots) a proxy for substructure count/tree complexity, and TT compression may exploit this in ways that are hard to interpret as representing physical content rather than missingness.
Recommendation: Ablate padding strategies in Sec. 2.4.2 / Sec. 3.3: compare (i) all-zero 74D padding, (ii) the current single-node-graph embedding padding, (iii) a learned padding token (trained downstream), and/or (iv) an explicit mask feature per substructure (append an is_pad bit or provide a mask to the model). Quantify whether prediction error correlates with padding fraction, and report the effect on performance—especially for trees with $N_{\rm sub} \ll 60$.
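As an illustration of options (i) and (iv) and of the requested diagnostic, here is a minimal sketch assuming each tree is given as an array of shape $(N_{\rm sub}, 74)$. The function names and the use of a Pearson correlation are assumptions for this example, not a description of the authors' pipeline.

```python
import numpy as np

MAX_SUB, DIM = 60, 74  # values quoted in the manuscript

def pad_with_mask(sub_feats, max_sub=MAX_SUB):
    """Truncate or zero-pad to (max_sub, dim + 1); the last column is an is_pad bit."""
    sub_feats = np.asarray(sub_feats, dtype=float)[:max_sub]
    n, dim = sub_feats.shape
    out = np.zeros((max_sub, dim + 1))
    out[:n, :dim] = sub_feats  # real substructures keep is_pad = 0
    out[n:, dim] = 1.0         # padded slots: zero features, is_pad = 1
    return out

def padding_fraction(n_sub, max_sub=MAX_SUB):
    """Fraction of the max_sub slots that are padding for a tree with n_sub substructures."""
    return max(0, max_sub - n_sub) / max_sub

def error_vs_padding(abs_err, n_subs, max_sub=MAX_SUB):
    """Pearson correlation between per-tree absolute error and padding fraction."""
    frac = np.array([padding_fraction(n, max_sub) for n in n_subs])
    return float(np.corrcoef(frac, np.asarray(abs_err, dtype=float))[0, 1])

padded = pad_with_mask(np.ones((2, DIM)))
print(padded.shape, padded[:, -1].sum(), padding_fraction(2))
```

Note that appending the mask bit changes the per-substructure width from 74 to 75, so the downstream tensor shape and TT factorization would have to be updated accordingly.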
-
GraphSAGE autoencoder architecture/training details are inconsistent and incomplete, and potential representation-learning leakage is not ruled out (Sec. 2.3.2 vs. Sec. 3.1). The paper alternates between three GraphSAGE layers and two SAGEConv layers, and does not fully specify hidden sizes, activations, decoder, loss definition (node-feature reconstruction only vs. adjacency/topology), training schedule, validation/early stopping, or regularization. The text also alternates between “a large corpus of generated graphs” and “33,759 substructures from the training set,” leaving ambiguity about what data were used and whether any substructures from validation/test simulations were included (transductive leakage).
Recommendation: Standardize the GNN description between Sec. 2.3.2 and Sec. 3.1 and add a complete specification: encoder/decoder architecture (layer types, widths), embedding dimension, pooling, losses and targets (what is reconstructed), optimizer, LR, batch size, epochs, early stopping/validation, regularization. State explicitly that pretraining uses substructures from training simulations only (or clearly justify otherwise) and that embeddings are frozen for downstream regression unless fine-tuning is performed (in which case describe the protocol). Include a small ablation on embedding dimension and/or a non-topological control to verify the embedding adds information beyond smoothed node attributes.
-
The baseline set and ablations do not yet isolate the value added by (i) TT/QITT compression vs. generic regularization/compression, and (ii) learned topology vs. physical features; some announced baselines (e.g., graphlet counts) are under-specified or not fully reported (Sec. 2.7.1, Sec. 3.3). The key comparison baseline, “flattened 4440D $\rightarrow$ Linear Regression”, is a weak reference point: unregularized linear regression is known to overfit in high dimensions with limited samples, so beating it does not demonstrate a TT-specific benefit. Stronger baselines such as Ridge/ElasticNet and PCA/PLS are needed. Additionally, there is no clear ‘physical-only QITT’ variant to quantify what the GraphSAGE embeddings contribute once TT is applied.
Recommendation: Expand Sec. 3.3 with targeted ablations/baselines: (i) QITT/TT on physical-only tensors $((60, N_{\rm phys}) \rightarrow \text{TT} \rightarrow \text{features})$ vs topology-only vs combined; (ii) Ridge/ElasticNet on flattened 4440D features with CV; (iii) PCA (or PLS) to 202D followed by the same regressors as QITT; (iv) concatenate global aggregate features with QITT features to test complementarity; (v) fully specify and report graphlet baseline metrics (Sec. 2.7.1 and Sec. 3.3) or remove it if not executed. This will make claims about QITT and learned topology much more defensible.
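Baselines (ii) and (iii) are inexpensive to run. The sketch below implements ridge regression (in dual form, which is convenient when samples are far fewer than features) and PCA to 202 components in plain NumPy. The sample counts, the random placeholder data, the two-column target, and the value of `alpha` are assumptions; in practice `alpha` would be chosen by cross-validation on the training simulations.

```python
import numpy as np

def pca_fit_transform(x_train, x_test, n_components):
    """Fit PCA on the training set only and project both sets."""
    mean = x_train.mean(axis=0)
    _, _, vt = np.linalg.svd(x_train - mean, full_matrices=False)
    comps = vt[:n_components]
    return (x_train - mean) @ comps.T, (x_test - mean) @ comps.T

def ridge_fit_predict(x_train, y_train, x_test, alpha):
    """Ridge regression with an unpenalized intercept, solved in dual form."""
    x_mean, y_mean = x_train.mean(axis=0), y_train.mean(axis=0)
    xc = x_train - x_mean
    gram = xc @ xc.T  # (n_samples, n_samples), cheap when n_samples << n_features
    dual = np.linalg.solve(gram + alpha * np.eye(gram.shape[0]), y_train - y_mean)
    return (x_test - x_mean) @ xc.T @ dual + y_mean

rng = np.random.default_rng(0)
x_train = rng.standard_normal((250, 4440))  # placeholder flattened features
x_test = rng.standard_normal((50, 4440))
y_train = rng.standard_normal((250, 2))     # placeholder (Omega_m, sigma_8) targets

# Baseline (ii): ridge directly on the flattened 4440D features.
y_pred_flat = ridge_fit_predict(x_train, y_train, x_test, alpha=10.0)
# Baseline (iii): PCA to 202D, then the same regressor.
z_train, z_test = pca_fit_transform(x_train, x_test, n_components=202)
y_pred_pca = ridge_fit_predict(z_train, y_train, z_test, alpha=10.0)
```

Matching the PCA output dimension to the 202D TT feature vector keeps the downstream regressors comparable, so any remaining gap can be attributed to the compression method rather than to the feature count.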
-
Statistical significance testing likely overstates the evidence because it treats the 150 test trees as i.i.d., despite clustering by simulation (6 test simulations $\times$ 25 trees) (Sec. 2.8, Sec. 3.3.5). Paired t-tests on per-tree squared errors amount to pseudo-replication if errors are correlated within a simulation, which inflates apparent significance. This is particularly important for claims that QITT_XGBoost significantly outperforms other baselines.
Recommendation: Redo significance testing with simulation-aware blocks: e.g., compute per-simulation mean error ($n=6$ paired points) and run paired tests on those, or use a clustered bootstrap/permutation test that resamples at the simulation level. Report both per-tree and per-simulation aggregated performance (mean$\pm$std across simulations) to assess robustness.
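One simulation-aware option is an exact sign-flip permutation test on per-simulation mean errors, sketched below. The function names and the synthetic errors are assumptions for illustration; the test statistic (mean paired difference) is one reasonable choice among several.

```python
import itertools
import numpy as np

def per_simulation_means(sq_err, sim_ids):
    """Average per-tree squared errors within each simulation."""
    sq_err, sim_ids = np.asarray(sq_err, dtype=float), np.asarray(sim_ids)
    return np.array([sq_err[sim_ids == s].mean() for s in np.unique(sim_ids)])

def paired_signflip_pvalue(err_a, err_b, sim_ids):
    """Exact two-sided sign-flip test on per-simulation mean differences."""
    diff = per_simulation_means(err_a, sim_ids) - per_simulation_means(err_b, sim_ids)
    observed = abs(diff.mean())
    # Enumerate all 2^n sign assignments (64 for n = 6 simulations).
    signs = np.array(list(itertools.product([-1.0, 1.0], repeat=diff.size)))
    null = np.abs((signs * diff).mean(axis=1))
    return float(np.mean(null >= observed - 1e-12))

# Synthetic example: 6 simulations x 25 trees, model A uniformly worse than B.
sim_ids = np.repeat(np.arange(6), 25)
err_b = np.linspace(0.1, 1.0, 150)
err_a = err_b + 0.5
p_value = paired_signflip_pvalue(err_a, err_b, sim_ids)
print(p_value)
```

With only six simulations there are $2^6 = 64$ sign assignments, so the smallest attainable two-sided p-value is $2/64 \approx 0.031$, which the synthetic example above reaches. Any claim of significance far below that level cannot be supported at the simulation level.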
-
Core feature dimensionalities are inconsistent, which propagates into tensor sizes and TT setup (Sec. 2.3.1, Sec. 3.1, Sec. 2.4.1). Sec. 2.3.1 claims a 10D physical feature vector, but the enumerated components can be read as 12D (mass ratio, merger scale factor, two property differences, and mean+std over four properties). Sec. 3.1 additionally reports ‘num_halos_in_branch’, creating further ambiguity about whether this is an input feature or only descriptive. Since $74=10+64$ and $4440=60\times 74$ are used throughout, this needs to be exact.
Recommendation: In Sec. 2.3.1 provide an explicit ordered list of the physical features actually used as model inputs, with an unambiguous count, and clarify whether ‘num_halos_in_branch’ (Sec. 3.1) is included in the tensor or only for reporting. Update all dependent dimensionality statements in Sec. 2.4–2.5 and all related figures/tables to match the true input dimension.
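The count under the 12D reading can be made explicit with a short tally. The feature names below are hypothetical labels for the components listed in Sec. 2.3.1, not the authors' identifiers.

```python
# Hypothetical labels for the components enumerated in Sec. 2.3.1.
physical_components = {
    "mass_ratio": 1,
    "merger_scale_factor": 1,
    "property_differences": 2,
    "mean_over_four_properties": 4,
    "std_over_four_properties": 4,
}

n_phys = sum(physical_components.values())  # count under the 12D reading
embed_dim = 64                              # GraphSAGE embedding width implied by 74 = 10 + 64
max_sub = 60

per_substructure = n_phys + embed_dim
flattened = max_sub * per_substructure
print(n_phys, per_substructure, flattened)
```

Under this reading the per-substructure width would be 76 rather than 74 and the flattened size 4560 rather than 4440, so the $(60, 2, 37)$ reshape would no longer apply ($76 = 2 \times 38$). Including ‘num_halos_in_branch’ as an input would shift these numbers again.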