-
Potential label leakage / ill-defined inference setting due to target-conditioned (“parameterized”) UMAP. Sec. 2.2 describes an embedding conditioned on the target cosmological parameters using UMAP’s supervised/target mechanism (e.g., `target_weight`, `target_metric`). But at inference time the cosmological parameters are unknown (they are what the model must predict). It is currently unclear whether, during `.transform()` on validation/test nodes, the true $(\Omega_m,\sigma_8)$ values are provided to UMAP (which would leak labels into the representation), or whether a different procedure is used (which could materially change results and the “global consistency across cosmologies” claim in Sec. 3.4 and Sec. 4).
Recommendation: Make the information flow explicit with a step-by-step description (training vs. validation/test) in Sec. 2.2 and Sec. 3.1: (i) confirm UMAP is fit only on training nodes; (ii) state exactly what is passed as $y$/targets during `.fit()` and during `.transform()` for val/test. If true targets are used at val/test, redo the evaluation without target access (to match the intended inference task). Include an ablation with unsupervised UMAP (no target conditioning) and report the performance gap (Sec. 3.2–3.3). If the goal is instead interpolation within a labeled grid (i.e., supervised dimensionality reduction), reframe claims accordingly and clearly state that the embedding uses cosmology labels.
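To illustrate the kind of explicit information flow requested here, the following is a minimal sketch of a guard around a generic fit/transform embedder. It is not the authors' pipeline and does not call UMAP: `TrainMeanEmbedder` is a hypothetical stand-in for the supervised embedder, and `LeakGuard` is an illustrative name. The only point is that targets are accepted at fit time, refused at transform time, and that every call is logged so the protocol can be reported.

```python
class TrainMeanEmbedder:
    """Hypothetical stand-in for a supervised embedder (NOT UMAP itself)."""

    def fit(self, X, y=None):
        n, d = len(X), len(X[0])
        # Statistics are computed from the training nodes only.
        self.mean_ = [sum(row[j] for row in X) / n for j in range(d)]
        self.used_targets_ = y is not None
        return self

    def transform(self, X):
        # Reuses training statistics; never sees targets.
        return [[v - m for v, m in zip(row, self.mean_)] for row in X]


class LeakGuard:
    """Wraps an embedder, logs each call, and refuses targets at transform."""

    def __init__(self, embedder):
        self.embedder = embedder
        self.log = []  # entries: (stage, number of rows, targets supplied?)

    def fit(self, X_train, y_train):
        self.log.append(("fit", len(X_train), True))
        self.embedder.fit(X_train, y_train)
        return self

    def transform(self, X, y=None):
        if y is not None:
            raise ValueError("targets must not be supplied at transform time")
        self.log.append(("transform", len(X), False))
        return self.embedder.transform(X)


# Toy data: two training nodes with (Omega_m, sigma_8) targets, one test node.
X_train = [[0.0, 1.0], [2.0, 3.0]]
y_train = [[0.30, 0.80], [0.32, 0.82]]
X_test = [[1.0, 2.0]]

guard = LeakGuard(TrainMeanEmbedder()).fit(X_train, y_train)
Z_test = guard.transform(X_test)  # no targets passed
```

Reporting the equivalent of `guard.log` for the real pipeline (which split, how many nodes, whether targets were supplied) would resolve the ambiguity directly.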
-
Methods contain numerous unresolved placeholders and inconsistencies, preventing reproducibility. Sec. 2.1.1–2.1.2, Sec. 2.2–2.5 and Tables 1–4 include “[Value]”, “[e.g., 8]”, “[e.g., Gaussian]”, etc., while later Results tables (Sec. 3) appear more concrete. As written, readers cannot tell which settings produced the reported results (e.g., final $D_{\rm embed}$, grid resolution $d_k$, kernel/bandwidth choice, TT ranks/regularization, baseline hyperparameters).
Recommendation: Replace all placeholders in Sec. 2.1–2.5 and fully populate (or remove) Tables 1–4. Add a single consolidated “Experiment configuration” table listing: dataset split, $D_{\rm embed}$, UMAP hyperparameters (`n_neighbors`, `min_dist`, `metric`, `target_weight`, target scaling), KDE kernel + bandwidth selection rule, grid resolution per dimension, TT ranks and regularization strength(s), optimizer/ALS settings, and baseline hyperparameters. Ensure Methods and Results refer to the same authoritative configuration.
-
KDE tensorization feasibility and exact tensor dimensionality are unclear (and may be intractable as described). With $D_{\rm embed}=8$ (Sec. 3.4) and, e.g., $d_k\approx 6$ bins per dimension (Sec. 2.3), the grid has $6^8\approx 1.68$ million cells per tree. Stored densely in float64, that is about 13 MB per tree and roughly 13 GB for $1000$ trees, and the size grows quickly with finer grids; evaluating the KDE on every cell is also expensive. The manuscript does not explain memory layout, sparse storage, or how KDE evaluation avoids the full grid cost (Sec. 2.3, Sec. 4).
Recommendation: In Sec. 2.3 and Sec. 3 (new short “Computational cost” subsection), report the exact $D_{\rm embed}$ and $(d_1,\dots,d_D)$ used for the main results; estimate tensor size per tree; state whether tensors are stored densely or sparsely; and provide runtime/memory measurements for (i) UMAP fit/transform, (ii) KDE evaluation per tree, and (iii) TT training. If TT is intended to avoid dense materialization, explain concretely how you compute $\langle W, H_i\rangle$ without explicitly storing all of $H_i$, or clarify that tensors are in fact dense and show it is feasible on the reported hardware.
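As an illustration of what such an explanation could look like: if the KDE uses a product kernel (for example a Gaussian with a diagonal bandwidth), then $H_i$ is an average of rank-one tensors, one per node, and $\langle W, H_i\rangle$ can be computed core by core without ever forming the grid. The sketch below assumes this setting; the function names, the shared scalar bandwidth `h`, and the omission of the kernel normalization constant are my assumptions, not details taken from the manuscript.

```python
import numpy as np


def gaussian_factors(points, grids, h):
    """Per-dimension kernel evaluations: one (N, d_k) array per dimension.

    points: (N, D) embedded nodes of one tree; grids: list of D arrays of bin
    centres; h: scalar bandwidth (assumed). Normalisation constant omitted.
    """
    return [np.exp(-0.5 * ((points[:, [k]] - g[None, :]) / h) ** 2)
            for k, g in enumerate(grids)]


def tt_inner_product(cores, factors):
    """<W, H> for W in TT format and H = mean over nodes of rank-one tensors.

    cores[k] has shape (r_{k-1}, d_k, r_k) with r_0 = r_D = 1.
    Cost is O(N * D * d * r^2) instead of O(d^D).
    """
    n_points = factors[0].shape[0]
    msg = np.ones((n_points, 1))  # left boundary vector per node
    for core, phi in zip(cores, factors):
        # Contract the bin index: (N, d) with (r_in, d, r_out) -> (N, r_in, r_out)
        mats = np.einsum("nd,adb->nab", phi, core)
        # Propagate the running left-to-right product for every node
        msg = np.einsum("na,nab->nb", msg, mats)
    return msg[:, 0].mean()


# Small check against the dense computation (3 dimensions, 4 bins each).
rng = np.random.default_rng(0)
cores = [rng.normal(size=s) for s in [(1, 4, 2), (2, 4, 3), (3, 4, 1)]]
grids = [np.linspace(-1.0, 1.0, 4)] * 3
points = rng.normal(size=(5, 3))
factors = gaussian_factors(points, grids, h=0.5)

W_dense = np.einsum("aib,bjc,ckd->ijk", *cores)
H_dense = np.einsum("ni,nj,nk->ijk", *factors) / len(points)
assert np.isclose(tt_inner_product(cores, factors), (W_dense * H_dense).sum())
```

If the authors' kernel is not separable, or the bandwidth is not diagonal, this shortcut does not apply and the dense cost needs to be reported instead.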
-
“Adaptive KDE” is described ambiguously and appears inconsistent with the cited implementation. The text calls the KDE step “adaptive” (Sec. 2.3, Sec. 2.6.2, Sec. 4) but also references `sklearn.neighbors.KernelDensity`, which is fixed-bandwidth. Bandwidth selection (global vs per-tree vs per-point), whether bandwidths vary across embedding regions, and boundary handling on a finite grid are not specified, even though this representation is central.
Recommendation: Clarify in Sec. 2.3 whether the KDE is (a) fixed-bandwidth, (b) per-tree bandwidth selected by CV, or (c) truly adaptive (e.g., balloon/sample-point bandwidth based on kNN distances). If adaptive, describe the exact rule, hyperparameters, and how it is implemented with (or beyond) scikit-learn. If fixed-bandwidth, remove “adaptive” language throughout and specify the bandwidth selection procedure (search range, CV criterion on training data). Also state whether you store densities or cell masses (density $\times$ cell volume) and how normalization is handled (see also minor issue on KDE normalization).
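To make the distinction between options (a) and (c) concrete, here is a one-dimensional NumPy sketch of a fixed-bandwidth Gaussian KDE, a sample-point adaptive variant with kNN-distance bandwidths, and the conversion from densities to normalized cell masses. This is only an illustration of the terminology: the kNN rule, the bandwidth floor, and the renormalization to unit mass on the finite grid are assumptions of mine, and the authors' actual choices may differ.

```python
import numpy as np


def gaussian_kde_on_grid(samples, centers, bandwidths):
    """Gaussian KDE evaluated at bin centres.

    bandwidths: a scalar (fixed-bandwidth KDE) or one value per sample
    (sample-point adaptive KDE).
    """
    h = np.broadcast_to(np.asarray(bandwidths, dtype=float), samples.shape)
    z = (centers[:, None] - samples[None, :]) / h[None, :]
    kern = np.exp(-0.5 * z ** 2) / (np.sqrt(2.0 * np.pi) * h[None, :])
    return kern.mean(axis=1)


def knn_bandwidths(samples, k, floor=1e-12):
    """Distance from each sample to its k-th nearest neighbour (assumed rule)."""
    dists = np.abs(samples[:, None] - samples[None, :])
    kth = np.sort(dists, axis=1)[:, k]  # column 0 is the zero self-distance
    return np.maximum(kth, floor)       # avoid zero bandwidth for duplicates


def cell_masses(density, cell_width):
    """Density x cell width, renormalised so the grid carries unit mass."""
    mass = density * cell_width
    return mass / mass.sum()


rng = np.random.default_rng(0)
x = rng.normal(size=200)
centers = np.linspace(-4.0, 4.0, 81)
width = centers[1] - centers[0]

fixed = gaussian_kde_on_grid(x, centers, 0.3)
adaptive = gaussian_kde_on_grid(x, centers, knn_bandwidths(x, k=10))
masses = cell_masses(adaptive, width)
```

Whether the stored tensor entries correspond to `fixed`, `adaptive`, or `masses` changes both the normalization and the boundary behaviour, which is why the manuscript should state it explicitly.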
-
Sparse TT regression is under-specified (configuration, optimization, and what sparsity means), limiting reproducibility and weakening interpretability claims. Sec. 2.4 omits TT-rank choices, initialization, the exact regularized objective, and how ALS handles an L1 penalty (ALS subproblems become non-smooth and need a specific solver/prox step). Moreover, L1 sparsity in TT cores does not automatically imply localized sparsity in the reconstructed full weight tensor $W$, so the interpretation “top bins by $|W|$” (Sec. 3.5) needs justification.
Recommendation: Expand Sec. 2.4 with: (i) explicit TT definition with ranks $r_k$; (ii) the full training objective (MSE + regularizer), specifying whether L1 is on cores or on $W$; (iii) the ALS/proximal update used for L1 (or cite a specific sparse TT/MPS regression method and match it); (iv) rank/$\lambda$ selection procedure and validation protocol; and (v) training/validation curves or train–test gaps to assess overfitting with $\sim$800 training trees (Sec. 3.1). For interpretability (Sec. 3.5), add a sanity check: reconstruct representative slices/marginals of $W$ (or compute effective per-bin weights) and show that “important bins” are stable across seeds/splits.
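For concreteness, one possible form of items (i)–(iii) is written out below for a single scalar target $y_i$, with the L1 penalty placed on the cores. This is an assumed formulation offered as a template, not a reconstruction of what the authors did; the symbols $G_k$, $\lambda$, and $\mathcal{S}_\tau$ are my notation.

$$
W(i_1,\dots,i_D) = G_1[i_1]\,G_2[i_2]\cdots G_D[i_D],
\qquad G_k[i_k]\in\mathbb{R}^{r_{k-1}\times r_k},\quad r_0=r_D=1,
$$

$$
\mathcal{L}(G_1,\dots,G_D) = \frac{1}{N}\sum_{i=1}^{N}\bigl(y_i-\langle W,H_i\rangle\bigr)^2
+ \lambda\sum_{k=1}^{D}\lVert G_k\rVert_1 .
$$

With all cores except $G_k$ held fixed, $\langle W,H_i\rangle$ is linear in the entries of $G_k$, so each ALS subproblem is a lasso problem. A proximal-gradient update for it applies entrywise soft-thresholding after a gradient step of size $\eta$ on the squared-error term:

$$
\mathcal{S}_{\tau}(x) = \operatorname{sign}(x)\,\max\bigl(|x|-\tau,\,0\bigr),
\qquad \tau = \eta\lambda .
$$

Note that zeros in $G_k$ produced this way do not translate into zeros of $W$ in any simple pattern, which is the reason for the sanity check requested above.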
-
Baseline comparisons are likely too weak to support broader performance/novelty claims. Sec. 2.5.2 and Sec. 3.3 compare mainly to Random Forest / Gradient Boosting on $20$ aggregate statistics. This does not test whether gains come from (i) UMAP conditioning, (ii) using a distributional/histogram representation, or simply (iii) using a much richer representation than the baselines. It also omits strong set-/graph-based baselines commonly used for irregular astrophysical data.
Recommendation: In Sec. 2.5.2 and Sec. 3.3, add at least one stronger baseline that operates on node-level data under the same cosmology-wise split, e.g.: Deep Sets (pooled MLP), Set Transformer, or a simple GNN on the merger-tree graph (if edges are available). Also add a “fair summary” baseline closer to your representation, e.g., histogram/KDE in the original $4$D feature space (or PCA space) without UMAP, to isolate the contribution of manifold learning. If adding baselines is infeasible, explicitly limit claims to the tested baselines and adjust wording in Sec. 3–4 accordingly.
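As a reference point for how light such a baseline can be, the following NumPy sketch shows the forward pass of a Deep Sets regressor on node-level features: a per-node network, mean pooling over nodes, and a readout network. The weights are random and untrained, the layer sizes are arbitrary, and the $4$ input features and $2$ outputs (for $(\Omega_m,\sigma_8)$) are assumptions based on the setup described in the manuscript; a real baseline would of course be trained with a deep-learning framework.

```python
import numpy as np


def init_mlp(rng, sizes):
    """Random (untrained) weights for a fully connected network."""
    return [(rng.normal(scale=1.0 / np.sqrt(m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]


def mlp(params, x):
    """ReLU MLP with a linear final layer."""
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x


def deep_sets_predict(phi, rho, nodes):
    """nodes: (n_nodes, n_features) for one tree -> prediction vector."""
    pooled = mlp(phi, nodes).mean(axis=0)  # permutation-invariant pooling
    return mlp(rho, pooled)


rng = np.random.default_rng(0)
phi = init_mlp(rng, [4, 16, 16])   # per-node encoder
rho = init_mlp(rng, [16, 16, 2])   # readout to two outputs
tree_nodes = rng.normal(size=(50, 4))

pred = deep_sets_predict(phi, rho, tree_nodes)
pred_shuffled = deep_sets_predict(phi, rho, tree_nodes[rng.permutation(50)])
```

Because the pooling is a mean over nodes, `pred` and `pred_shuffled` agree up to floating-point error, so this baseline sees exactly the same "bag of nodes" information as the proposed pipeline and isolates the effect of the UMAP + KDE + TT representation.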
-
Statistical robustness, uncertainty, and split sensitivity are not evaluated, which is important given only 40 cosmologies. Sec. 3.2–3.3 report single-point MSE/R² on one 32/4/4 cosmology split. With 40 cosmologies, results may vary depending on which cosmologies are held out, and “parameter inference” typically requires uncertainty quantification rather than point prediction alone.
Recommendation: Repeat experiments across multiple random cosmology splits or perform k-fold cross-validation over the 40 $(\Omega_m,\sigma_8)$ pairs (Sec. 3.1–3.3), reporting mean$\pm$std (or confidence intervals) for MSE/R². Add uncertainty estimates for predictions (e.g., bootstrap over trees, ensembling over seeds, or conformal intervals) and report calibration diagnostics if framing as “inference.” Provide predicted-vs-true scatter plots per held-out cosmology to reveal systematic biases (Sec. 3.2–3.3).
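A cosmology-wise k-fold protocol of the kind suggested above can be sketched in a few lines of standard-library Python. The helper names are illustrative, and the choice of 5 folds (giving 32 training and 8 held-out cosmologies per fold) is an assumption; the essential property is that all trees from a given cosmology fall on the same side of every split.

```python
import random
import statistics


def cosmology_kfold(cosmology_ids, n_folds, seed=0):
    """Yield (train_ids, test_ids); each cosmology is held out exactly once."""
    ids = sorted(set(cosmology_ids))
    random.Random(seed).shuffle(ids)
    folds = [ids[i::n_folds] for i in range(n_folds)]
    for held_out in folds:
        test = set(held_out)
        yield [c for c in ids if c not in test], list(held_out)


def summarize(scores):
    """Mean and sample standard deviation across folds."""
    return statistics.mean(scores), statistics.stdev(scores)


# 40 cosmologies with 25 trees each: one cosmology id per tree.
tree_cosmology = [c for c in range(40) for _ in range(25)]

splits = list(cosmology_kfold(tree_cosmology, n_folds=5))
for train_ids, test_ids in splits:
    train_trees = [i for i, c in enumerate(tree_cosmology) if c in set(train_ids)]
    # Fit the whole pipeline (embedding, KDE, regressor) on train_trees only,
    # then score on trees whose cosmology is in test_ids.
```

Per-fold MSE and $R^2$ values can then be passed to `summarize` to obtain the mean$\pm$std figures requested; a validation subset for hyperparameter selection should be carved out of `train_ids` within each fold.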
-
Merger-tree information content vs. “bag of nodes” is unclear, affecting scope and interpretation. The current pipeline appears to ignore explicit parent–child edges/topology after extracting nodes and a scale-factor feature (Sec. 2.1.1, Sec. 2.3). Given the title emphasizes “merger trees,” readers need to know whether the method uses the tree structure or only the multiset of nodes across time, and what is lost by discarding topology.
Recommendation: State explicitly in Sec. 2.1.1 and/or Sec. 2.3 whether edges/topology are used anywhere downstream. If not, rephrase claims to avoid implying topological modeling, and discuss limitations (Sec. 4). Optionally include a topology-aware baseline (e.g., GNN) or add simple topology/MAH engineered features (formation time, main-branch mass accretion history, branching ratios) to test whether explicit structure improves inference.
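As an example of the simplest engineered feature mentioned above, the sketch below computes a half-mass formation scale factor from a main-branch mass accretion history. It assumes the main branch has already been extracted from the tree and uses the common definition (earliest snapshot at which the main-progenitor mass reaches half of the final mass), without interpolating between snapshots; the function name and the toy numbers are illustrative only.

```python
def half_mass_formation_scale(scale_factors, main_branch_masses):
    """Earliest scale factor at which the main branch holds >= half its final mass."""
    if not scale_factors or len(scale_factors) != len(main_branch_masses):
        raise ValueError("need equally long, non-empty histories")
    history = sorted(zip(scale_factors, main_branch_masses))  # order by time
    final_mass = history[-1][1]
    if final_mass <= 0:
        raise ValueError("final mass must be positive")
    for a, mass in history:
        if mass >= 0.5 * final_mass:
            return a


# Toy main-branch history (masses in arbitrary units).
a_snap = [0.2, 0.4, 0.6, 0.8, 1.0]
m_main = [1e11, 3e11, 6e11, 9e11, 1e12]
a_form = half_mass_formation_scale(a_snap, m_main)
```

Adding a handful of such per-tree features to the aggregate-statistics baselines, or to the proposed model, would give a direct test of whether explicit structure carries information beyond the node multiset.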
-
Dataset provenance and astrophysical context are insufficiently described, limiting reproducibility and scientific interpretation. Sec. 2.1 and Sec. 3.1 list counts and parameter ranges but not the simulation suite, box size, mass resolution, redshift outputs, halo finder, tree-building method, or selection criteria. This also prevents assessing sensitivity to known systematics (resolution, halo finder, tree construction).
Recommendation: Add a dedicated dataset subsection in Sec. 2.1 describing: simulation code/name, box size, particle mass, snapshot/redshift range, halo finder and parameters, tree builder, cosmology grid design, and selection cuts (mass thresholds, centrals vs satellites, pruning). Report the distribution of nodes per tree (median/IQR) and how many snapshots contribute. In Sec. 4, discuss how these choices might affect transfer to other simulations or to observations.