-
Baseline comparisons are far too weak to support the paper’s central claims. Sec. 3.1 / Fig. 1 compare mainly against a trivial “point mass at origin” baseline, which is not representative of modern neural emulators or prior structure-preserving approaches. As a result, it is impossible to attribute gains specifically to (i) Hamiltonian parameterization, (ii) conservative forces via autograd, or (iii) symplectic unrolling (Secs. 1, 3.1, 4).
Recommendation: Expand Sec. 2.4 (metrics/protocol) and Sec. 3.1–3.2 (results) to include a meaningful baseline suite and targeted ablations, run on identical train/test splits and horizons. At minimum include: (a) a non-symplectic state-to-state baseline (MLP/RNN/GNN next-step predictor) integrated with Euler/RK methods; (b) a force-regressing GNN baseline (learn accelerations/forces directly) with the same integrator; (c) an HNN with the same $U(q)$ but trained without symplectic unrolling (e.g., one-step loss or vector-field loss) and/or evaluated with a non-symplectic integrator; and (d) an “integrator ablation” where the learned Hamiltonian is integrated with RK4 vs leapfrog to isolate the role of symplectic integration. Report trajectory error and invariant metrics (Sec. 3.2) for $N=25/50/100$ across these baselines.
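The integrator ablation in (d) could be prototyped along the following lines. This is a minimal sketch under stated assumptions, not the authors' setup: a toy softened two-body Hamiltonian (unit masses, $G=1$) stands in for the learned one, and the names `grad_U`, `hamiltonian`, `ablate`, the softening value, and the step size are all hypothetical. Reversibility is probed by integrating forward and then backward with the sign of the step flipped.

```python
import numpy as np

EPS = 0.1  # softening length; illustrative value, not taken from the manuscript

def grad_U(q):
    """Gradient of U = -sum_{i<j} (|q_i - q_j|^2 + EPS^2)^(-1/2)."""
    d = q[:, None, :] - q[None, :, :]        # d[i, j] = q_i - q_j
    s2 = (d ** 2).sum(-1) + EPS ** 2
    np.fill_diagonal(s2, np.inf)             # drop self-interaction terms
    return (d / s2[..., None] ** 1.5).sum(axis=1)

def hamiltonian(q, p):
    """H = kinetic energy + softened pairwise potential (unit masses)."""
    d = q[:, None, :] - q[None, :, :]
    s = np.sqrt((d ** 2).sum(-1) + EPS ** 2)
    iu = np.triu_indices(len(q), k=1)        # each pair counted once
    return 0.5 * (p ** 2).sum() - (1.0 / s[iu]).sum()

def leapfrog_step(q, p, dt):
    """Kick-drift-kick leapfrog (symplectic, time-reversible)."""
    p = p - 0.5 * dt * grad_U(q)
    q = q + dt * p
    p = p - 0.5 * dt * grad_U(q)
    return q, p

def rk4_step(q, p, dt):
    """Classical fourth-order Runge-Kutta (neither symplectic nor reversible)."""
    def f(q, p):
        return p, -grad_U(q)
    k1q, k1p = f(q, p)
    k2q, k2p = f(q + 0.5 * dt * k1q, p + 0.5 * dt * k1p)
    k3q, k3p = f(q + 0.5 * dt * k2q, p + 0.5 * dt * k2p)
    k4q, k4p = f(q + dt * k3q, p + dt * k3p)
    return (q + dt / 6 * (k1q + 2 * k2q + 2 * k3q + k4q),
            p + dt / 6 * (k1p + 2 * k2p + 2 * k3p + k4p))

def ablate(step, q0, p0, dt, n_steps):
    """Return (max relative energy error, forward-backward round-trip error)."""
    q, p = q0.copy(), p0.copy()
    h0 = hamiltonian(q, p)
    max_dh = 0.0
    for _ in range(n_steps):
        q, p = step(q, p, dt)
        max_dh = max(max_dh, abs(hamiltonian(q, p) - h0))
    for _ in range(n_steps):                 # integrate back with the step sign flipped
        q, p = step(q, p, -dt)
    round_trip = np.sqrt(((q - q0) ** 2).sum() + ((p - p0) ** 2).sum())
    return max_dh / abs(h0), round_trip

# Two unit masses on a circular orbit of the softened potential.
q0 = np.array([[0.5, 0.0, 0.0], [-0.5, 0.0, 0.0]])
v = np.sqrt(0.5 / (1.0 + EPS ** 2) ** 1.5)
p0 = np.array([[0.0, v, 0.0], [0.0, -v, 0.0]])

for name, step in [("leapfrog", leapfrog_step), ("rk4", rk4_step)]:
    dh, rt = ablate(step, q0, p0, dt=0.05, n_steps=100)
    print(f"{name:9s} max|dH/H0| = {dh:.2e}   round-trip error = {rt:.2e}")
```

Which integrator shows the smaller $|\Delta H|$ depends on step size and horizon, so both numbers should be reported at the manuscript's actual settings. The round-trip error, by contrast, should sit at round-off level for leapfrog only.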
-
Quantitative reporting is insufficient: results are often described qualitatively (“significant,” “exceptionally low,” “$\approx 1$”) rather than reported numerically. Figures and text do not provide enough explicit numerical values, uncertainty estimates, or precise metric definitions (Sec. 2.4, Sec. 3.1–3.2; Fig. 1–2). This weakens the evidence and prevents reproducible comparison.
Recommendation: In Sec. 2.4, give explicit mathematical definitions for each metric, including how quantities are aggregated across particles, coordinates, time, and test simulations (e.g., per-particle vs global; mean vs median; whether velocities/momenta are included). In Sec. 3.1–3.2, report summary statistics (mean$\pm$std or median/IQR) over multiple test realizations and random seeds: trajectory MSE at several times (e.g., $t=1$, $2.5$, $5.0$), maximum/RMS $|\Delta H|$ (clearly defined), reversibility error distribution, and the volume/symplectic diagnostic distribution. Add error bars to Fig. 1 and summary bands/quantiles to Fig. 2, and state the number of runs.
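As an illustration of the level of precision requested, one possible set of definitions is sketched below. These are not the manuscript's metrics. The symbols are assumptions introduced here: $S$ is the number of test simulations, $\hat{q}, \hat{p}$ are emulated positions and momenta, $\Phi_T$ is the learned flow over horizon $T$, and $z_0 = (q(0), p(0))$.

$$
\mathrm{MSE}_q(t) = \frac{1}{S}\sum_{s=1}^{S}\frac{1}{3N}\sum_{i=1}^{N}\left\|\hat{q}_i^{(s)}(t) - q_i^{(s)}(t)\right\|_2^2
$$

$$
\Delta H_{\max}^{(s)} = \max_{0 \le t \le T}\frac{\left|H\!\left(\hat{q}^{(s)}(t), \hat{p}^{(s)}(t)\right) - H\!\left(\hat{q}^{(s)}(0), \hat{p}^{(s)}(0)\right)\right|}{\left|H\!\left(\hat{q}^{(s)}(0), \hat{p}^{(s)}(0)\right)\right|}
$$

$$
E_{\mathrm{rev}}^{(s)} = \frac{1}{\sqrt{6N}}\left\|\Phi_{-T}\!\left(\Phi_{T}\!\left(z_0^{(s)}\right)\right) - z_0^{(s)}\right\|_2
$$

Whatever definitions are adopted, the text should state whether the summary over $s$ is a mean or a median and whether momenta enter the trajectory error.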
-
Novelty/positioning relative to prior work is not clearly established. The method combines elements common in Hamiltonian Neural Networks, symplectic neural ODE/Symplectic ODE-Nets, and interaction networks/GNN N-body emulators, but the manuscript lacks a clear statement of what is new beyond that combination (Sec. 1–2).
Recommendation: Add a dedicated Related Work section (Sec. 1 or Sec. 2.x) covering HNNs, symplectic neural integrators / Symplectic ODE-Nets, and GNN-based N-body emulation. Then explicitly enumerate the paper’s novel contributions (e.g., the specific Hamiltonian + pairwise potential parameterization choice, the differentiable leapfrog unrolling setup, the radial curriculum for Plummer cores, the specific generalization tests across $N$) and what is inherited. Where appropriate, convert broad claims in Sec. 1 and Sec. 4 into precise, testable statements matched to experiments.
-
The symplecticity/phase-space preservation verification is currently under-specified and, in part, conceptually incorrect. Sec. 2.4 and Sec. 3.2.3 treat $\det(M)\approx 1$ as evidence (or even proof) of a canonical/symplectic map, but a unit determinant only tests volume preservation and is not sufficient for symplecticity beyond a single degree of freedom. Additionally, computing the determinant of a $6N \times 6N$ Jacobian is numerically delicate and method-dependent (Sec. 2.4; Sec. 3.2.3).
Recommendation: First, correct the claim: $\det(M)\approx 1$ supports volume preservation, not symplecticity. Second, clarify in Sec. 2.4 exactly how $M$ is computed (full Jacobian via autograd vs finite differences; which $N$; which timesteps; how many samples; numerical stabilization; whether $\log|\det(M)|$ is used). Third, if you want to empirically test symplecticity, report a symplectic-condition diagnostic such as $\|M^T J M - J\|$, where $J = \begin{pmatrix} 0 & I_{3N} \\ -I_{3N} & 0 \end{pmatrix}$ for the state ordering $z=(q,p)$, on sampled states, alongside volume diagnostics. Finally, note that (given exact gradients of a separable learned Hamiltonian) leapfrog is symplectic by construction; frame the diagnostic as an implementation/finite-precision sanity check rather than a proof.
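The symplectic-condition diagnostic could be computed roughly as follows. This is a minimal sketch under stated assumptions: a toy softened pairwise potential with unit masses stands in for the learned $U(q)$, the Jacobian is obtained by central finite differences rather than autograd, and all function names and numerical values are hypothetical. The final lines apply the same diagnostics to a diagonal linear map with unit determinant, which is volume-preserving but not symplectic.

```python
import numpy as np

EPS = 0.1  # softening length; illustrative value, not taken from the manuscript

def grad_U(q):
    """Gradient of the toy potential U = -sum_{i<j} (|q_i - q_j|^2 + EPS^2)^(-1/2)."""
    d = q[:, None, :] - q[None, :, :]        # d[i, j] = q_i - q_j
    s2 = (d ** 2).sum(-1) + EPS ** 2
    np.fill_diagonal(s2, np.inf)             # drop self-interaction terms
    return (d / s2[..., None] ** 1.5).sum(axis=1)

def leapfrog_map(z, dt, n_steps):
    """Kick-drift-kick leapfrog acting on the flat state z = (q, p), unit masses."""
    n = z.size // 6
    q, p = z[:3 * n].reshape(n, 3), z[3 * n:].reshape(n, 3)
    for _ in range(n_steps):
        p = p - 0.5 * dt * grad_U(q)
        q = q + dt * p
        p = p - 0.5 * dt * grad_U(q)
    return np.concatenate([q.ravel(), p.ravel()])

def jacobian_fd(f, z, h=1e-6):
    """Central finite-difference Jacobian M = df/dz (autograd would replace this)."""
    M = np.empty((z.size, z.size))
    for k in range(z.size):
        e = np.zeros(z.size)
        e[k] = h
        M[:, k] = (f(z + e) - f(z - e)) / (2 * h)
    return M

def symplectic_residual(M):
    """Frobenius norm of M^T J M - J for the ordering z = (q, p)."""
    m = M.shape[0] // 2
    I, Z = np.eye(m), np.zeros((m, m))
    J = np.block([[Z, I], [-I, Z]])
    return np.linalg.norm(M.T @ J @ M - J)

def log_abs_det(M):
    """log|det M| via slogdet; a value of 0 indicates volume preservation."""
    return np.linalg.slogdet(M)[1]

z0 = np.array([0.5, 0.0, 0.0, -0.5, 0.0, 0.0, 0.0, 0.7, 0.2,    # positions, N = 3
               0.0, 0.3, 0.0, 0.0, -0.3, 0.0, 0.1, 0.0, 0.0])   # momenta
M = jacobian_fd(lambda z: leapfrog_map(z, dt=0.05, n_steps=5), z0)
print("leapfrog:", symplectic_residual(M), log_abs_det(M))

# A volume-preserving linear map that is not symplectic (2 degrees of freedom).
D = np.diag([2.0, 0.5, 1.0, 1.0])
print("diag map:", symplectic_residual(D), log_abs_det(D))
```

For the leapfrog map both diagnostics should vanish up to finite-difference error. For the diagonal map $\log|\det|$ is zero while the symplectic residual is of order one, which is why the determinant alone cannot certify symplecticity.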
-
The scope of the “generalization” claim is overstated relative to experiments. Transfer is shown only from $N=50$ to $N=25$ and $N=100$ within the same data-generation family (virialized Plummer spheres, equal masses, fixed softening, fixed normalization choices). This is a narrow distribution shift and does not yet justify language like “learned a generalizable physical law” (Abstract; Sec. 3.1; Sec. 4).
Recommendation: Narrow the claim in the Abstract/Sec. 4 to “generalizes across particle count within the family of virialized softened Plummer spheres (equal-mass, fixed $\epsilon$).” If feasible, add tests along additional axes: (a) wider $N$ range (including larger $N$), (b) different Plummer scale radii $b$ / density scales, (c) different softening $\epsilon$, and/or (d) different initial virial ratios. Report how rollout error and invariant metrics scale with these shifts.
-
Method specification is incomplete, preventing reproducibility and capacity assessment. Key architectural and training details are missing (MLP depth/width, parameter count, any normalization/residuals, optimizer hyperparameters, batch size, number of updates/epochs, window sampling/stride, random seeds). Loss details are also underspecified (which variables enter the MSE; whether intermediate steps contribute; the exact definition of the Hamiltonian regularizer) (Sec. 2.2–2.3).
Recommendation: Expand Sec. 2.2 to fully specify the network (layers, hidden dims, activations, output scaling, aggregation, parameter count). Expand Sec. 2.3 to provide the full training protocol (Adam hyperparameters, LR schedule, batch size, number of steps/epochs, gradient clipping, seed count, window sampling). Provide the explicit full loss formula, including whether the loss is applied only at $t_{n+50}$ or across the rollout, and whether it includes $q$ only or $(q,p)$ (and, if both, their relative weighting). An appendix with pseudocode/config tables would be sufficient.
-
Energy regularization and curriculum masking may materially affect the results, but their roles are not isolated. The Hamiltonian regularizer coefficient $\lambda=0.001$ is not justified, and without ablations it is unclear whether energy behavior/reversibility are driven by symplectic integration, the conservative force construction, the explicit energy penalty, or the curriculum (Sec. 2.3; Sec. 3.2.1–3.2.2).
Recommendation: Add ablations in Sec. 3: train/evaluate with (i) $\lambda=0$, (ii) a small grid of $\lambda$ values, and (iii) no curriculum vs curriculum (and potentially different mask radii). Report impacts on trajectory MSE, $|\Delta H|$ (clearly defined), and reversibility error. If compute is limited, at least provide one controlled ablation per component and discuss the outcome in Sec. 3.2.