-
Reproducibility and provenance: the paper does not provide publication-grade details on how each posterior sample set was produced and whether all five analyses are comparable beyond the waveform approximant choice (Sec. 2.1–2.2). The use of internal filesystem paths obscures provenance. Critically, cross-model posterior differences can be caused (or amplified) by differences in priors, calibration version, PSD estimation, data segments, detector network, reference frequency, parameterization choices, or sampler settings—not only waveform physics. The current description does not convincingly isolate “waveform-model systematics” from “analysis-configuration differences.”
Recommendation: Rewrite/expand Sec. 2.1 into a reproducible data+PE configuration specification. For each model, explicitly state: (i) source of posteriors (LVK public release DOI/URL vs your own PE), (ii) detectors used, GPS time window, PSD method, calibration version/marginalization, (iii) priors (including mass, distance/redshift, spin magnitude/tilt, orientation), (iv) sampler and settings, and (v) final number of (effective) posterior samples used after burn-in/thinning/reweighting. Add a compact table summarizing “what is held fixed vs what changes” across models. If priors differ in any way, reweight posteriors to a common prior before computing divergences (or explicitly quantify the effect by comparing divergences pre/post reweighting). Also explicitly confirm cosmology/source-frame conversions are identical across models when comparing source-frame masses and redshift.
-
Multivariate JSD via multivariate KDE is under-specified and potentially unstable in moderate dimension (up to 6D in the Individual Spin & Orientation subspace), yet it underpins key conclusions about which physics drives disagreements (Sec. 2.4.2, Sec. 3.3). In $6$D, KDE bandwidth choice and sample size can dominate density estimates and thus JSD; JSD values “near maximum” may reflect estimator artifacts rather than genuine separation. The manuscript also lacks uncertainty quantification for JSD estimates.
Recommendation: In Sec. 2.4.2, provide the exact mathematical definition of the multivariate JSD used (including log base and normalization) and fully specify the estimator: KDE kernel, bandwidth selection rule (and whether shared across models), boundary handling (especially for bounded spins and angular variables), and numerical integration/Monte Carlo procedure used to compute JSD from KDEs. Report sample sizes and effective sample sizes per model per subspace. Add uncertainty and stability diagnostics: bootstrap over posterior samples to provide confidence intervals on each pairwise/subspace JSD, and a sensitivity study varying bandwidth (and/or using subsampling) to demonstrate conclusions persist. Consider adding (even as a check in an Appendix) at least one alternative two-sample/divergence approach that is more robust in higher dimensions (e.g., kNN-based divergence estimators, MMD/energy distance, sliced Wasserstein, or classifier-based two-sample tests), and show that the ranking of “most discrepant subspace” is consistent.
-
Robustness thresholds (max pairwise JSD and median-spread cutoffs) are central to the paper’s headline conclusion (“no key parameter is robust”), but the thresholds are currently ad hoc/uncalibrated and inconsistently stated (0.01 vs 0.05) (Sec. 2.5.1, Sec. 3.4, Sec. 4.2–4.3; Table 3). Without calibration, it is unclear whether the robustness classification is scientifically meaningful or overly stringent for this event class.
Recommendation: In Sec. 2.5.1, state a single, unambiguous robustness decision rule and use it consistently everywhere (including Abstract, Sec. 3.4, Table 3, Sec. 4.2–4.3). Then calibrate/justify thresholds: (i) provide a sensitivity analysis over plausible cutoffs (e.g., JSD $\in \lbrace 0.02, 0.05, 0.1 \rbrace$ and median spread $\in \lbrace 5\%, 10\%, 20\% \rbrace$) and report how many/which parameters change category; (ii) complement JSD with an effect-size metric tied to inference impact, e.g., median shifts in units of a pooled posterior standard deviation or overlap of $90\%$ credible intervals. If possible, add an injection-based calibration (even one or two representative injections in the GW231123 regime) linking typical waveform-systematics-induced JSD values to known biases relative to truth. Temper the strength of the global conclusion if it hinges on one particular threshold choice.
-
UMAP is used as a central diagnostic to claim clustering (core group vs isolated phenomenological models) and to motivate interpretation, but UMAP’s stochasticity and hyperparameter dependence are not sufficiently documented or tested (Sec. 2.3, Sec. 3.2). As presented, the embedding risks being interpreted as stronger evidence than warranted; distances and separations in $2$D may not correspond to meaningful separation in the original $13$D space.
Recommendation: In Sec. 2.3, fully document UMAP settings: parameter list, standardization (joint across all models vs per model), distance metric, handling of periodic/angular parameters (ideally via $\sin/\cos$ transforms before z-scoring), n_neighbors, min_dist, initialization, epochs, and random seed(s). In Sec. 3.2, add a stability analysis: repeat embeddings across multiple seeds and a small grid of (n_neighbors, min_dist), and show the qualitative grouping persists. Add at least one quantitative check: silhouette score by model label in the embedding and/or a two-sample distance/test computed directly in the original standardized space (to avoid over-reliance on 2D visuals). Rephrase interpretation to clearly distinguish “visual diagnostic” from “quantitative discrepancy metric,” pointing readers to the divergence results (Sec. 3.1, Sec. 3.3) for the main quantitative claims.
-
The manuscript frequently moves from “subspace posteriors differ” to causal attributions to specific waveform-physics ingredients (precession implementation, higher-mode content, merger–ringdown modeling), but the presented analyses primarily establish correlation/association, not causation (Sec. 2.4.3, Sec. 3.3.2–3.3.4, Sec. 4.2–4.3). Without controlled comparisons, readers may overinterpret these as definitive mechanistic explanations.
Recommendation: In Sec. 2.4.3 and Sec. 3.3, explicitly label physical explanations as hypotheses (“consistent with”) unless supported by additional tests. Add a concise table listing each model’s salient physics ingredients (precession scheme, mode content, calibration region, ringdown treatment) and relate these to observed discrepancy patterns via structured comparisons (e.g., within-group vs between-group JSD among models sharing features). Strongly consider adding at least one validation exercise: a controlled injection study (or a small set) where the injected signal is known and where changing a single modeling ingredient is expected to predominantly affect a specific subspace; demonstrate the decomposition recovers this behavior. If injections are out of scope, narrow/temper causal language in Sec. 4.2–4.3.
-
The “consensus posterior” construction via concatenating samples across models is not a generally valid inference operation unless explicitly framed as Bayesian model averaging with defined model weights; otherwise it can misrepresent uncertainty and can be misleading even if parameters appear ‘robust’ (Sec. 2.5.3).
Recommendation: Revise Sec. 2.5.3 to (i) explicitly define the mathematical object you intend (e.g., an equally weighted mixture over model-conditioned posteriors), (ii) state model weights and justify them (equal weights vs evidence-based weights vs prior model probabilities), and (iii) clarify limitations. If the intent is diagnostic, label it clearly as a visualization/summary tool, not a principled posterior. Optionally, add a short comparison to formal Bayesian model averaging: $p(\theta \mid d)=\sum_m p(\theta \mid d, m) \, p(m \mid d)$, and explain what is assumed/omitted (e.g., evidences not computed).
-
Physical plausibility/consistency checks are not sufficiently developed given the extreme cross-model spreads reported for some derived quantities (e.g., large shifts in source-frame masses and redshift) (Sec. 3.1, Sec. 3.4). These may be real for a short high-mass signal, but they also commonly indicate differences in priors, cosmology/source-frame conversion, or parameter-definition mismatches across posterior files.
Recommendation: Add explicit consistency checks and documentation: confirm that (i) all posteriors use the same cosmology when reporting redshift/source-frame masses, (ii) detector-frame vs source-frame quantities are not mixed, and (iii) derived parameters (e.g., remnant properties) are computed using the same recipe across models. Provide a short “sanity check” subsection (end of Sec. 2.1 or start of Sec. 3.1) documenting that prior supports match and that the observed posterior differences are not prior-driven (e.g., show prior overlays for a few key parameters, or report prior-to-posterior information gain per model).