-
Insufficient description of the *parameter-estimation provenance and inference context* makes it hard to attribute differences to waveform systematics rather than analysis configuration. The manuscript operates on posterior samples but does not clearly document (i) where they come from (public release vs in-house), (ii) whether *priors are identical across models* (mass, spin magnitude/tilt, distance/orientation), (iii) whether the same likelihood/PSD/calibration marginalization/data segment were used, and (iv) sampler settings and convergence diagnostics (Sec. 2.1). If priors differ even slightly, JS divergences and “model-dependent” conclusions can be confounded.
Recommendation: Add a dedicated subsection in Sec. 2.1 (e.g., “Parameter-estimation inputs”) documenting for each waveform model: data segment, detector network context, PSD estimation, calibration uncertainty treatment, likelihood form, reference frequency and parameter conventions, sampler and settings, convergence metrics (ESS / stopping), and the *exact priors* used. State explicitly whether priors are identical across models; if not, either (i) reweight posteriors to a common prior where feasible, or (ii) quantify and discuss the expected impact on comparisons (especially for $\chi_p$, $\chi_{\rm eff}$, and orientation). Provide citations/run IDs/links for the posterior files.
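To illustrate option (i), the following is a minimal sketch of importance reweighting to a common prior. It assumes both priors are available as log-density functions and that the original prior has support wherever the common prior does; the function name and the callable interface are hypothetical, not taken from the manuscript.

```python
import numpy as np

def reweight_to_common_prior(samples, log_prior_orig, log_prior_common, rng=None):
    """Resample posterior samples so that they correspond to a common prior.

    samples          : (n, d) array of posterior samples drawn under the original prior
    log_prior_orig   : callable, (n, d) array -> (n,) log prior density used in the run
    log_prior_common : callable, (n, d) array -> (n,) log density of the target prior
    Returns the resampled array and the Kish effective sample size of the weights.
    """
    rng = np.random.default_rng(rng)
    # Importance weights are the ratio of target prior to original prior.
    log_w = log_prior_common(samples) - log_prior_orig(samples)
    log_w = log_w - np.max(log_w)          # stabilise before exponentiating
    w = np.exp(log_w)
    w = w / w.sum()
    ess = 1.0 / np.sum(w ** 2)             # small ESS signals unreliable reweighting
    idx = rng.choice(len(samples), size=len(samples), replace=True, p=w)
    return samples[idx], ess
```

A low effective sample size relative to the number of input samples would indicate that the two priors differ too much for reweighting to be trusted, in which case option (ii) is the safer route.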
-
The JS divergence calculations are central but not specified with enough rigor to audit, reproduce, or interpret quantitatively (Sec. 2.1.2, Sec. 2.3.2, Sec. 3.1, Sec. 3.3). Missing items include: the exact JS definition implemented (continuous vs discretized; log base and range), KDE implementation details (bandwidth rule, covariance structure), evaluation grid/bounds/resolution, normalization after discretization, treatment of zero-probability bins (regularization $\epsilon$), and boundary effects for bounded parameters (e.g., $\cos\theta_{jn}\in[-1,1]$, spin magnitudes in $[0,1]$). As written, strong statements based on values like JS $\sim 0.6$–$0.7$ are difficult to calibrate or trust without estimator uncertainty and bias checks.
Recommendation: In Sec. 2.1.2 and Sec. 2.3.2, provide the explicit mathematical JS formula used (including log base) and the numerical procedure: KDE method, bandwidth selection, grid definition (shared support across model pairs), grid resolution, renormalization, and zero-handling. Add uncertainty/sensitivity estimates: bootstrap over posterior samples and/or vary KDE bandwidth (and, for bounded variables, demonstrate robustness via reflection/logit-transform/Beta-kernels or justify why boundary bias is negligible). If JS values are used as “high/low” indicators, define thresholds relative to the JS maximum under your log base. Consider adding a non-KDE cross-check (e.g., histogram-based JS at multiple binning levels or a kNN-based divergence estimator) to demonstrate that qualitative conclusions are stable.
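As an example of the non-KDE cross-check and bootstrap uncertainty suggested above, here is a hedged sketch of a histogram-based JS divergence on a shared grid. It uses log base 2, so the result lies in $[0, 1]$; the function names, the default bin count, and the percentile choices are illustrative assumptions rather than the authors' procedure.

```python
import numpy as np

def js_divergence_hist(x, y, bins=50, bounds=None):
    """Histogram-based Jensen-Shannon divergence in bits (log base 2)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if bounds is None:
        # Shared support for both sample sets; pass explicit bounds for
        # bounded parameters such as cos(theta_jn) in [-1, 1].
        bounds = (min(x.min(), y.min()), max(x.max(), y.max()))
    edges = np.linspace(bounds[0], bounds[1], bins + 1)
    p, _ = np.histogram(x, bins=edges)
    q, _ = np.histogram(y, bins=edges)
    p = p / p.sum()                        # renormalise after discretisation
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0                       # 0 * log(0) := 0; b > 0 wherever a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def bootstrap_js(x, y, n_boot=200, seed=0, **kwargs):
    """5th, 50th and 95th percentiles of JS over bootstrap resamples."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x), np.asarray(y)
    vals = np.empty(n_boot)
    for i in range(n_boot):
        xb = rng.choice(x, size=len(x), replace=True)
        yb = rng.choice(y, size=len(y), replace=True)
        vals[i] = js_divergence_hist(xb, yb, **kwargs)
    return np.percentile(vals, [5, 50, 95])
```

Because the mixture $m$ is positive wherever either histogram is, this estimator needs no $\epsilon$ regularization. Under the natural logarithm the same quantity is bounded by $\ln 2 \approx 0.693$, which is why the log base must be stated before values such as $0.6$–$0.7$ can be interpreted. Repeating the calculation at several values of `bins` gives a direct check on binning sensitivity.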
-
UMAP is heavily relied upon to infer three “islands” and to compute model-to-model JS divergences in the embedded space, but embedding configuration, stability, and interpretability are not demonstrated (Sec. 2.3.1, Sec. 3.3–3.4). UMAP can create apparent separations depending on hyperparameters ($n_{\rm neighbors}$, $min\_dist$, metric), random seed, and density differences across models; additionally, UMAP axes are not uniquely interpretable, and distances are not likelihood-preserving. The manuscript currently risks over-interpreting island structure and UMAP-space JS as “global posterior similarity.”
Recommendation: In Sec. 2.3.1, report *all* UMAP hyperparameters ($n_{\rm neighbors}$, $min\_dist$, metric, $n_{\rm epochs}$, $random\_state$/seed, any non-default settings) and justify choices. In Sec. 3.3 or an Appendix, add stability tests across multiple seeds and a reasonable hyperparameter grid; quantify stability (e.g., Procrustes-aligned embeddings; correlation of inter-point distance matrices; trustworthiness/continuity). Explicitly caution that UMAP axes are unitless/arbitrary, and quantify axis–parameter relationships via Spearman/Pearson correlations (or local regression) rather than deterministic claims. For UMAP-space JS (Sec. 3.3), either justify it with embedding-fidelity diagnostics or reframe it as heuristic, and add at least one complementary *high-dimensional* two-sample measure not relying on UMAP (e.g., classifier two-sample test AUC, MMD/energy distance) to corroborate which model pairs are genuinely distinguishable in 14D.
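One possible form of the complementary high-dimensional measure is an energy-distance permutation test applied directly to the standardized samples, with no embedding involved. The sketch below assumes Euclidean distance and subsampled inputs (the pairwise computation scales quadratically in sample count); the function names and permutation count are illustrative.

```python
import numpy as np

def _mean_pairwise_dist(a, b):
    # Mean Euclidean distance between all rows of a and all rows of b.
    # Memory is O(len(a) * len(b) * d), so subsample large posteriors first.
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).mean()

def energy_distance(x, y):
    """Squared energy distance, 2 E|X-Y| - E|X-X'| - E|Y-Y'| (V-statistic)."""
    return (2.0 * _mean_pairwise_dist(x, y)
            - _mean_pairwise_dist(x, x)
            - _mean_pairwise_dist(y, y))

def permutation_pvalue(x, y, n_perm=200, seed=0):
    """Observed energy distance and a permutation p-value for 'same distribution'."""
    rng = np.random.default_rng(seed)
    observed = energy_distance(x, y)
    pooled = np.vstack([x, y])
    n = len(x)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        stat = energy_distance(pooled[perm[:n]], pooled[perm[n:]])
        if stat >= observed:
            count += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (count + 1) / (n_perm + 1)
```

Model pairs that are separated in the UMAP plane but not distinguishable by a test of this kind would be candidates for embedding artifacts rather than genuine posterior differences.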
-
HDBSCAN clustering underpins key claims about bimodality/unimodality and “statistically significant modes,” but parameter choices, stability, and the interpretation of “noise” are not adequately documented (Sec. 2.2, Sec. 3.2). The manuscript notes $min\_cluster\_size$ was “carefully tuned” but does not report the chosen values per model or explore sensitivity to $min\_samples$/metric/standardization. Additionally, statements conflating “one cluster” with “all noise” make unimodality claims ambiguous; in HDBSCAN these are distinct outcomes, and “all noise” does not imply unimodality, only that no density-based clusters were identified under the chosen settings.
Recommendation: In Sec. 2.2.2–2.2.3, report the full HDBSCAN configuration per model ($min\_cluster\_size$, $min\_samples$, metric, $cluster\_selection\_method$, preprocessing). In Sec. 3.2 (or Appendix), provide a robustness study: vary $min\_cluster\_size$/$min\_samples$ and (if applicable) distance metric; report number of clusters, noise fraction, cluster stability/persistence scores, and how key summaries (e.g., medians/90% CIs of $\cos\theta_{jn}$, $\chi_{\rm eff}$, $\chi_p$, masses) change. Rewrite unimodality language to distinguish “one robust cluster” from “no clusters found (all noise).” If inclination bimodality is a central physical claim, consider adding a direct bimodality diagnostic on $\cos\theta_{jn}$ (e.g., dip test/mixture fit) alongside HDBSCAN.
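To make the distinction between “one robust cluster” and “no clusters found (all noise)” concrete, a small helper along the following lines could be applied to the label array from each run of the robustness scan. It assumes the usual convention that noise points carry the label `-1`; the function name and the returned dictionary layout are hypothetical.

```python
import numpy as np

def summarize_labels(labels):
    """Summarise a clustering label array in which -1 marks noise points."""
    labels = np.asarray(labels)
    noise_fraction = float(np.mean(labels == -1))
    n_clusters = int(np.unique(labels[labels != -1]).size)
    if n_clusters == 0:
        outcome = "no clusters found (all noise)"
    elif n_clusters == 1:
        outcome = "one cluster"
    else:
        outcome = f"{n_clusters} clusters"
    return {
        "n_clusters": n_clusters,
        "noise_fraction": noise_fraction,
        "outcome": outcome,
    }
```

Tabulating this summary for every ($min\_cluster\_size$, $min\_samples$) combination would show at a glance whether a reported unimodal result is a stable single cluster or an absence of detected structure.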
-
The definition of the 14D feature space used for distances/clustering/UMAP is potentially inconsistent with the stated goal of comparing *physical* source-parameter structure (Sec. 2.1.1, Sec. 2.2, Sec. 2.3). In particular: (i) including 'log_likelihood' as a coordinate can dominate geometry and create separations that reflect fit quality or sampler artifacts rather than physical degeneracy; (ii) periodic variables (e.g., $\phi_{jl}$) appear to be z-score standardized linearly without circular handling, which can distort distances near wrap-around.
Recommendation: Decide and document a principled feature set for geometric comparisons. Prefer excluding 'log_likelihood' from the clustering/UMAP feature space (or analyze it separately as an outcome variable); if it is retained, explicitly motivate it and quantify its influence (e.g., repeat key results with/without it and show islands/clusters persist). For angular parameters ($\phi_{jl}$ and any others), use a circular embedding (sin/cos) or otherwise justify that wrap-around effects do not matter for the sampled support. Report the final feature list used for each method (JS/KDE, HDBSCAN, UMAP) in Sec. 2.
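A minimal sketch of such a feature construction is given below, assuming the posterior columns are available as a dictionary of arrays with angles in radians. The function name, the argument names, and the column key `phi_jl` are illustrative assumptions; only 'log_likelihood' is a name taken from the manuscript.

```python
import numpy as np

def build_features(columns, angular=(), exclude=("log_likelihood",)):
    """Build a feature matrix with circular embedding for angular columns.

    columns : dict mapping column name -> 1D array of samples
    angular : names of periodic columns (radians), mapped to (sin, cos)
    exclude : names dropped from the geometric feature space
    Returns (matrix of shape (n, n_features), list of feature names).
    """
    feats, names = [], []
    for name, values in columns.items():
        if name in exclude:
            continue
        values = np.asarray(values, dtype=float)
        if name in angular:
            # Points near 0 and 2*pi become neighbours on the unit circle.
            feats += [np.sin(values), np.cos(values)]
            names += [f"sin_{name}", f"cos_{name}"]
        else:
            std = values.std()
            # z-score standardisation; guard against constant columns.
            feats.append((values - values.mean()) / (std if std > 0 else 1.0))
            names.append(name)
    return np.column_stack(feats), names
```

The sin/cos pair is left unscaled here, which fixes its weight relative to the z-scored columns; that relative weighting is itself a choice the authors would need to state. Returning the feature names alongside the matrix makes it straightforward to report the exact list used by each method.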
-
Claims that $\chi_p$ is “robust” while $q$/$\chi_{\rm eff}$/$\cos\theta_{jn}$ are “model-dependent” are plausible but not supported by a consistent quantitative framework, nor clearly separated into data-driven findings vs hypothesized waveform explanations (Sec. 3.4, Sec. 4.1–4.3). Additionally, the physical interpretation would be strengthened by showing whether different islands/modes have comparable fit quality (since 'log_likelihood' is available) and by demonstrating that high $\chi_p$ is not primarily prior-driven for this short/high-mass signal.
Recommendation: In Sec. 3.4 and Sec. 4.1–4.2, add concise quantitative summaries across models for key parameters: range of posterior medians, 90% CI overlaps, and 1D JS distributions (e.g., median and interquartile range of pairwise JS). Define a criterion for “robust” vs “model-dependent” and apply it consistently. Add prior-vs-posterior comparisons (or KL to prior) for $\chi_p$, $\chi_{\rm eff}$, and $q$ for each model to substantiate robustness. Use 'log_likelihood' (or evidence if available) to check whether different UMAP islands correspond to similarly good fits; if one island systematically fits worse, state this explicitly and adjust interpretation. Expand Sec. 4.2–4.3 to connect observed discrepancies to specific known waveform-model differences (calibration ranges, precession/higher-mode treatments, time vs frequency domain approximations), with citations, clearly labeling hypotheses vs measured effects.
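As one way to apply a “robust” vs “model-dependent” criterion consistently, the cross-model summaries could be computed per parameter as sketched below. The function names and the threshold `max_ratio` are purely illustrative; the authors would need to choose and justify their own criterion.

```python
import numpy as np

def summarize_across_models(posteriors, level=0.90):
    """Cross-model summary for one parameter.

    posteriors : dict mapping model name -> 1D array of posterior samples
    Returns the range of medians, the width of the interval common to all
    models' credible intervals, and the mean credible-interval width.
    """
    tail = 50.0 * (1.0 - level)
    medians, lows, highs = [], [], []
    for samples in posteriors.values():
        lo, hi = np.percentile(samples, [tail, 100.0 - tail])
        medians.append(np.median(samples))
        lows.append(lo)
        highs.append(hi)
    median_range = max(medians) - min(medians)
    common_overlap = max(0.0, min(highs) - max(lows))
    mean_width = float(np.mean(np.array(highs) - np.array(lows)))
    return {
        "median_range": float(median_range),
        "common_overlap": float(common_overlap),
        "mean_ci_width": mean_width,
        "median_range_over_width": float(median_range / mean_width),
    }

def is_robust(summary, max_ratio=0.25):
    # Illustrative criterion only: all credible intervals share a common
    # region and the medians move by a small fraction of the typical width.
    return (summary["common_overlap"] > 0.0
            and summary["median_range_over_width"] <= max_ratio)
```

Running the same function over $\chi_p$, $\chi_{\rm eff}$, $q$, and $\cos\theta_{jn}$ would place all four parameters on a common scale, so that the robust/model-dependent labels follow from a stated rule rather than from visual inspection.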
-
The manuscript sometimes reads as if broad conclusions about waveform systematics and manifold-based comparison methods generalize beyond this case, but the study is a single-event analysis (Sec. 1, Sec. 4.3). Given GW231123’s likely high-mass/short-duration nature (merger–ringdown dominated), it may be an especially challenging or atypical case; the generality and practical “workflow” implications for catalogs/populations are not yet established.
Recommendation: Reframe Sec. 1 and Sec. 4.3 more explicitly as a *case study* of GW231123. Add a short limitations-and-generalization paragraph (Sec. 4) describing what properties of GW231123 make it a strong stress-test (e.g., short signal, strong degeneracies) and what is needed to extend the approach: computational scaling, event selection criteria, and minimal robustness diagnostics to run routinely. Where you draw broader lessons, make clear which are methodological proposals vs empirically demonstrated general patterns.