-
BMA weights derived from BIC are not sufficiently justified for GW parameter estimation here, and the BIC definition/implementation is ambiguous (Sec. 2.4.1, Sec. 3.4.1; Table 4). In GW likelihoods the notion of the number of data points $n$ (and BIC’s asymptotic i.i.d. assumptions) is non-trivial, and it is not clearly established that $n$ and $k$ are identical across waveform models. Since Table 4 weights materially determine the BMA meta-posterior and downstream astrophysical statements (Sec. 3.4.2, Sec. 4), the quantitative conclusions are currently sensitive to an approximation whose validity is not demonstrated. In addition, the manuscript’s BIC equation conflicts with the stated meaning of $L_{\rm max, i}$ (“maximum log-likelihood”), risking a log-of-a-log inconsistency.
Recommendation: In Sec. 2.4.1, (i) define unambiguously whether $L_{\rm max, i}$ denotes the maximum likelihood $L(\hat{\theta})$ or the maximum log-likelihood $\ell_{\rm max} = \ln L(\hat{\theta})$, and write BIC consistently as $\mathrm{BIC} = k \ln n - 2 \ln L_{\rm max}$ or $\mathrm{BIC} = k \ln n - 2\ell_{\rm max}$; (ii) give a precise operational definition of $n$ for the analysis (or explain why $n$ cancels in all comparisons you actually use), and specify $k$ per model if it differs. Then, in Sec. 3.4.1–3.4.2, add sensitivity analyses that recompute key BMA outputs (Table 5 parameters and any key posterior probabilities) under alternative weighting choices: equal weights; weights based on $\Delta\ell_{\rm max}$ only; and (if available from the PE runs) weights based on proper log-evidences (nested sampling $\log Z$ / thermodynamic integration) for at least a subset of models. Explicitly report how strongly $m_1$, $z$, $\chi_{\rm eff}$, $\chi_p$, and $\cos\theta_{\rm JN}$ shift under these alternatives, and qualify claims that rely on the BIC-weighted choice.
-
Feature-based attribution via Spearman correlations is statistically underdetermined and potentially confounded with only five waveform models, and the analysis likely inflates the effective sample size by treating multiple parameters per model as independent (Sec. 2.3, Sec. 3.3; Fig. 3). With one-hot binary features that are collinear (e.g., Phenom vs frequency-domain; EOB vs time-domain; NR surrogate as a singleton), reported correlations (e.g., $\rho \approx 0.3$–$0.4$) cannot be interpreted robustly as “drivers,” and no uncertainty, multiple-testing control, or leave-one-model-out robustness is shown.
Recommendation: Reframe Sec. 2.3 and Sec. 3.3 as primarily descriptive unless stronger statistical support can be added. Concretely: (i) state the true number of independent units (models) and clarify the construction of the discrepancy dataset (how many JS/SWD values per model; how dependencies across parameters are handled); (ii) report uncertainty on correlation estimates via permutation tests that respect grouping by model and/or a leave-one-model-out analysis; (iii) quantify feature collinearity (feature–feature correlations) and avoid interpreting correlated features as separable causes. Consider replacing pairwise Spearman bars with simpler, actionable summaries: group-by-feature comparisons (e.g., “HOM included vs not,” “full precession vs simplified”) and/or a very low-dimensional, strongly regularized regression with model-level clustering, clearly labeled as exploratory. Temper causal language in Sec. 3.3 and Sec. 4 accordingly.
-
Reproducibility and PE-configuration parity are not adequately documented, making it hard to attribute posterior differences to waveform physics rather than to analysis setup differences (Sec. 2.1–2.4, Sec. 3.1). The manuscript does not clearly state whether priors, cosmology/source-frame conversions, PSDs, calibration marginalization, frequency bounds ($f_{\rm low}/f_{\rm high}$), data conditioning, and sampler settings/convergence diagnostics were identical across all waveform runs. Maximum log-likelihood comparisons (Table 4) are also difficult to interpret without evidence that each run reliably explored the relevant likelihood maxima.
Recommendation: Add a dedicated subsection (Sec. 2.1 or an Appendix) that lists, for every waveform-model PE run: priors (including spin tilt and magnitude priors), cosmology choices for $z$/source-frame masses, PSD estimation procedure, calibration handling, $f_{\rm low}/f_{\rm high}$, data segment length/windowing, sampler and stopping criteria, and convergence diagnostics (e.g., effective sample size; any chain diagnostics used). Explicitly confirm parity across models (or list deviations). Provide the number of posterior samples per model used downstream (Sec. 2.1.2, Sec. 3.1.1) and document any thinning/reweighting. This is essential to support claims in Sec. 3.2–3.4 that differences are waveform-driven.
-
The discrepancy metrics (JS divergence, SWD, and UMAP) are central to the paper’s conclusions but lack sufficient specification and robustness assessment (Sec. 2.2.1–2.2.2, Sec. 3.2). JS divergence computed from KDE marginals can depend strongly on bandwidth choice and boundary handling for bounded parameters ($\chi_p \in [0,1]$, $\cos\theta_{\rm JN} \in [-1,1]$); UMAP is stochastic and hyperparameter-dependent; SWD depends on parameter scaling/transformations and number of projections. Without robustness checks or uncertainty estimates, rankings in Tables 2–3 and clustering impressions in Fig. 2 are hard to interpret quantitatively.
Recommendation: In Sec. 2.2.1–2.2.2, fully specify: KDE library, kernel, bandwidth rule (which one, exactly), grid/support, and boundary handling (reflection, transforms, or bounded KDE) for $\chi_p$ and $\cos\theta_{\rm JN}$; for SWD, the exact parameter vector used, any transforms (e.g., log masses), normalization/standardization, the number of random projections, and the random seed(s); for UMAP, $n_{\rm neighbors}$, min_dist, metric, preprocessing/standardization, and random_state. Then add minimal robustness checks: (i) recompute JS with at least one alternative estimator (e.g., histogram-based JS on a common binning or a kNN-based divergence estimator) and/or alternative bandwidth; (ii) show UMAP stability across several seeds/hyperparameters (qualitatively is fine, but state what changed); (iii) report SWD variability via bootstrap resampling of posterior samples or repeated random projections. Summarize robustness outcomes in Sec. 3.2 (or an Appendix) so the discrepancy conclusions are auditable.
-
Waveform-model content and feature encoding are not sufficiently explicit and may contain internal inconsistencies (Sec. 2.1.1–2.1.3, Sec. 3.3). The text characterizes certain models as lacking HOMs or comprehensive precession, but at least one model name (e.g., IMRPhenomXPHM) commonly denotes inclusion of precession and higher modes; if your specific configuration restricted modes or physics, it must be stated. Without an explicit per-model description (modes included, precession implementation, calibration range), the feature matrix and the interpretation of Sec. 3.3 are hard to verify.
Recommendation: Replace abstract/meta references with a self-contained table in Sec. 2.1.1–2.1.3 listing, for each waveform model as actually run: domain (time/frequency), family, calibration approach and validity range, precession treatment, and explicit higher-mode content (list $(\ell,m)$ modes and any flags/settings used). Cite the relevant waveform papers/software documentation. Ensure Sec. 3.3 statements match this table precisely (e.g., distinguish “no HOMs” from “restricted HOM set,” or “twist-up precession” from “full precession”).
-
The paper treats NRSur7dq4 as a de facto ground truth reference for defining discrepancies and feature attribution (Sec. 2.2–2.3, Sec. 3.2–3.3) without a sufficiently critical discussion of reference dependence and surrogate validity for GW231123-like posteriors. If parts of the posterior explore regions near/outside the surrogate’s training domain (e.g., in $q$ or spins), then divergences may reflect reference limitations as much as other models’ limitations.
Recommendation: In Sec. 1 or early in Sec. 2.2, summarize NRSur7dq4’s training/calibration domain and assumptions (with citations) and assess whether GW231123 posteriors approach domain edges. Explicitly acknowledge that divergences relative to NRSur7dq4 conflate differences in other models with possible surrogate imperfections. If feasible, add a robustness check in Sec. 3.2–3.3 using an alternative reference (e.g., SEOBNRv5PHM) and/or report pairwise divergence summaries (not only vs NRSur7dq4). Temper any “ground truth” phrasing accordingly.
-
There is an unresolved tension between the discrepancy diagnostics (which identify some models as globally most discrepant) and the subsequent BMA (which can still assign those models substantial weight), but the impact of including/excluding these models is not quantified (Sec. 3.2 vs Sec. 3.4). This makes it difficult to interpret the BMA meta-posterior as a mitigation of waveform systematics rather than an averaging over potentially inconsistent inferences.
Recommendation: In Sec. 3.4.1–3.4.2, add an explicit robustness study: recompute the meta-posterior and Table 5 summaries after excluding IMRPhenomXO4a and/or IMRPhenomXPHM (or after down-weighting based on a stated “model adequacy/physics completeness” prior). Report shifts in $m_1$, $z$, $\chi_{\rm eff}$, $\chi_p$, and $\cos\theta_{\rm JN}$ and discuss what practitioners should conclude when a high-likelihood but globally discrepant model dominates weights. Consider discussing model stacking / predictive approaches as an alternative to BIC-BMA, even if only as future work.
-
Several astrophysical conclusions (PISN mass-gap placement; dynamical/hierarchical formation; strength of spin-orbit misalignment claims from $\chi_p$) are stated more strongly than is justified given the remaining waveform dependence, the limited model set, and the approximate nature of the BMA weights (Sec. 3.4.2, Sec. 4). In particular, “in the pair-instability mass gap” claims should quantify posterior probability relative to a stated threshold and show sensitivity to waveform choice and to $z$/cosmology assumptions.
Recommendation: In Sec. 3.4.2 and Sec. 4, moderate language to reflect residual systematic uncertainty. Quantify key probabilities rather than categorical statements, e.g., $P(m_1 > m_{\rm gap})$ for one or more literature thresholds (state which), and $P(M_f > 100\,M_{\odot})$ for the IMBH claim. Show these probabilities per-model and under the alternative weighting schemes requested above. Provide a compact per-model table/figure for $\chi_p$ (medians and credible intervals) to support claims of robust strong precession, and phrase formation-channel inferences as suggestive/consistent rather than definitive.