-
The manuscript’s core comparisons are covariance-based (and PCA is linear), but it does not establish that the relevant posteriors are close enough to unimodal/elliptical in the chosen coordinates for second-moment summaries to be representative. For a high-mass, strongly precessing event, posteriors can be skewed, heavy-tailed, bounded, curved, or multimodal—especially in orientation variables—so two posteriors may share similar covariances while differing substantially in shape/topology, or vice versa (Secs. 2.3–2.4, 3.2–3.4).
Recommendation: In Sec. 2.3 and Secs. 3.2–3.4, add explicit posterior-shape diagnostics per model for the most-discussed parameters (masses/redshift or distance; $\cos\theta_{jn}$; $\phi_{jl}$; tilts): e.g., 1D skewness/kurtosis (or robust alternatives), and a small set of representative 2D projections that show whether degeneracies are approximately linear/elliptical or visibly curved/multimodal. Then, either (i) clearly scope claims as “second-moment/correlation-structure comparisons,” or (ii) add one complementary sample-based multivariate discrepancy check for at least one key pair (e.g., NRSur7dq4 vs IMRPhenomXO4a): sliced Wasserstein, energy distance/MMD, or a classifier two-sample test on the same 9D space, to demonstrate that the headline conclusions (clustering; orientation-driven mismatch) persist beyond covariance summaries.
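To make option (ii) concrete, the following is a minimal sketch of a sample-based check on synthetic data, not on the manuscript's posteriors. It assumes the energy distance as the discrepancy measure and uses a simple moment-based excess kurtosis as the shape diagnostic; the function names and the toy "bimodal" posterior are hypothetical and chosen only to show that two sample sets with nearly identical covariances can still be clearly distinguishable.

```python
import numpy as np

def excess_kurtosis(values):
    """Moment-based excess kurtosis (0 for a Gaussian)."""
    centered = values - values.mean()
    return float((centered ** 4).mean() / (centered ** 2).mean() ** 2 - 3.0)

def energy_distance(x, y):
    """Energy distance (V-statistic form) between samples x (n, d) and y (m, d)."""
    def mean_pairwise_norm(a, b):
        diff = a[:, None, :] - b[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=-1)).mean()
    return float(2.0 * mean_pairwise_norm(x, y)
                 - mean_pairwise_norm(x, x)
                 - mean_pairwise_norm(y, y))

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-ins for two posteriors (assumption: NOT real PE samples).
gaussian = rng.standard_normal((n, 2))
gaussian_redraw = rng.standard_normal((n, 2))  # same distribution, independent draw

# Bimodal in the first coordinate, tuned so that mean ~ 0 and covariance ~ identity.
offset = 0.95
bimodal = rng.standard_normal((n, 2))
signs = rng.choice([-1.0, 1.0], size=n)
bimodal[:, 0] = offset * signs + np.sqrt(1.0 - offset ** 2) * bimodal[:, 0]

# Second moments agree closely, but shape diagnostics and the energy distance do not.
cov_gap = float(np.abs(np.cov(gaussian, rowvar=False)
                       - np.cov(bimodal, rowvar=False)).max())
kurt_gaussian = excess_kurtosis(gaussian[:, 0])
kurt_bimodal = excess_kurtosis(bimodal[:, 0])
d_null = energy_distance(gaussian, gaussian_redraw)   # baseline: same distribution
d_shape = energy_distance(gaussian, bimodal)          # same covariance, different shape

print(f"max |delta cov| = {cov_gap:.3f}")
print(f"excess kurtosis: gaussian = {kurt_gaussian:.2f}, bimodal = {kurt_bimodal:.2f}")
print(f"energy distance: null = {d_null:.4f}, shape = {d_shape:.4f}")
```

In an actual analysis the null baseline would more appropriately come from a permutation test on the pooled samples, and the pairwise-distance computation would need subsampling or chunking for large posterior sample counts.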
-
Per-model standardization (StandardScaler applied separately to each waveform posterior) fundamentally changes what is being compared: it removes cross-model mean shifts and forces each marginal variance to $\sim 1$, so the SPD-manifold analysis and $\Delta C$ largely probe correlation structure rather than covariance in physical units. Several interpretations (e.g., near-zero diagonals of $\Delta C$, statements about “complete local quadratic approximation”) read as empirical findings rather than consequences of the preprocessing choice (Secs. 2.3.1, 2.4.1, 3.4; also wording around p.4 and p.9).
Recommendation: Make the comparison target explicit in Sec. 2.3.1 and Sec. 2.4.1: if variables are standardized per model, refer to the resulting matrices as (approximately) correlation matrices and state that diagonal variance differences are not identifiable by construction. Revise the “complete local quadratic approximation” language to reflect ‘shape up to axis rescaling’ rather than the full quadratic form in the original coordinates. If scientific conclusions involve differences in absolute uncertainties or mean shifts, add a parallel analysis: (i) compute SPD distances on unstandardized covariances in consistent units/coordinates (or using a shared/pooled scaler), and/or (ii) separately compare mean vectors and marginal variances across models (e.g., a table/plot of $\sigma$ differences before standardization). Clearly separate results that come from correlation-structure differences from those that come from scale/location differences.
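The point about identifiability can be checked directly. The sketch below uses synthetic two-parameter samples (hypothetical means and covariances, not values from the manuscript) and a hand-written z-scoring function that mimics per-model standardization with the population standard deviation; it shows that the resulting covariance equals the correlation matrix and that the diagonal of $\Delta C$ vanishes regardless of how different the original variances are.

```python
import numpy as np

def standardize(samples):
    """Per-model z-scoring: subtract the column mean, divide by the population std."""
    return (samples - samples.mean(axis=0)) / samples.std(axis=0)

rng = np.random.default_rng(1)

# Two synthetic "models" with very different means and marginal variances
# (assumption: illustrative numbers only).
cov_model_a = np.array([[4.0, 1.2], [1.2, 1.0]])
cov_model_b = np.array([[0.25, -0.1], [-0.1, 9.0]])
model_a = rng.multivariate_normal([30.0, 0.1], cov_model_a, size=5000)
model_b = rng.multivariate_normal([35.0, -0.2], cov_model_b, size=5000)

# Covariance after per-model standardization (bias=True matches the population std).
c_a = np.cov(standardize(model_a), rowvar=False, bias=True)
c_b = np.cov(standardize(model_b), rowvar=False, bias=True)
delta_c = c_a - c_b

print("diag(delta C) =", np.diag(delta_c))               # zero by construction
print("raw variance ratio =", model_a.var(axis=0) / model_b.var(axis=0))
print("off-diagonal delta C =", delta_c[0, 1])            # only correlations differ
```

The mean shift and the large variance ratios between the two synthetic models leave no trace in `delta_c`, which is why a separate table of means and marginal $\sigma$ values is needed if those differences matter scientifically.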
-
Angular/topological variables are treated with Euclidean linear tools without sufficient specification of conventions and wrap handling, even though the main discrepancies are attributed to orientation correlations (notably $\cos\theta_{jn}$–$\phi_{jl}$). For periodic angles ($\phi_{jl}$/$\phi_{j1}$) and bounded variables ($\cos\theta_{jn}$, $\cos\text{tilt}_i$), naive covariances can be dominated by the branch cut / wrapping choice, leading to artificial correlations or model-dependent artifacts (Secs. 2.1, 2.3.4, 3.2.2, 3.4). There is also a notation inconsistency ($\phi_{jl}$ vs $\phi_{j1}$).
Recommendation: In Sec. 2.1, provide explicit definitions, ranges, and reference-frame conventions for $\cos\theta_{jn}$, $\phi_{jl}$ (choose one symbol consistently), $\cos\text{tilt}_1$, $\cos\text{tilt}_2$, and describe how samples from all pipelines are transformed into one common convention before PCA/covariances. In Sec. 3.4, demonstrate robustness of the key orientation-correlation findings under a wrap-safe representation: e.g., replace $\phi_{jl}$ with $(\sin\phi_{jl}, \cos\phi_{jl})$ (and update the 9D set accordingly), or apply a documented unwrapping procedure anchored to a mode/median; then show that the identified large $\Delta C$ entries and model-distance ordering persist. If the representation changes dimensionality, state this clearly and (if needed) present the angular-robust check as a focused appendix/supplement.
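The size of the wrapping effect is easy to underestimate, so a toy illustration may help. The sketch below is built on synthetic samples: a periodic angle concentrated near the branch cut and linearly related to a second, generic parameter `x` (both hypothetical stand-ins, not the manuscript's variables). It compares the naive correlation under two wrapping conventions with the correlation obtained from a $\sin\phi$ embedding.

```python
import numpy as np

def corr(a, b):
    """Pearson correlation coefficient between two 1D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

rng = np.random.default_rng(2)
n = 5000

# Synthetic parameter and an angle tightly correlated with it, centered on phi = 0
# (assumption: illustrative toy model, not real posterior samples).
x = rng.standard_normal(n)
phi_centered = 0.3 * x + 0.05 * rng.standard_normal(n)   # convention: (-pi, pi]
phi_wrapped = np.mod(phi_centered, 2.0 * np.pi)          # convention: [0, 2*pi)

# Naive Euclidean correlation depends strongly on the branch-cut convention.
naive_centered = corr(x, phi_centered)
naive_wrapped = corr(x, phi_wrapped)

# The (sin, cos) embedding is identical under both conventions.
embed_centered = corr(x, np.sin(phi_centered))
embed_wrapped = corr(x, np.sin(phi_wrapped))

print(f"naive corr:  (-pi, pi] = {naive_centered:+.3f}, [0, 2pi) = {naive_wrapped:+.3f}")
print(f"sin-embed:   (-pi, pi] = {embed_centered:+.3f}, [0, 2pi) = {embed_wrapped:+.3f}")
```

In this toy case the naive correlation changes sign purely because of the wrapping convention, while the embedded version is unchanged. If the pipelines being compared place the branch cut differently, or if a posterior mode sits near the cut, the same mechanism could contribute to the reported $\Delta C$ entries.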
-
Key parameter-estimation (PE) configuration details are missing, limiting reproducibility and making it difficult to attribute differences to waveform physics rather than run-to-run analysis choices (priors, PSD estimation, calibration marginalization, $f_{\rm low}$/$f_{\rm high}$, reference frequency, sampler settings, reweighting, etc.) (Secs. 2.1–2.2). This also weakens the discussion that attributes differences to frequency-domain approximations (Secs. 3.5.2, 4).
Recommendation: Add a dedicated PE-setup subsection (e.g., Sec. 2.1.1) listing: detectors and data segment (GPS, duration), PSD estimation method, calibration-uncertainty treatment, frequency bounds and reference frequency, priors for the nine core parameters (and any fixed cosmology used to map distance$\leftrightarrow$redshift, if applicable), sampler/inference engine and settings, any reweighting, and convergence diagnostics. Explicitly state that all waveform-model runs used identical settings except for the waveform model (or enumerate differences and assess their likely impact). In Secs. 3.5.2 and 4, temper causal claims about specific waveform approximations unless supported by targeted controls; otherwise frame them as hypotheses.
-
Uncertainty/robustness of the quantitative metrics is not assessed. Covariances, PCA directions, and SPD distances can be sensitive to finite effective sample size (ESS), autocorrelation, and near-singular covariance estimation in 9D; without uncertainty bands, it is unclear whether distance differences (e.g., 4.17 vs 3.85) or alignment differences are meaningful (Secs. 2.3–2.4, 3.2–3.3). The SPD requirement is assumed rather than verified: a sample covariance matrix is guaranteed only to be positive semidefinite (PSD), not strictly positive definite (Sec. 2.4.2).
Recommendation: Report per-model posterior sample counts and (at minimum) an ESS estimate (Sec. 2.1). Add bootstrap/jackknife (or repeated subsampling) to quantify uncertainties on: explained-variance curves/intrinsic dimensionality (Fig. 2), leading PC loadings/alignment (Fig. 4), and pairwise Riemannian distances (Fig. 5/related). State the condition for SPD and what you do if matrices are ill-conditioned (e.g., diagonal loading $C\rightarrow C+\epsilon I$ or shrinkage toward a scaled identity; monitor condition numbers) (Sec. 2.4.2). Present distance values with uncertainty (e.g., mean$\pm$sd across bootstrap) and comment on the stability of the model clustering and of the “most distant” model conclusion (Sec. 3.3).
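A minimal sketch of the requested uncertainty estimate follows. It assumes that the manuscript's distance is the affine-invariant Riemannian metric on SPD matrices, $d(A,B)=\lVert\log(A^{-1/2}BA^{-1/2})\rVert_F$; if a different metric (e.g., log-Euclidean) is used, the distance function would change accordingly. The data are synthetic, the helper names are hypothetical, and the plain i.i.d. bootstrap shown here ignores autocorrelation, so thinning or a block bootstrap may be needed for correlated chains.

```python
import numpy as np

def regularized_cov(samples, eps=1e-8):
    """Sample covariance with diagonal loading C + eps * I to guarantee strict PD."""
    cov = np.cov(samples, rowvar=False)
    return cov + eps * np.eye(cov.shape[0])

def airm_distance(a, b):
    """Affine-invariant Riemannian distance between SPD matrices a and b."""
    chol = np.linalg.cholesky(a)                  # a = L L^T
    left = np.linalg.solve(chol, b)               # L^{-1} b
    whitened = np.linalg.solve(chol, left.T).T    # L^{-1} b L^{-T}
    whitened = 0.5 * (whitened + whitened.T)      # remove round-off asymmetry
    eigvals = np.linalg.eigvalsh(whitened)
    return float(np.sqrt(np.sum(np.log(eigvals) ** 2)))

def bootstrap_distance(x, y, n_boot=100, eps=1e-8, seed=0):
    """Mean and standard deviation of the distance over bootstrap resamples."""
    rng = np.random.default_rng(seed)
    distances = np.empty(n_boot)
    for i in range(n_boot):
        x_boot = x[rng.integers(0, len(x), size=len(x))]
        y_boot = y[rng.integers(0, len(y), size=len(y))]
        distances[i] = airm_distance(regularized_cov(x_boot, eps),
                                     regularized_cov(y_boot, eps))
    return float(distances.mean()), float(distances.std(ddof=1))

# Synthetic 3D example (assumption: not real posterior samples).
rng = np.random.default_rng(3)
samples_a = rng.standard_normal((400, 3))
samples_b = rng.standard_normal((400, 3)) * np.array([1.0, 1.0, 2.0])

mean_d, sd_d = bootstrap_distance(samples_a, samples_b)
cond_a = np.linalg.cond(regularized_cov(samples_a))
print(f"distance = {mean_d:.2f} +/- {sd_d:.2f} (condition number of C_a: {cond_a:.1f})")
```

Reporting the distances in this mean$\pm$sd form would show directly whether a gap such as 4.17 vs 3.85 exceeds the resampling scatter.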
-
The PCA alignment methodology may be misleading. It matches the $k$-th PC across models by index and uses $|\text{PC}_k^A\cdot\text{PC}_k^B|$; if eigenvalues are near-degenerate, however, directions can rotate within the dominant subspace, and a ‘PC1 vs PC1’ comparison is no longer uniquely defined (Secs. 2.3.4, 3.2.3).
Recommendation: In Sec. 3.2 (and/or Fig. 2), show eigenvalue spectra (not only cumulative variance) to indicate separation/degeneracy. Complement the current alignment plot with a subspace-based comparison: principal angles between the top-$k$ subspaces, or Procrustes alignment within the top-$k$ space. When discussing misalignment, explicitly note possible near-degeneracy and interpret results at the subspace level when appropriate.
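The failure mode and the proposed remedy can be seen in a small constructed example. In the sketch below, two covariance matrices share the same eigenvalue spectrum with a near-degenerate leading pair (3.0 and 2.9, hypothetical values), and the second is rotated by 45 degrees within the leading plane. The function names are illustrative; the principal angles are computed from the singular values of the product of the two orthonormal bases.

```python
import numpy as np

def top_k_basis(cov, k):
    """Orthonormal basis (columns) of the top-k principal subspace of cov."""
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order[:k]]

def principal_angles(basis_a, basis_b):
    """Principal angles (radians) between the column spans of two orthonormal bases."""
    singular_values = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
    return np.arccos(np.clip(singular_values, -1.0, 1.0))

# Near-degenerate leading eigenvalues (assumption: illustrative numbers only).
spectrum = np.diag([3.0, 2.9, 0.5])
theta = np.pi / 4
rotation = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                     [np.sin(theta),  np.cos(theta), 0.0],
                     [0.0,            0.0,           1.0]])

cov_a = spectrum
cov_b = rotation @ spectrum @ rotation.T
cov_b = 0.5 * (cov_b + cov_b.T)                   # enforce exact symmetry

basis_a = top_k_basis(cov_a, 2)
basis_b = top_k_basis(cov_b, 2)

# Index-matched comparison: looks like a substantial misalignment.
pc1_overlap = float(abs(basis_a[:, 0] @ basis_b[:, 0]))
# Subspace comparison: the top-2 subspaces coincide.
angles = principal_angles(basis_a, basis_b)

print(f"|PC1_A . PC1_B| = {pc1_overlap:.3f}")
print("principal angles (deg) =", np.degrees(angles))
```

The index-matched overlap suggests a 45-degree disagreement even though the two leading planes are identical, which is the situation in which a subspace-level statement is the more defensible one.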
-
Some astrophysical interpretations and generalizations overreach the presented evidence, especially formation-channel implications tied to $\chi_{\rm eff}$ when $\chi_{\rm eff}$ is shown to be waveform-model dependent for this event; additionally, results are based on a single event, so ‘population-level’ implications are not directly supported (Secs. 3.1, 3.5.1–3.5.2, 4).
Recommendation: In Secs. 3.5 and 4, clearly separate (i) methodological conclusions likely to generalize (the multivariate comparison toolkit) from (ii) event-specific numerical patterns (e.g., IMRPhenomXO4a being the most discrepant). Soften or qualify formation-channel language and emphasize limitations from waveform dependence, priors, and selection effects. If feasible, add a brief roadmap for applying the framework to multiple events (even as future work) and cite relevant population/systematics studies for context.