-
Core conceptual ambiguity: the paper interprets “performance variability” as reflecting clinic instability/disruption (implicitly resilience), but the measured SD/CV of reported annual rates conflates (i) true underlying changes in clinic quality/processes, (ii) within-stratum case-mix shifts, (iii) expected sampling variability of rates given finite denominators (cycle counts), and (iv) reporting artifacts (clearly present for donor eggs in 2022) (Secs. 1, 3.5, 4). Without explicitly modeling denominator-driven uncertainty, the analysis risks labeling statistical noise (or changing composition) as “volatile performance,” and the counterintuitive volume–variability finding is especially hard to interpret under this conflation (Secs. 3.3–3.4.1).
Recommendation: Add an explicit estimand/interpretation paragraph (Sec. 1 and/or Sec. 2.4): clarify whether the goal is (a) observed volatility in published clinic metrics (consumer-facing) or (b) latent instability in underlying clinic performance. If aiming at (b), incorporate denominators/precision into the analysis. Concretely, in Sec. 2.4–2.5 and results (Sec. 3.3–3.4), implement at least one of: (i) compute an “excess variability” measure by comparing observed across-year variance of a rate to its expected binomial sampling variance $p(1-p)/N$ using stratum-specific denominators; (ii) shrink annual rates toward a clinic–stratum mean (empirical Bayes/meta-analytic) before computing across-year variability; or (iii) model annual counts (e.g., live births) as binomial with denominators (retrievals/transfers) and estimate a time-varying clinic component. At minimum, add sensitivity analyses restricting to clinic–strata with sufficiently large denominators in each year (e.g., $N\geq 25$ or $N\geq 50$) and show whether the volume–variability association persists.
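To make option (i) concrete, the following is a minimal sketch of an excess-variability calculation for a single clinic–stratum. It is an illustration under stated assumptions, not the authors' method: the function name `excess_variance` and its inputs (annual event counts and annual denominators) are hypothetical, and the expected variance uses a pooled rate as a plug-in estimate of $p$ under the null hypothesis that the true rate is constant across years.

```python
from statistics import variance


def excess_variance(events, denominators):
    """Compare observed across-year variance of annual rates with the
    variance expected from binomial sampling alone.

    events: annual numerators (e.g., live births), one per year.
    denominators: annual cycle counts, one per year, all > 0.
    Returns (observed, expected, excess). Names are illustrative only.
    """
    if len(events) != len(denominators) or len(events) < 2:
        raise ValueError("need matching inputs with at least two years")
    if any(n <= 0 for n in denominators):
        raise ValueError("denominators must be positive")

    rates = [e / n for e, n in zip(events, denominators)]
    # Pooled rate: plug-in estimate of p under a constant true rate.
    pooled = sum(events) / sum(denominators)
    # Sample variance (ddof=1) of the annual rates.
    observed = variance(rates)
    # Under a constant p, the expected sample variance equals the
    # average of the per-year binomial variances p(1-p)/N_t.
    expected = sum(pooled * (1 - pooled) / n for n in denominators) / len(denominators)
    return observed, expected, observed - expected


# Example: rates of 25%, 75%, 50% on 20 cycles per year.
obs, exp, exc = excess_variance([5, 15, 10], [20, 20, 20])
```

An excess near or below zero suggests that the observed year-to-year variation is no larger than sampling noise alone would produce; with only two or three years per clinic–stratum the estimate is itself very noisy, so it is better summarized across clinics than interpreted clinic by clinic.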
-
Pandemic-era / COVID-19 framing is not operationalized: the manuscript analyzes only 2020–2022 and contains suggestive language about “pandemic disruption,” but provides no pre-pandemic baseline and no direct/proxy measures of local pandemic intensity (shutdown timing, case rates, policy) (Sec. 1; Secs. 3.3–3.5, 4). As written, attribution of observed patterns to COVID-era disruption is speculative because similar variability could exist in non-pandemic years.
Recommendation: Revise Sec. 1, Sec. 3.5, and Sec. 4 to frame the study as describing variability during 2020–2022 rather than identifying pandemic effects, unless additional analyses are added. If feasible, extend extraction to include at least one pre-pandemic window (e.g., 2017–2019) and compare variability distributions and volume–variability relationships pre vs. during 2020–2022; describe methods in Sec. 2.4–2.5 and report in Sec. 3.3–3.5. Alternatively, link external COVID intensity indicators at the state/county level and test whether variability is higher in higher-intensity areas.
-
Clinic volume is defined inconsistently across Methods and Results, undermining the central volume–variability findings. Sec. 2.4 defines Avg_Clinic_Volume as the mean Cycle_Count across years within a clinic–stratum (possibly only for years with metric data), while Sec. 3.1 refers to a “maximum” Stratum_Cycle_Count and later reverts to averages (Secs. 2.4 vs. 3.1 vs. 3.4.1). It is also unclear how zeros, missing years, duplicate rows, and multi-stratum clinics are handled; and stratum-specific volume may conflate “size” with case-mix (age/egg-source composition).
Recommendation: In Secs. 2.2–2.4, provide a single precise mathematical definition for: (i) per-year stratum volume; (ii) Stratum_Cycle_Count (if used); and (iii) Avg_Clinic_Volume, including the averaging set (all three years vs only years with non-missing outcomes) and treatment of zero-cycle years. Resolve the Sec. 3.1 “maximum” vs “average” inconsistency and ensure all figures/models use the same definition. Add a sensitivity analysis using an overall clinic-level volume (total cycles across strata) alongside stratum-specific volume, or explicitly limit conclusions to stratum-specific volume (Sec. 3.5, Sec. 4). Also clarify whether volume quartiles (Sec. 3.4.1; Figs. 10–17) are computed within stratum or globally.
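As an illustration of the level of precision requested, one possible formalization is given here. The notation is hypothetical and not taken from the manuscript: let $C_{i,s,t}$ denote Cycle_Count for clinic $i$, stratum $s$, and year $t$, and let $T_{i,s}\subseteq\{2020,2021,2022\}$ be the set of years entering the average.

$$
\mathrm{Avg\_Clinic\_Volume}_{i,s} \;=\; \frac{1}{|T_{i,s}|}\sum_{t\in T_{i,s}} C_{i,s,t},
\qquad
\mathrm{Max\_Volume}_{i,s} \;=\; \max_{t\in T_{i,s}} C_{i,s,t}.
$$

Under this notation, the manuscript would need to state whether $T_{i,s}$ contains all three years (with zero-cycle years counted as $C_{i,s,t}=0$) or only the years with non-missing outcomes, and which of the two quantities is used in each figure and model.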
-
Donor-egg results are not reliable because the manuscript documents an apparent systemic anomaly in 2022 donor-egg live-birth rates (0$\%$ for all clinics), yet donor-egg SD/CV, correlations, and regressions appear to include 2022 (Secs. 3.2–3.4.1, 3.5). With one year mechanically set to zero, across-year SD is driven mainly by the gap between the artificial zero and the 2020–2021 values, while the across-year mean is pulled downward, which further inflates CV. Both effects distort any donor-egg volume/geography associations.
Recommendation: Pre-specify and implement donor-egg analytic scenarios in Sec. 2.4–2.5: (i) treat 2022 donor-egg outcome values as missing; and/or (ii) restrict donor-egg variability analyses to 2020–2021 (noting $n=2$ limitations); and/or (iii) omit donor-egg variability analyses until the anomaly is resolved. Recompute and report donor-egg descriptive/association results under the anomaly-robust scenario(s) (Sec. 3.3–3.4) and revise wording in the Abstract/Sec. 3.5/Sec. 4 to avoid “consistent across egg sources” claims unless they hold after this fix.
-
Metric extraction and definitions are not documented at a level that supports verification, and there is a key outcome-definition tension: the paper cannot map “$\%$ live birth per transfer” (Secs. 3.1, 3.5), yet Donor_Egg_LB_Rate is later described as “percentage of donor-egg embryo transfer cycles leading to live births,” which sounds transfer-denominator-based and risks being confused with the excluded “per transfer” metric (Secs. 3.1–3.2). This raises concern about denominator correctness for multiple metrics (intended vs actual retrieval; transfer-based outcomes) and threatens interpretability of results.
Recommendation: Add a mapping table (Sec. 2.2–2.3 or Appendix) listing each analytic metric with: exact NASS Topic/Question/Type/Filter/Breakout fields used; numerator and denominator in words; and any transformations. Explicitly define “intended retrieval” vs “actual retrieval” using NASS documentation and confirm which Cycle_Count corresponds to each metric’s denominator. For Donor_Egg_LB_Rate, state the precise numerator/denominator and explain how it differs from (or relates to) the unmappable “$\%$ live birth per transfer.” Briefly document the search/matching logic that failed for “$\%$ live birth per transfer,” so readers can reproduce and assess whether it might exist under an alternate label.
-
Variability estimation is statistically fragile with only 2–3 annual observations per clinic–stratum and with frequent low means/zeros; CV in particular can explode for near-zero means and is undefined when the mean is zero (Secs. 2.4, 3.3, 3.5). The manuscript acknowledges sensitivity but does not quantify instability, specify handling rules for zero means, or show whether key associations are robust to excluding low-mean/low-N strata.
Recommendation: In Sec. 2.4, explicitly state: (i) whether SD uses ddof=0 or ddof=1; (ii) how CV is computed when the mean is 0 (exclude vs set missing vs add epsilon), and how “near-zero” is handled; and (iii) how many clinic–strata are affected. In Sec. 3.3 (and/or Appendix), add robustness checks: (a) exclude clinic–strata with mean success rate below a threshold (e.g., $<1\%$ or $<2\%$) and re-estimate key correlations/OLS; (b) exclude strata with only 2 years of data; (c) consider alternative dispersion measures (e.g., MAD on annual rates; SD/CV after logit transform of proportions with appropriate continuity correction) and show whether qualitative conclusions persist.
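The handling rules requested above can be stated unambiguously in a few lines. The sketch below is illustrative only: the function name `dispersion`, the `min_mean` threshold, and the choice to return `None` for an undefined CV are assumptions, not the manuscript's conventions.

```python
from statistics import mean, median, pstdev, stdev


def dispersion(rates, ddof=1, min_mean=0.0):
    """Across-year dispersion summaries for one clinic-stratum.

    rates: annual success rates as proportions (at least two values).
    ddof: 1 for sample SD, 0 for population SD (must be stated in Methods).
    min_mean: CV is reported as None when the mean is <= this threshold,
              which covers the undefined zero-mean case.
    """
    if len(rates) < 2:
        raise ValueError("need at least two annual rates")
    if ddof not in (0, 1):
        raise ValueError("ddof must be 0 or 1")

    m = mean(rates)
    sd = stdev(rates) if ddof == 1 else pstdev(rates)
    # CV is undefined for a zero mean and unstable for a near-zero mean.
    cv = sd / m if m > min_mean else None
    # Median absolute deviation as a scale measure that does not divide by the mean.
    med = median(rates)
    mad = median([abs(r - med) for r in rates])
    return {"mean": m, "sd": sd, "cv": cv, "mad": mad}


summary = dispersion([0.2, 0.4, 0.6])        # cv is 0.5 with ddof=1
all_zero = dispersion([0.0, 0.0, 0.0])       # cv is None, not inf or NaN
```

Reporting how many clinic–strata return an undefined CV, and at which `min_mean` threshold, would directly address point (iii).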
-
Inferential modeling choices are misaligned with outcome distributions and diagnostics. CV/SD are nonnegative and right-skewed; diagnostic plots show heteroscedasticity/non-normal residuals and likely influential outliers, yet OLS p-values are still used as primary evidence (Secs. 2.5.4, 3.4.3; Figs. 18–21). The state analysis uses many fixed effects with potentially small per-state sample sizes, and the paper runs many tests (metrics $\times$ strata $\times$ outcomes) with no clear multiple-testing plan (Secs. 2.5.3–2.5.4, 3.4.2–3.4.3).
Recommendation: Either (i) upgrade the modeling strategy or (ii) downgrade inferential claims. Preferably: in Sec. 2.5.4, use heteroscedasticity-robust SEs (e.g., HC3) at minimum; consider modeling $\log(\mathrm{CV}+\epsilon)$ / $\log(\mathrm{SD}+\epsilon)$, Gamma GLM with log link, or robust regression. For geography, consider partial pooling (mixed effects for state) rather than dozens of dummies, or restrict to a smaller pre-specified set of state comparisons. In Sec. 3.4.2–3.4.3, adopt and report a multiple-testing strategy (e.g., FDR within test families) and emphasize effect sizes and uncertainty over isolated $p<0.05$ findings.
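For the multiple-testing strategy, a Benjamini–Hochberg adjustment within each pre-specified test family is one simple option. The sketch below is a plain-Python illustration; the function name `benjamini_hochberg` and the example p-values are hypothetical, and the definition of a "family" (e.g., all volume correlations for one outcome) remains a choice the authors would need to justify.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Benjamini-Hochberg FDR adjustment for one family of tests.

    Returns (adjusted, rejected), both in the original input order.
    adjusted[i] is the BH-adjusted p-value (q-value) for pvalues[i].
    """
    m = len(pvalues)
    # Indices sorted by ascending p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    rejected = [q <= alpha for q in adjusted]
    return adjusted, rejected


# Illustrative family of four tests (made-up p-values).
qvals, keep = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

In this made-up family only the first test survives at a 5% false discovery rate, even though three of the four raw p-values are below 0.05, which is the pattern the report asks the authors to guard against.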
-
The central “higher volume $\rightarrow$ higher success-rate variability” result is intriguing but currently under-explained and may reflect mechanical/statistical artifacts (dependence of SD on event counts; CV dependence on mean; regression-to-the-mean; changes in patient mix within stratum; reporting/rounding practices) rather than real instability (Secs. 3.3–3.4.1, 3.5). Without conditioning on mean levels and denominator precision, interpretation remains ambiguous.
Recommendation: In Sec. 3.4.1–3.4.3, add analyses that separate level from variability: include mean success rate as a covariate (or stratify by mean-rate bands) when relating volume to SD/CV; test whether volume associations persist within narrower mean ranges. If possible, add simple clinic-level case-mix proxies using NASS (e.g., distribution of age strata across the clinic) to see whether volume–variability associations attenuate. Update Sec. 3.5/Sec. 4 to more explicitly present alternative explanations and avoid causal wording.
-
Figures and reporting contain multiple consistency/readability issues that reduce actionability and confidence: inconsistent labeling/units (percent vs proportion), mismatches between captions and plotted statistics, p-values shown as 0, conflicting repeated Spearman $\rho$ values (e.g., $\rho=0.436$ in text vs $\rho=0.45$ in Fig. 6 caption), missing sample sizes by group, overplotting, and axes compressed by outliers (Figs. 2–3, 6–17; Sec. 3.3–3.4).
Recommendation: Systematically audit all figures and captions: standardize terminology (CV, SD, volume), units, and rounding (e.g., report Spearman $\rho$ to 2 d.p. everywhere). Never display $p=0$; use $p<10^{-k}$. Add $N$ per panel/quartile/state in captions or directly on plots. Improve readability (vector or 300 dpi export, larger fonts, transparency/jitter, axis breaks/insets for extreme outliers). Ensure diagnostic figures (Figs. 18–21) are tied to concrete modeling changes (robust SEs/transformations) rather than presented as a stand-alone caveat.
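The p-value reporting rule is easy to enforce with a single formatting helper applied to every table and figure annotation. The sketch below is one possible convention, not a prescribed one: the function name `format_p` and the three-decimal default are assumptions.

```python
def format_p(p, digits=3):
    """Format a p-value so that it is never displayed as exactly zero.

    Values below 10**-digits are reported as an inequality
    (e.g., "p < 0.001"); all others are rounded to `digits` decimals.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    threshold = 10 ** -digits
    if p < threshold:
        return f"p < {threshold:.{digits}f}"
    return f"p = {p:.{digits}f}"


labels = [format_p(v) for v in (0.0, 0.0004, 0.0437)]
```

Routing all reported p-values through one such helper also removes the rounding inconsistencies between text and captions noted above.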