-
With one input and one output, the input-oriented VRS (BCC) DEA score is often very close to a (piecewise) transformation of the output/input ratio, which here is essentially the live-birth rate. The analysis may therefore be rediscovering the distribution of live-birth rates (and its zero mass) rather than identifying multi-factor technical efficiency (Sec. 1, Sec. 2.3, Sec. 3.3–3.4). Because the output is constructed as (live-birth rate $\times$ cycle count), the model also mechanically ties output to input, further limiting what DEA can reveal beyond “which clinics have the best reported rate within a stratum.”
Recommendation: In Sec. 1 and Sec. 4.3, explicitly discuss what DEA adds in a $1\times 1$ setting and why DEA (rather than direct rate benchmarking) is the preferred tool here. Add an empirical diagnostic in Sec. 3.3 (or a new Sec. 3.7): within each year$\times$age stratum, plot/correlate DEA efficiency vs. live-birth rate (live births per intended retrieval) and explain deviations (if any). If the intended contribution is true efficiency benchmarking, add a robustness/extension with additional outputs and/or inputs available in NASS (or clearly state they are unavailable), so DEA is substantively necessary rather than a reparameterized rate ranking (Sec. 2.1–2.3).
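The suggested diagnostic can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' pipeline: the function name `bcc_input_scores`, the variable names, and the five-clinic stratum are all hypothetical. It relies on the fact that, with one input and one output, the BCC-I optimum lies either at an observed DMU or on a segment joining two DMUs, so no LP solver is needed.

```python
def bcc_input_scores(x, y):
    """Input-oriented BCC (VRS) scores for one input x and one output y.

    With a single input and a single output, the optimum is attained either
    at an observed DMU or on a segment joining two DMUs, so enumerating both
    kinds of candidate is sufficient. Assumes every x[j] > 0.
    """
    n = len(x)
    scores = []
    for o in range(n):
        # Smallest input among single peers producing at least y[o].
        best = min(x[j] for j in range(n) if y[j] >= y[o])
        # Smallest input on any segment between two peers, taken at the
        # point where the segment's output equals y[o].
        for i in range(n):
            for j in range(n):
                if y[i] < y[o] < y[j]:
                    t = (y[o] - y[i]) / (y[j] - y[i])
                    best = min(best, x[i] + t * (x[j] - x[i]))
        scores.append(best / x[o])
    return scores


# Hypothetical stratum (not NASS data): intended retrievals and live births.
cycles = [10, 20, 40, 80, 40]
births = [2.0, 8.0, 12.0, 16.0, 4.0]

theta = bcc_input_scores(cycles, births)
rate = [b / c for b, c in zip(births, cycles)]

for c, r, t in zip(cycles, rate, theta):
    print(f"cycles={c:3d}  rate={r:.2f}  theta={t:.3f}")
```

In this toy stratum the clinics with rates 0.20, 0.30 and 0.40 all receive $\theta = 1$ under VRS, while the 0.10 clinic scores about 0.33, so the score is not a monotone function of the rate. A scatter of `theta` against `rate` on the real data would show how large such deviations are in practice.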
-
Output construction has a critical unit/rounding problem that can materially distort efficiency scores, especially for low-volume clinics and older age groups (Sec. 2.2, Sec. 3.2.3–3.2.4, Sec. 3.5). The manuscript defines live births as (“$\%$ live births per intended retrieval”) $\times$ (cycle count) without clearly dividing by $100$, and the unstructured report indicates the result is then rounded to an integer. This can create many artificial zeros whenever expected births are $<0.5$ and compound discretization error if the published percentage is itself rounded. The strong results about ‘zero live births’ (Sec. 3.5) may therefore be driven partly by construction artifacts rather than clinical performance.
Recommendation: In Sec. 2.2, define the output precisely and fix units: $\text{Output\_LiveBirths} = (\text{LBRate\_percent}/100) \times \text{Cycle\_Count}$ if LBRate is a percent. Do not round outputs; DEA permits nonnegative real outputs. Re-run the full analysis with continuous outputs and report how (i) the fraction of zero-output DMUs changes, (ii) efficiency distributions change, and (iii) the Sec. 3.5 sensitivity conclusions change. As an explicit robustness check (Appendix or Sec. 3.7), compare efficiency under (a) continuous outputs, (b) current integer-rounded outputs, and (c) alternative rounding rules (floor/ceiling) to quantify sensitivity.
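A small illustration of how rounding alone can manufacture zeros is given below. It is a sketch under stated assumptions: the helper names and the four (rate, cycle count) pairs are hypothetical and are not taken from NASS, and Python's built-in `round` (which rounds halves to the nearest even integer) stands in for whatever rounding rule the manuscript used.

```python
def constructed_births(rate_percent, cycles, rounded):
    """Live births implied by a published percentage and a cycle count."""
    value = rate_percent / 100.0 * cycles  # divide by 100: the rate is a percent
    # Built-in round() is an assumed stand-in for the manuscript's rounding rule.
    return round(value) if rounded else value


def zero_share(outputs):
    """Fraction of DMUs whose constructed output is exactly zero."""
    return sum(1 for v in outputs if v == 0) / len(outputs)


# Hypothetical (rate in percent, cycle count) pairs; not taken from NASS.
records = [(4.0, 10), (2.5, 15), (30.0, 50), (10.0, 22)]

continuous = [constructed_births(r, c, rounded=False) for r, c in records]
integer = [constructed_births(r, c, rounded=True) for r, c in records]

print(zero_share(continuous))  # 0.0
print(zero_share(integer))     # 0.5
```

Here two of the four DMUs have expected births of 0.4 and 0.375; both become exact zeros once rounded, although neither clinic reported a zero rate.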
-
The interpretation of input-oriented efficiency is not well aligned with ART clinical objectives, and the chosen orientation and returns-to-scale assumptions are under-justified (Sec. 1, Sec. 2.3, Sec. 4.3). Input orientation asks by how much the number of intended retrieval cycles could be reduced while holding live births fixed. Yet cycles are demand-driven patient treatments rather than a resource that is easily “minimized,” and many readers will interpret the results as ‘quality’ rather than ‘input contraction.’
Recommendation: Expand justification in Sec. 2.3 for (i) input orientation and (ii) VRS (BCC) with citations to health-care DEA practice. Add at least one robustness model that better matches the clinical aim of improving success per attempt: e.g., output-oriented BCC (maximize live births given cycles), and optionally CRS vs VRS. Summarize whether the core patterns (low scores, age gradient, frontier shares) persist (Sec. 3.7 / Appendix). Tighten language in Sec. 3.3–3.6 and Sec. 4.2–4.3 so $\theta$ is interpreted as radial input reduction (not “ability to increase outputs”) unless an output-oriented model is also presented.
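For reference when comparing orientations, the two envelopment programs can be written side by side. This is the textbook BCC formulation and not necessarily the manuscript's notation: $\lambda_j$, $\phi$, and $x_j$ are symbols assumed here, with $x_j$ the intended retrievals and $y_j$ the live births of DMU $j$ in the stratum, and $o$ indexing the DMU under evaluation.

$$
\begin{aligned}
\text{BCC-I:}\quad & \min_{\theta,\lambda}\ \theta \quad \text{s.t.}\quad \sum_j \lambda_j x_j \le \theta x_o,\ \ \sum_j \lambda_j y_j \ge y_o,\ \ \sum_j \lambda_j = 1,\ \ \lambda_j \ge 0, \\
\text{BCC-O:}\quad & \max_{\phi,\lambda}\ \phi \quad \text{s.t.}\quad \sum_j \lambda_j x_j \le x_o,\ \ \sum_j \lambda_j y_j \ge \phi y_o,\ \ \sum_j \lambda_j = 1,\ \ \lambda_j \ge 0.
\end{aligned}
$$

Under BCC-I, $\theta \le 1$ is the radial factor by which cycles could shrink at fixed births. Under BCC-O, $\phi \ge 1$ is the factor by which births could expand at fixed cycles, usually reported as $1/\phi$. Under VRS the two scores generally differ, which is why the robustness run is informative. Note also that when $y_o = 0$ the BCC-O objective is unbounded, so zero-output DMUs would need separate handling in that run.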
-
Case-mix adjustment beyond coarse age bands is insufficient, so ‘inefficiency’ may largely reflect patient severity/selection and clinic environment rather than performance (Sec. 1, Sec. 2.1–2.2, Sec. 3.3–3.6, Sec. 4.3). Within age groups, outcomes vary by diagnosis, ovarian reserve, prior ART history, use of ICSI/PGT, embryo transfer practices, comorbidities, and socioeconomic factors; ignoring these factors risks confounding and can create ethically problematic incentives if the scores are interpreted as rankings.
Recommendation: Create a dedicated Limitations subsection in Sec. 4.3 explicitly stating that scores are conditional on minimal risk adjustment and should not be interpreted as causal performance. If additional variables are available in NASS (or via linkage), add either: (i) a two-stage analysis (DEA then regression of scores on environmental/case-mix proxies), (ii) a conditional/non-discretionary-input DEA variant, or (iii) restrictions to more homogeneous subgroups. At minimum, add exploratory stratifications/correlates in Sec. 3.6 (e.g., by clinic volume, region, ownership if available) and qualify all cross-clinic comparisons accordingly.
-
The DMU definition (clinic–year–age group) implies stratum-specific frontiers and complicates cross-stratum statements (e.g., ‘efficiency decreases with age’) because these are comparisons across separate DEA runs rather than a single unified technology (Sec. 2.1, Sec. 3.3–3.4). Additionally, VRS ‘scale’ in this setup is essentially the number of intended retrievals within the stratum, which may not map cleanly to clinic operational scale.
Recommendation: In Sec. 2.1 and at the start of Sec. 3.3, clarify that efficiencies are computed relative to a year$\times$age-specific frontier and are not directly comparable across strata unless you adopt pooling, a meta-frontier, or a normalization strategy. If the paper’s narrative emphasizes age gradients, add a robustness check using pooled models (e.g., include age group as a categorical environmental factor, or estimate a meta-frontier) and explicitly discuss what changes. Consider adding a clinic-level aggregation robustness run to show how results differ when the DMU is ‘clinic-year’ (Sec. 3.7 / Appendix).
-
Zero and near-zero outputs are prevalent and materially shape the frontier, but the manuscript does not fully explain the mathematical behavior of the BCC model with $y_o = 0$, nor separate true zeros from construction/suppression-induced zeros (Sec. 2.2–2.3, Sec. 3.2.3–3.2.4, Sec. 3.5). Some statements risk implying that $y_o = 0$ forces very low $\theta$, which is not mathematically necessary under BCC-I.
Recommendation: In Sec. 2.3 and Sec. 3.5, add a short analytic explanation of how the LP behaves when $y_o = 0$ (output constraint becomes nonbinding; efficiency depends on input minimality within the convex hull). In Sec. 3.2.3–3.2.4, report the proportion of DMUs with (constructed) zero output by age and year and—after fixing output construction (continuous, no rounding)—reassess how many zeros remain. Add robustness checks excluding (a) zero-output DMUs and (b) very small $\text{Cycle\_Count}$ DMUs (where discretization dominates), and report how frontier composition and mean/median $\theta$ change (Sec. 3.7 / Appendix).
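The nonbinding-constraint argument can be written out in a few lines. This sketch assumes the standard BCC-I envelopment form with a single input $x_j > 0$ (intended retrievals) and nonnegative outputs $y_j$; the weights $\lambda_j$ are notation assumed here rather than taken from the manuscript.

$$
\begin{aligned}
\theta^* &= \min_{\theta,\lambda}\Bigl\{\theta : \sum_j \lambda_j x_j \le \theta x_o,\ \sum_j \lambda_j y_j \ge y_o,\ \sum_j \lambda_j = 1,\ \lambda_j \ge 0\Bigr\} \\
y_o = 0 &\;\Rightarrow\; \sum_j \lambda_j y_j \ge 0 \ \text{holds for every feasible } \lambda \quad (\text{since } y_j \ge 0) \\
&\;\Rightarrow\; \theta^* = \frac{1}{x_o}\,\min_{\lambda}\sum_j \lambda_j x_j = \frac{\min_j x_j}{x_o}.
\end{aligned}
$$

A zero-output DMU therefore scores $\theta^* = 1$ whenever it has the smallest cycle count in its stratum. When it scores low, that is because other DMUs in the stratum are smaller, not because its output is zero.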
-
Suppressed/missing NASS cells (‘$*$’, ‘$-$’) are dropped, which likely removes small-volume clinics/strata non-randomly and can bias efficiency distributions and frontier identification (Sec. 2.2, Sec. 3.1–3.2). Because suppression is often related to privacy thresholds, the missingness mechanism is plausibly informative.
Recommendation: In Sec. 2.2 and Sec. 3.1–3.2, quantify suppression and deletions by year and age group, and compare $\text{Cycle\_Count}$ distributions for kept vs dropped records to assess selection. Consider sensitivity bounds or interval imputation approaches consistent with suppression rules (even a simple ‘best/worst case’ for suppressed cells), or clearly state the likely direction of bias (e.g., under-representing low-volume clinics).
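The kept-versus-dropped comparison could start from something as simple as the sketch below. The record layout, the field names `cycle_count` and `lb_rate`, and the six example rows are hypothetical, and the sketch assumes that the cycle count is still reported when the rate cell is suppressed.

```python
from statistics import median

# Suppression markers named in the review; their exact encoding in the
# raw files is an assumption.
SUPPRESSED_MARKERS = {"*", "-"}


def split_by_suppression(records):
    """Separate records with a usable rate from those with a suppressed one."""
    kept, dropped = [], []
    for rec in records:
        target = dropped if rec["lb_rate"] in SUPPRESSED_MARKERS else kept
        target.append(rec)
    return kept, dropped


# Hypothetical clinic-year-age records; not taken from NASS.
records = [
    {"cycle_count": 120, "lb_rate": "31.5"},
    {"cycle_count": 8, "lb_rate": "*"},
    {"cycle_count": 45, "lb_rate": "22.0"},
    {"cycle_count": 5, "lb_rate": "-"},
    {"cycle_count": 12, "lb_rate": "*"},
    {"cycle_count": 300, "lb_rate": "28.4"},
]

kept, dropped = split_by_suppression(records)
kept_median = median([r["cycle_count"] for r in kept])
dropped_median = median([r["cycle_count"] for r in dropped])
dropped_share = len(dropped) / len(records)

print(f"dropped share: {dropped_share:.2f}")
print(f"median cycles, kept: {kept_median}, dropped: {dropped_median}")
```

Reporting these summaries by year and age group would make any volume-related selection visible before the DEA is run.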
-
No statistical inference or stability analysis is provided, despite DEA’s sensitivity to sampling variation, measurement error, and outliers—especially with many small-volume DMUs and constructed outputs (Sec. 3.3–3.4). Statements that year-to-year changes are ‘minor’ are not supported by formal uncertainty quantification (Sec. 3.4.1).
Recommendation: Add a robustness/inference component: within each stratum, use bootstrap DEA (e.g., Simar–Wilson style) or at minimum resampling-based confidence intervals for mean/median efficiency and bias-corrected scores. Complement with stability checks: trimming/winsorizing extreme DMUs, excluding very small $\text{Cycle\_Count}$, comparing DEA vs FDH. In Sec. 3.4.1, either provide uncertainty intervals for year comparisons or explicitly label the temporal analysis as purely descriptive and avoid inferential wording.
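As a floor for the "at minimum" option, a percentile bootstrap interval for the median score within one stratum might look like the sketch below. This is deliberately not the Simar–Wilson procedure: it treats the estimated scores as fixed data and resamples them, so it ignores the estimation error in the frontier itself and should be read only as a descriptive interval. The function name, its defaults, and the example scores are hypothetical.

```python
import random
from statistics import median


def bootstrap_median_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the median of fixed DEA scores.

    Resamples the scores with replacement; the frontier is NOT re-estimated,
    so this is a naive descriptive interval, not a Simar-Wilson bootstrap.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    n = len(scores)
    medians = sorted(median(rng.choices(scores, k=n)) for _ in range(n_boot))
    lo = medians[int((alpha / 2) * n_boot)]
    hi = medians[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Hypothetical efficiency scores for one year-by-age stratum.
scores = [0.12, 0.18, 0.21, 0.25, 0.25, 0.31, 0.40, 0.55, 1.00, 1.00]

lo, hi = bootstrap_median_ci(scores)
print(f"median = {median(scores):.3f}, 95% interval = [{lo:.3f}, {hi:.3f}]")
```

Running this separately per year would give intervals that can be compared across years in Sec. 3.4.1; overlapping intervals would support describing the changes as minor, though only in the descriptive sense noted above.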
-
Positioning, novelty, and implications are underdeveloped, and the manuscript lacks substantive engagement with prior DEA-in-health/ART efficiency literature; practical meaning of $\theta \approx 0.25$ is not concretely interpreted (Sec. 1–2, Sec. 3.6, Sec. 4.2–4.3). This also heightens the risk that readers treat the results as clinic ‘rankings’ rather than conditional benchmarks with major limitations.
Recommendation: Add a Related Work subsection (Sec. 1.1 or Sec. 2.x) summarizing DEA applications in health care and any ART/fertility clinic benchmarking, including typical input/output choices and risk-adjustment practices. Near the end of Sec. 1, state clear research questions and contributions. In Sec. 3.6 and Sec. 4.3, translate efficiency scores into concrete DEA interpretations (input contraction under input orientation; output expansion under output orientation if added), provide at least one worked example with peer/reference sets, and add a dedicated Ethical/Policy Considerations paragraph cautioning against simplistic rankings and noting missing safety/equity/patient-centered outcomes.
-
Reproducibility and readability are impaired by the reliance on internal file paths for key results and by missing or unclear data and code availability information (Sec. 2.4, Sec. 3.1–3.5).
Recommendation: Replace internal path references (e.g., `data/dea_analysis_results/...`) with numbered tables/figures in the paper or appendices, and include key numeric summaries directly in Sec. 3 (e.g., quartiles of $\theta$, zero-output shares, frontier counts). Add a Data and Code Availability statement (end of Sec. 2 or in Sec. 4) describing what can be shared, with a public repository link and enough documentation to rerun the pipeline (including versions, solver method/options, and data-processing scripts).