-
Mechanical viability classifier: extreme class imbalance (207 viable vs. only 8 unviable) and unclear construct validity of the negative label (Sec. 2.3, Sec. 3.2). With only 8 negatives, reported accuracy/F1 can be misleading (a trivial model that labels every compound viable already reaches 207/215 ≈ 96% accuracy), ROC-AUC estimates are unstable, and the model may be learning “similarity to the curated subset” or detecting calculation artifacts rather than true mechanical instability. Because this classifier and its uncertainty discount directly shape the Pareto front (Sec. 3.5), the evidential basis for the mechanical screening claims is currently weak.
Recommendation: Strengthen Sec. 2.3 and Sec. 3.2 with (i) label provenance and (ii) imbalance-appropriate evaluation: (1) Clearly explain how the 8 unviable points were identified and, if available in the source database/workflow, distinguish failed elastic calculations (numerical issues, non-convergence) from physically unstable outcomes (e.g., violated Born criteria, negative eigenvalues of elastic tensor, phonon instabilities). (2) Report per-fold (or aggregated) confusion matrices plus class-wise precision/recall/F1, emphasizing recall on the unviable class; include precision–recall curves or average precision for the minority class. (3) Add robustness/sensitivity analysis suited to tiny negatives: leave-one-negative-out experiments and/or bootstrapped confidence intervals for metrics; show whether model behavior depends on one or two negative points. (4) If possible, expand the negative set systematically (e.g., by using elastic-tensor validity checks, Born stability criteria where tensors exist, flagged failed calculations, or known unstable phases) and re-train/re-evaluate. (5) If expansion is not possible, temper claims in Sec. 4: present the viability classifier explicitly as a preliminary/conservative artifact/instability filter rather than a comprehensive detector of mechanical failure.
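To illustrate item (3), the following is a minimal sketch of a percentile bootstrap interval for recall on the unviable class. It is not the authors' pipeline: the function names, the label convention (0 = unviable, 1 = viable), and the example predictions are all hypothetical, chosen only to match the 207/8 class counts quoted above.

```python
import random

def minority_recall(y_true, y_pred, minority=0):
    """Fraction of true minority-class points that were predicted as minority."""
    preds = [p for t, p in zip(y_true, y_pred) if t == minority]
    return sum(p == minority for p in preds) / len(preds)

def bootstrap_recall_ci(y_true, y_pred, minority=0, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for minority-class recall.

    Only minority-class points enter the recall, so resampling is done within
    that class (a stratified bootstrap with the class count held fixed).
    """
    rng = random.Random(seed)
    preds = [p for t, p in zip(y_true, y_pred) if t == minority]
    stats = sorted(
        sum(rng.choice(preds) == minority for _ in preds) / len(preds)
        for _ in range(n_boot)
    )
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi

# Hypothetical predictions (assumption, not the paper's results):
# all 207 viable points correct, 5 of the 8 unviable points detected.
y_true = [1] * 207 + [0] * 8
y_pred = [1] * 207 + [0] * 5 + [1] * 3

print(minority_recall(y_true, y_pred))      # 0.625
print(bootstrap_recall_ci(y_true, y_pred))  # wide interval, since n = 8
```

With eight negatives the recall can only take values in steps of 1/8, so the interval is necessarily coarse and wide; reporting it alongside the point estimate makes that limitation visible to readers.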
-
Data provenance and curation are under-specified for both the 1283-compound dataset and the 1068 “uncharacterized” screening set (Sec. 2.1, Sec. 2.3, Sec. 2.5). Given the paper’s emphasis on imperfect HT-DFT data, readers need to know which database/workflow produced energies, structures, tilts, and elastic quantities; how duplicates/polymorphs were handled; and how perovskite topology/Glazer tilt systems were assigned. Similarly, the elastic filtering thresholds (e.g., $0 < K_{\rm VRH} < 300$ GPa, $G_{\rm VRH} > 0$) and their impact on retained/excluded points are not justified quantitatively.
Recommendation: Expand Sec. 2.1, Sec. 2.3, and Sec. 2.5 (or add a dedicated data/curation subsection) to include: (1) explicit source database(s) and citations; key DFT workflow details relevant to energies and elastics (functional, pseudopotentials, k-point density, cutoffs, relaxation/stress criteria; or a pointer to the workflow documentation). (2) How duplicates/polymorphs were treated (e.g., lowest $E_{\rm hull}$ retained; how multiple structures per composition are handled). (3) The algorithm/rules mapping space group (or structure) to Glazer tilt labels, and treatment of ambiguous/unknown tilt assignments. (4) For elastics: show distributions/histograms of $K_{\rm VRH}$ and $G_{\rm VRH}$ and report how many points are excluded by each rule (negativity vs upper bound); justify the $K_{\rm VRH} < 300$ GPa cutoff with references or dataset-driven reasoning, and clarify whether excluded points represent artifacts vs genuine high stiffness. (5) Define precisely how the 1068 “uncharacterized” set is derived from the 1283 total (missing elastic data only? missing experimental reports? something else), and confirm all 1068 satisfy the structural perovskite criterion used in the study.
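The per-rule exclusion counts requested in item (4) could be tabulated along the following lines. This is a sketch under stated assumptions: the thresholds are the ones quoted above ($0 < K_{\rm VRH} < 300$ GPa, $G_{\rm VRH} > 0$), while the function name, the input layout (pairs of moduli in GPa), and the example values are hypothetical.

```python
def elastic_filter_report(records, k_max=300.0):
    """Count how many (K_VRH, G_VRH) pairs fail each filtering rule.

    A point can violate more than one rule, so the per-rule counts are
    not mutually exclusive and need not sum to the number excluded.
    """
    counts = {
        "total": 0,
        "retained": 0,
        "K_nonpositive": 0,
        "K_at_or_above_max": 0,
        "G_nonpositive": 0,
    }
    for k_vrh, g_vrh in records:
        k_low = k_vrh <= 0
        k_high = k_vrh >= k_max
        g_low = g_vrh <= 0
        counts["total"] += 1
        counts["K_nonpositive"] += int(k_low)
        counts["K_at_or_above_max"] += int(k_high)
        counts["G_nonpositive"] += int(g_low)
        if not (k_low or k_high or g_low):
            counts["retained"] += 1
    return counts

# Synthetic values in GPa, for illustration only.
example = [(150.0, 80.0), (-20.0, 30.0), (450.0, 120.0), (90.0, -5.0), (-10.0, -3.0)]
report = elastic_filter_report(example)
print(report)
```

Reporting the counts separately for the negativity rules and the upper bound shows whether the 300 GPa cutoff removes a handful of outliers or a meaningful share of the data.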
-
Uncertainty discount and Pareto-front candidate selection are under-specified and insufficiently validated (Sec. 2.3, Sec. 2.5, Sec. 3.2, Sec. 3.5). The GPR predictive variance is used as an epistemic uncertainty proxy and combined multiplicatively with classifier probabilities via an ad hoc min–max normalization (Eq. (1)). However, the variance is not calibrated or validated (does higher variance actually imply higher error?), and min–max scaling makes the discount depend on the candidate set and sensitive to outliers: a single extreme-variance candidate sets the maximum and compresses the normalized values of all others. The Pareto-front extraction procedure and any tie-breaking/secondary ranking for Table 1 are described only qualitatively, limiting reproducibility and confidence in the stability of the 16 reported candidates.
Recommendation: In Sec. 2.3 and Sec. 2.5, fully specify and validate the uncertainty and optimization steps: (1) Provide the exact Pareto-front extraction algorithm (e.g., non-dominated sorting in the 2D objective space) and any subsequent ranking rule used to order/select the 16 materials in Table 1; include pseudocode or a concise step-by-step description. (2) Evaluate uncertainty quality for $K_{\rm VRH}/G_{\rm VRH}$ GPRs: calibration/coverage of prediction intervals and/or a binned plot of predicted variance vs absolute error under cross-validation (show that variance correlates with error). (3) Add sensitivity/robustness checks in Sec. 3.5: recompute fronts under bootstrap/CV variability and report how often each candidate appears; test at least one alternative discounting scheme (e.g., quantile-based scaling, clipping, a tunable penalty strength, or a lower-confidence-bound style score). (4) Explicitly define in-text the domain for min/max in Eq. (1) (screened set vs training set) and address outlier sensitivity (e.g., use robust scaling or clipping).
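As an example of the level of detail requested in item (1), a non-dominated filter in a two-objective space can be stated in a few lines. The sketch below assumes both objectives are maximized (e.g., stability probability and discounted viability probability); whether the manuscript uses this exact definition is not stated, and the function name and example points are hypothetical.

```python
def pareto_front(points):
    """Return indices of non-dominated points when both objectives are maximized.

    Point b dominates point a if b is at least as good in both objectives
    and strictly better in at least one.
    """
    front = []
    for i, (a1, a2) in enumerate(points):
        dominated = any(
            b1 >= a1 and b2 >= a2 and (b1 > a1 or b2 > a2)
            for j, (b1, b2) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front

# Hypothetical (objective 1, objective 2) scores for five candidates.
scores = [(0.9, 0.2), (0.8, 0.8), (0.5, 0.9), (0.4, 0.4), (0.8, 0.8)]
print(pareto_front(scores))  # [0, 1, 2, 4]
```

Note that exact ties (indices 1 and 4 here) both remain on the front under this definition, which is one reason the secondary ranking rule used to select the Table 1 entries needs to be stated explicitly.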
-
Potential information leakage / shortcut learning in thermodynamic stability prediction due to inclusion of formation energy features when the label is $E_{\rm hull}$ (Sec. 2.2, Sec. 3.1, Sec. 3.4.1). Formation energy is thermodynamically closely tied to hull construction; if it is computed from the same DFT energies used to build $E_{\rm hull}$ in the same database, the stability classifier may be learning a near-direct proxy rather than generalizable structure–chemistry relationships, complicating interpretability claims and transferability under LOCO.
Recommendation: Clarify in Sec. 2.1–2.2 the provenance of formation energy (is it computed from the same total energies and reference states used for the convex hull in the same dataset?). Then add an ablation study in Sec. 3.1: report LOCO performance with and without formation energy (and optionally other highly correlated global energetics features), and comment on how SHAP attributions change. If performance drops substantially, frame conclusions accordingly (the model is powerful but partly leverages database-specific thermodynamic bookkeeping); if performance remains strong, that strengthens claims of learned transferable descriptors.
-
Probabilities are used directly as screening objectives (stability probability, viability probability) but probability calibration is not assessed (Sec. 2.2, Sec. 2.5, Sec. 3.1, Sec. 3.2). Miscalibrated probabilities can distort the Pareto geometry and systematically favor certain chemistries (e.g., specific A-site clusters) even if ranking metrics like AUC look acceptable.
Recommendation: Add calibration diagnostics for both classifiers in Sec. 3.1 and Sec. 3.2: reliability diagrams and summary metrics (Brier score, ECE), ideally also stratified by LOCO fold/A-site cluster to detect chemistry-dependent miscalibration. If miscalibration is non-trivial, apply Platt scaling or isotonic regression (trained properly within the CV scheme) and use calibrated probabilities in Sec. 2.5 screening; report how calibration changes the Pareto front and Table 1 candidates (Sec. 3.5).
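For concreteness, the two summary metrics could be computed as sketched below. This assumes binary labels (1 = positive class), predicted positive-class probabilities, and the equal-width-bin form of ECE that compares the observed positive fraction with the mean predicted probability in each bin; other ECE variants exist, and the function names and bin count are illustrative choices rather than anything specified in the manuscript.

```python
def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)

def expected_calibration_error(y_true, p_pred, n_bins=10):
    """Equal-width-bin ECE on the positive-class probability.

    For each bin, compare the observed positive fraction with the mean
    predicted probability, weighting by the share of samples in the bin.
    """
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, p_pred):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))
    n = len(y_true)
    ece = 0.0
    for members in bins:
        if members:
            observed = sum(y for y, _ in members) / len(members)
            predicted = sum(p for _, p in members) / len(members)
            ece += len(members) / n * abs(observed - predicted)
    return ece

# Hypothetical overconfident classifier: predicts 0.9 four times, correct once.
y = [1, 0, 0, 0]
p = [0.9, 0.9, 0.9, 0.9]
print(brier_score(y, p), expected_calibration_error(y, p))
```

Computing the same two numbers separately within each LOCO fold or A-site cluster would provide the stratified view suggested above, though per-bin counts become small quickly and should be reported.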
-
Reproducibility/implementation detail is incomplete across modeling stages (Sec. 2.1–2.4, Sec. 3.3). Several choices materially affect results but are not fully specified: feature preprocessing (scaling/transforms/imputation), categorical encoding (tilt systems/space group/crystal system), hyperparameters and tuning for GBC/GBR/GPR and the hurdle model, class weighting/resampling for imbalance, train/test split protocol and seeds, and how VRH moduli were computed from elastic tensors (if applicable).
Recommendation: Add a concise but complete implementation subsection (new Sec. 2.6 or expanded Sec. 2.1–2.4) listing: (1) the full feature list and preprocessing steps (scaling, log transforms, missing-value handling); (2) categorical encodings for structural/tilt descriptors and how unknown/ambiguous labels are treated; (3) model hyperparameters and tuning strategy (grid/random/Bayesian search; objective; CV setup), including any class weights; (4) exact split protocols and random seeds for each stage (including hurdle model); and (5) the procedure/software used to compute $K_{\rm VRH}/G_{\rm VRH}$ from elastic tensors (or the database field definitions). Provide code and/or processed datasets if possible, or at minimum enough detail to reproduce Table 1.
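Regarding item (5), if the moduli are derived from full elastic tensors, the standard Voigt-Reuss-Hill averages can be documented compactly. The sketch below assumes a 6x6 stiffness matrix in Voigt notation (GPa) and computes the compliance by direct inversion; it is a generic illustration, not a description of what the authors or their source database actually did, and the helper names and example constants are hypothetical.

```python
def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting for a small square matrix."""
    n = len(matrix)
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular stiffness matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vrh_moduli(c):
    """Return (K_VRH, G_VRH) from a 6x6 Voigt-notation stiffness matrix."""
    s = invert(c)  # compliance matrix in Voigt notation

    def sums(m):
        diag = m[0][0] + m[1][1] + m[2][2]
        off = m[0][1] + m[1][2] + m[0][2]
        shear = m[3][3] + m[4][4] + m[5][5]
        return diag, off, shear

    c_diag, c_off, c_shear = sums(c)
    s_diag, s_off, s_shear = sums(s)
    k_voigt = (c_diag + 2 * c_off) / 9
    g_voigt = (c_diag - c_off + 3 * c_shear) / 15
    k_reuss = 1 / (s_diag + 2 * s_off)
    g_reuss = 15 / (4 * s_diag - 4 * s_off + 3 * s_shear)
    # Hill average: arithmetic mean of the Voigt and Reuss bounds.
    return (k_voigt + k_reuss) / 2, (g_voigt + g_reuss) / 2

def cubic_stiffness(c11, c12, c44):
    """Build the 6x6 stiffness matrix for a cubic crystal (illustrative helper)."""
    c = [[0.0] * 6 for _ in range(6)]
    for i in range(3):
        for j in range(3):
            c[i][j] = c11 if i == j else c12
        c[i + 3][i + 3] = c44
    return c

# Hypothetical cubic constants in GPa, for illustration only.
k_vrh, g_vrh = vrh_moduli(cubic_stiffness(300.0, 100.0, 50.0))
print(k_vrh, g_vrh)
```

If the moduli were instead taken directly from precomputed database fields, stating the field names and the averaging scheme they correspond to would serve the same purpose.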