-
Identification assumptions for IPW (conditional ignorability, i.e., missing at random (MAR), and positivity) are not explicitly stated, justified, or stress-tested, yet they are essential for interpreting the weighted results as unbiased population estimates (Introduction, Secs. 2.3–2.5, Secs. 3.2–3.5, Sec. 4). The current propensity models condition only on orbital elements and diameter, but many plausible drivers of whether a property is measured are not represented: apparent brightness/$H$, observing geometry/phase-angle coverage, survey footprint/strategy, discovery circumstances, follow-up targeting (e.g., unusual colors/lightcurve amplitude), NEO status/MOID, number of observations/arc length, and epoch. For endpoints like spectral type and spin period, it is plausible that the property value (or proxies correlated with it) influences follow-up even after conditioning on $(a,e,i,\Omega,\omega,D)$, which would violate MAR and bias the IPW “corrections.”
Recommendation: Add a dedicated “Assumptions and identification” subsection (end of Sec. 2.4 or start of Sec. 2.5) that: (i) states the estimand (distribution over the master catalog) and the required assumptions (e.g., $\mathrm{Observed}_X \perp X\,|\, Z$ and positivity $0<P(\mathrm{Observed}_X=1|Z)<1$); (ii) argues property-by-property why the assumption is plausible or likely violated; and (iii) discusses likely directions of bias under violations. Where feasible, expand $Z$ with more proximate observability/targeting covariates available at scale (e.g., absolute magnitude $H$, perihelion distance $q$, MOID/NEO flag, number of astrometric observations/arc length/oppositions, discovery epoch, survey/catalog-of-origin flags). Report in Sec. 3.1 whether discrimination/calibration improve and, crucially, whether key weighted conclusions in Secs. 3.2–3.4 are stable.
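For concreteness, the identification argument the requested subsection would need to spell out can be sketched as follows. The notation is an assumption made here for illustration and is not taken from the manuscript: $R$ is the indicator $\mathrm{Observed}_X$, $\pi(Z)=P(R=1\mid Z)$ is the propensity, and $g$ is any summary function of the property $X$.

```latex
\begin{aligned}
\mathbb{E}\!\left[\frac{R\,g(X)}{\pi(Z)}\right]
  &= \mathbb{E}\!\left[\frac{g(X)}{\pi(Z)}\,\mathbb{E}[R \mid X, Z]\right]
     && \text{(iterated expectations)} \\
  &= \mathbb{E}\!\left[\frac{g(X)}{\pi(Z)}\,\pi(Z)\right]
     && \text{(MAR: } R \perp X \mid Z\text{)} \\
  &= \mathbb{E}[g(X)]
     && \text{(positivity: } \pi(Z) > 0\text{)}
\end{aligned}
```

If MAR fails, so that $\mathbb{E}[R \mid X, Z] = \pi(X,Z) \neq \pi(Z)$, the same calculation yields $\mathbb{E}[g(X)\,\pi(X,Z)/\pi(Z)]$, which over-represents property values that attract follow-up. This is the direction-of-bias discussion requested in item (iii).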
-
Using diameter as a “universal” predictor is potentially problematic and may induce endogeneity or hidden selection, undermining the propensity model and its interpretation (Secs. 2.1–2.2, Sec. 3.1). In many catalogs, diameter is not directly observed for all asteroids; it is either derived from thermal IR surveys (which have their own selection effects) or inferred from brightness under an assumed albedo. If diameter is missing for a nontrivial subset, or if its construction is correlated with the same follow-up processes that generate the sparse properties, it may not be an appropriate conditioning variable and can leak selection information in ways that are hard to interpret.
Recommendation: In Secs. 2.1–2.2, document the provenance of Diameter\_km (source(s), method of derivation, coverage fraction, and how conflicts are resolved). Report the fraction of the 1.45M-object catalog with a diameter and the fraction used in each propensity model. Add an ablation/sensitivity study (Sec. 3.1 and/or Appendix): fit propensity models (a) with diameter, (b) without diameter, and (c) with $H$ (and optionally both $H$ and diameter) where available; compare calibration/overlap and show the effect on at least the headline corrected quantities (e.g., C/S fractions, median spin period, top family frequencies in Secs. 3.2–3.3). If diameter is near-universal only because of imputation/assumptions, state this clearly and discuss implications in Sec. 4.
-
Calibration and overlap (positivity) diagnostics are not quantified, despite being central to IPW stability and credibility (Secs. 2.3.4–2.4, Sec. 3.1, Figs. 6–11). Visual calibration curves alone are insufficient because even small miscalibration of probabilities in the low-propensity tail can dominate weighted estimates. The manuscript also does not report overlap diagnostics (the distribution of propensities among observed objects, including minimum and percentiles), effective sample size (ESS) after weighting, or how much of the inference is driven by a small subset of extreme weights.
Recommendation: In Sec. 3.1, add quantitative calibration metrics on a held-out test set for each propensity model (Brier score and either ECE or calibration slope/intercept; optionally log-loss). Include tail-focused diagnostics (e.g., calibration restricted to $\hat p<0.01$ and $\hat p<0.001$, with bin counts). Add overlap/positivity diagnostics per property: summary of $\hat p$ among observed objects, weight quantiles, ESS (e.g., $(\sum w)^2/\sum w^2$), and the fraction of total weight carried by the top 1% of objects. If post-hoc calibration (Platt/isotonic) is used, specify it in Sec. 2.3.4 and show metric improvements.
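The diagnostics requested above are inexpensive to compute. The following is a minimal sketch in plain Python, assuming `p_hat` is a list of held-out propensity estimates, `observed` the matching 0/1 indicators, and `weights` the inverse-probability weights of the observed objects; all function names are hypothetical and not taken from the manuscript's code.

```python
def brier_score(p_hat, observed):
    """Mean squared difference between predicted propensity and the 0/1 indicator."""
    return sum((p - r) ** 2 for p, r in zip(p_hat, observed)) / len(p_hat)


def tail_calibration(p_hat, observed, threshold):
    """Bin count, mean predicted propensity, and observed rate for p_hat < threshold."""
    tail = [(p, r) for p, r in zip(p_hat, observed) if p < threshold]
    if not tail:
        return 0, None, None
    n = len(tail)
    return n, sum(p for p, _ in tail) / n, sum(r for _, r in tail) / n


def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum w^2."""
    s1 = sum(weights)
    s2 = sum(w * w for w in weights)
    return s1 * s1 / s2


def top_weight_share(weights, frac=0.01):
    """Fraction of the total weight carried by the largest `frac` of objects."""
    k = max(1, int(frac * len(weights)))  # at least one object
    ordered = sorted(weights, reverse=True)
    return sum(ordered[:k]) / sum(weights)
```

Reporting `tail_calibration` at thresholds 0.01 and 0.001 alongside the bin count shows whether the predicted and observed rates agree where the weights are largest, and how many objects support that comparison.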
-
Sensitivity to truncation/stabilization choices and to propensity-model specification is not demonstrated, yet the reported debiasing can be highly sensitive to these design decisions (Sec. 2.4, Secs. 3.2–3.4). The current text is also ambiguous about “raw,” “truncated,” and “stabilized” weights (Sec. 2.4, Sec. 3.1, Fig. 11 captions), making it unclear which weights underpin each downstream table/figure.
Recommendation: In Sec. 2.4, provide explicit mathematical definitions for each weight variant used (e.g., raw $w_i=1/\hat p_i$; truncated $\tilde w_i=\min(w_i,c)$; stabilized $w_i^{\mathrm{stab}} = \Pr(\mathrm{Observed}=1)/\hat p_i$ if applicable) and state unambiguously which variant is used in each analysis (Secs. 3.2–3.4; Tables 2–6). Then perform a sensitivity analysis: recompute a small set of headline results under multiple caps (e.g., 95th/97.5th/99th/99.5th percentiles) and at least one alternative well-calibrated model class (e.g., logistic regression with calibration, or gradient boosting). Summarize the variation in Sec. 3.5 and temper claims in Sec. 4 accordingly.
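To make the requested cap sensitivity analysis concrete, here is a minimal sketch of the three weight variants and a cap sweep for a weighted proportion. It assumes `p_hat` holds fitted propensities for the observed objects and `flags` marks membership in the class of interest (e.g., C-type); the function names and the nearest-rank percentile rule are illustrative choices, not the authors' implementation.

```python
import math


def raw_weights(p_hat):
    """Raw inverse-probability weights w_i = 1 / p_i."""
    return [1.0 / p for p in p_hat]


def stabilized_weights(p_hat, marginal_rate):
    """Stabilized weights: Pr(Observed = 1) / p_i."""
    return [marginal_rate / p for p in p_hat]


def percentile_cap(weights, q):
    """Nearest-rank q-th percentile of the weights (0 < q <= 100)."""
    ordered = sorted(weights)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def truncate(weights, cap):
    """Truncated weights min(w_i, cap)."""
    return [min(w, cap) for w in weights]


def weighted_fraction(flags, weights):
    """Self-normalized (Hajek) weighted proportion of flagged objects."""
    return sum(w for f, w in zip(flags, weights) if f) / sum(weights)


def cap_sensitivity(flags, p_hat, percentiles=(95, 97.5, 99, 99.5)):
    """Recompute the weighted proportion under several truncation caps."""
    w = raw_weights(p_hat)
    results = {"raw": weighted_fraction(flags, w)}
    for q in percentiles:
        cap = percentile_cap(w, q)
        results[q] = weighted_fraction(flags, truncate(w, cap))
    return results
```

Note that for a self-normalized estimator such as `weighted_fraction`, multiplying every weight by the constant $\Pr(\mathrm{Observed}=1)$ cancels, so stabilized and raw weights give identical results; the distinction matters only for unnormalized estimators or when the cap is stated on an absolute scale. This is one reason the manuscript should state which estimator each weight variant feeds.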
-
Uncertainty quantification is absent for the bias-corrected estimates, preventing assessment of which reported shifts are robust versus within statistical/model uncertainty (Secs. 3.2–3.4, Tables 2–6, Sec. 3.5). IPW typically increases variance, and here weights are also estimated, introducing additional uncertainty. Without confidence intervals (CIs) or standard errors (SEs), statements about changes (e.g., modest changes in taxonomic fractions or spin-period summaries) cannot be evaluated for practical/statistical significance.
Recommendation: Augment Secs. 3.2–3.4 and Tables 2–6 with uncertainty for key quantities (weighted means/medians/proportions and selected cross-feature summaries). Prefer a two-stage bootstrap that resamples asteroids and refits the propensity model (or, if too expensive, a hybrid bootstrap that reuses fitted propensities but resamples outcomes with weights, clearly labeled as partial). Report 95% CIs for headline shifts (e.g., C/S fractions, median spin period, top-N family fractions). In Sec. 3.5, explicitly distinguish results that remain meaningfully shifted under CIs/sensitivity analyses from those that do not.
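The two-stage bootstrap can be sketched as follows. To keep the example self-contained, a deliberately simple stand-in replaces the manuscript's Random Forest: the propensity is the observed fraction within each covariate bin. The bin labels, function names, and clipping constant are all assumptions made for illustration; the essential point is that the propensity model is refit inside every bootstrap replicate.

```python
import random


def fit_binned_propensity(z_bins, observed):
    """Stand-in propensity model: observed fraction within each covariate bin."""
    counts, hits = {}, {}
    for z, r in zip(z_bins, observed):
        counts[z] = counts.get(z, 0) + 1
        hits[z] = hits.get(z, 0) + r
    return {z: hits[z] / counts[z] for z in counts}


def hajek_mean(x, observed, p_hat, eps=1e-3):
    """Self-normalized IPW mean over observed objects, with clipped propensities."""
    num = den = 0.0
    for xi, r, p in zip(x, observed, p_hat):
        if r:
            w = 1.0 / min(max(p, eps), 1.0 - eps)
            num += w * xi
            den += w
    return num / den if den > 0 else float("nan")


def two_stage_bootstrap_ci(z_bins, observed, x, n_boot=500, alpha=0.05, seed=0):
    """Percentile CI that resamples the whole catalog and refits the propensity."""
    rng = random.Random(seed)
    n = len(observed)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample objects
        zb = [z_bins[i] for i in idx]
        rb = [observed[i] for i in idx]
        xb = [x[i] for i in idx]
        model = fit_binned_propensity(zb, rb)  # stage 1: refit propensity
        p_hat = [model[z] for z in zb]
        est = hajek_mean(xb, rb, p_hat)  # stage 2: weighted estimate
        if est == est:  # drop NaN replicates (no observed objects drawn)
            estimates.append(est)
    estimates.sort()
    lo = estimates[int((alpha / 2) * (len(estimates) - 1))]
    hi = estimates[int((1 - alpha / 2) * (len(estimates) - 1))]
    return lo, hi
```

The hybrid variant mentioned above would move the `fit_binned_propensity` call outside the loop and reuse the fitted propensities in each replicate; it is cheaper but ignores the uncertainty in the estimated weights, so intervals from it should be labeled as partial.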
-
FamilyName and Age\_Gyr are treated like object-level “observed/not observed” endpoints, but their construction and selection mechanisms differ qualitatively from spectral/spin/obliquity and may be circular with the predictors (Sec. 2.1, Secs. 3.3.2–3.4). Family identification is typically derived from (proper) orbital elements; thus predicting “has\_FamilyName” from orbital elements may be partly tautological rather than a model of observational selection. Age estimates are often assigned at the family level (not per object), so interpreting IPW-weighted per-object age distributions requires careful definition of the estimand (object-weighted vs family-weighted) and selection process (propensity for families to have ages, not just objects).
Recommendation: In Sec. 2.1, define precisely: (i) what “FamilyName” represents (membership in any family vs background; which catalog; whether multiple memberships exist and how resolved), and (ii) what “Age\_Gyr” represents (per-family constant vs per-object estimate; source and reconciliation). In Sec. 2.3, justify whether modeling “has\_FamilyName” is intended as an observational selection process or merely a completeness-of-family-classification process; if the latter, explicitly reframe it. For ages, consider a hierarchical alternative: model (a) propensity to be assigned to a family, and (b) propensity for a family to have an age estimate; then report both object-weighted and family-weighted age summaries (Sec. 3.4). If a hierarchical treatment is out of scope, narrow claims about “intrinsic age distributions” (Sec. 4) and clearly state the estimand being estimated.
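The difference between the two estimands can be illustrated with a short sketch. It assumes each record is a hypothetical `(family, age_gyr, weight)` tuple with one age per family; the family labels and the optional `family_weights` argument (standing in for an inverse propensity that a family has an age estimate) are illustrative and not drawn from the manuscript.

```python
def object_weighted_mean_age(records):
    """Mean age where every member object counts, scaled by its IPW weight."""
    num = sum(w * age for _, age, w in records)
    den = sum(w for _, _, w in records)
    return num / den


def family_weighted_mean_age(records, family_weights=None):
    """Mean age where every family counts once (optionally family-level weights)."""
    ages = {}
    for family, age, _ in records:
        ages[family] = age  # age assumed constant within a family
    if family_weights is None:
        family_weights = {f: 1.0 for f in ages}
    num = sum(family_weights[f] * a for f, a in ages.items())
    den = sum(family_weights[f] for f in ages)
    return num / den
```

For example, with three members of a 1 Gyr family and one member of a 3 Gyr family, all with unit weight, the object-weighted mean is 1.5 Gyr while the family-weighted mean is 2.0 Gyr. Populous families dominate the former, which is why the manuscript should state which of the two it reports.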
-
Key implementation and estimator details remain under-specified, limiting reproducibility and making it difficult to verify that downstream weighted quantities are well-defined (Secs. 2.1–2.6, Sec. 3.1). This includes: data provenance/versioning; exact preprocessing (log transforms, handling of zeros/negatives, clipping $\hat p$ away from 0); train/validation/test splitting and seeds; Random Forest hyperparameter search spaces and final settings; whether and why features were scaled for the RF; and explicit formulas for the weighted estimators (Hájek vs Horvitz–Thompson, weighted quantiles/median, weighted correlation).
Recommendation: Expand Secs. 2.1–2.6 to include: (i) catalog sources, query dates, and selection criteria (numbered/multi-opposition, filtering); (ii) precise preprocessing steps with formulas (e.g., log or log1p with $\epsilon$; angular encoding; missing-value handling) and confirmation that any scaling is fit on training data only; (iii) split strategy (shared vs per-target splits, stratification, random seeds); (iv) hyperparameter ranges, CV setup, and chosen hyperparameters per model (table in main text or Appendix); and (v) explicit definitions of all weighted estimators used in Sec. 2.5 (and which software functions implement them). Also add explicit safeguards: clip $\hat p\in[\epsilon,1-\epsilon]$ prior to inversion and document $\epsilon$. Provide code and/or a DOI-linked repository in Sec. 2.6 where feasible.
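As an indication of the level of detail that would make the estimators verifiable, the following minimal sketch defines the clipping safeguard, the Horvitz–Thompson and Hájek means, and one common weighted-median convention. The value `eps=1e-3`, the function names, and the choice of median convention are assumptions for illustration; the manuscript should document its own choices.

```python
def clip_propensity(p, eps=1e-3):
    """Clip a propensity into [eps, 1 - eps] before inversion."""
    return min(max(p, eps), 1.0 - eps)


def horvitz_thompson_mean(x, p_hat, n_catalog, eps=1e-3):
    """Horvitz-Thompson mean: sum of x_i / p_i over observed objects, divided by N."""
    total = sum(xi / clip_propensity(p, eps) for xi, p in zip(x, p_hat))
    return total / n_catalog


def hajek_mean(x, p_hat, eps=1e-3):
    """Hajek mean: same numerator, normalized by the sum of weights instead of N."""
    w = [1.0 / clip_propensity(p, eps) for p in p_hat]
    return sum(wi * xi for wi, xi in zip(w, x)) / sum(w)


def weighted_median(x, weights):
    """Smallest value whose cumulative weight reaches half of the total weight."""
    pairs = sorted(zip(x, weights))
    half = 0.5 * sum(weights)
    cum = 0.0
    for xi, wi in pairs:
        cum += wi
        if cum >= half:
            return xi
```

The two means coincide only when the weights happen to sum to the catalog size $N$. With estimated propensities they generally do not, and the Hájek form is usually the more stable of the two, so stating which one underlies each table matters.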