-
Ambiguous and potentially inconsistent definitions of the scored population (“non-anomalous” vs “not eligible”), together with mixing across target-specific datasets, undermine Table 1 and the interpretation of “1,138 unique anomalous asteroids” (Sec. 3.1, 3.3.1–3.3.2; Table 1). Each target uses a different feature-complete subset (e.g., $\sim 10{,}340$ objects for diameter vs $\sim 6{,}396$ for spin vs $\sim 1{,}626$ for obliquity), yet the Results compare anomalies to very large “non-anomalous” counts that appear to include objects that were never scored or eligible. Differences in size and orbit may therefore reflect eligibility or measurement availability rather than anomaly status.
Recommendation: In Sec. 3.1 and Sec. 3.3.1–3.3.2, define eligibility and denominators unambiguously for each target/model: (i) $N_{\rm eligible}$ (feature-complete, scored), (ii) $N_{\rm flagged}$ within that eligible set, and (iii) anomaly rate per target/model. When summarizing properties (Table 1, Fig. 21), compare flagged objects to an appropriate control group drawn from the same eligible dataset (“scored and not flagged”), not to the full master catalog. If you also want a unified cross-target list (the 1,138 unique objects), include a flow table: master $\rightarrow$ eligible per task $\rightarrow$ flagged per task $\rightarrow$ union, and clearly state which statistics use which subset.
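To illustrate the requested flow table, the bookkeeping can be done with plain set arithmetic. The following is a minimal sketch only: the function name, the dictionary layout, and the toy object IDs are hypothetical and are not taken from the manuscript.

```python
def flow_table(master, eligible, flagged):
    """Tabulate master -> eligible per task -> flagged per task -> union.

    master   : set of all object IDs in the master catalog
    eligible : dict mapping task -> set of feature-complete, scored IDs
    flagged  : dict mapping task -> set of IDs flagged for that task
    """
    rows = []
    for task in sorted(eligible):
        elig = eligible[task]
        flag = flagged.get(task, set())
        # Flagged objects must be a subset of the scored set,
        # and scored objects must be a subset of the master catalog.
        if not flag <= elig:
            raise ValueError(f"{task}: flagged objects outside the eligible set")
        if not elig <= master:
            raise ValueError(f"{task}: eligible objects outside the master set")
        rows.append({
            "task": task,
            "n_master": len(master),
            "n_eligible": len(elig),
            "n_flagged": len(flag),
            # Control group: scored and not flagged, from the same eligible set.
            "n_control": len(elig - flag),
            "anomaly_rate": len(flag) / len(elig) if elig else float("nan"),
        })
    # Unique objects flagged by at least one task.
    union_flagged = set().union(*flagged.values())
    return rows, union_flagged


# Toy example with made-up IDs (not real asteroid designations).
toy_master = set(range(20))
toy_eligible = {"diameter": set(range(10)), "spin": set(range(5, 12))}
toy_flagged = {"diameter": {1, 7}, "spin": {7, 11}}
toy_rows, toy_union = flow_table(toy_master, toy_eligible, toy_flagged)
for row in toy_rows:
    print(row)
print("unique flagged:", len(toy_union))
```

The `n_control` column is the denominator that Table 1 and Fig. 21 comparisons would use under this recommendation, one per target/model rather than a single master-catalog count.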
-
Anomaly scoring appears to use in-sample predictions (training+test combined) and global residual statistics, risking biased residual distributions and distorted anomaly counts, especially for flexible NNs and the overfit-looking obliquity GP (Sec. 2.3–2.4, Sec. 3.2, 3.3.1). Computing $z$-scores from residuals that include training points can artificially shrink the residual variance and change which points exceed $|S|>3$.
Recommendation: Revise Sec. 2.4/3.3.1 so anomaly scores are computed from strictly out-of-sample predictions: e.g., $k$-fold cross-validation with out-of-fold predictions for every eligible object, or a held-out test-only analysis (with the caveat that it reduces coverage). For NN scoring, compute mean/std of residuals from out-of-fold residuals only. Report (and optionally plot) training vs test residual distributions to show the magnitude of the bias avoided.
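The out-of-fold scheme suggested above can be sketched as follows. This is an illustration under stated assumptions: a one-feature least-squares line stands in for the manuscript's GPR/MLP, the data are synthetic, and all function names are hypothetical.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (stand-in for the real model)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def out_of_fold_scores(xs, ys, k=5, seed=0):
    """Standardized residuals where each object is predicted by a model
    that never saw it during fitting."""
    n = len(xs)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # each index lands in exactly one fold
    resid = [None] * n
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in held_out:
            resid[i] = ys[i] - (a + b * xs[i])
    # Mean and standard deviation come from out-of-fold residuals only.
    mu = statistics.fmean(resid)
    sd = statistics.stdev(resid)
    return [(r - mu) / sd for r in resid]


# Synthetic data with one injected outlier at index 0.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]
ys[0] += 15.0
scores = out_of_fold_scores(xs, ys)
flagged_idx = [i for i, s in enumerate(scores) if abs(s) > 3]
print("flagged indices:", flagged_idx)
```

Every eligible object receives exactly one score, so coverage is not reduced relative to the in-sample approach. Note that a plain mean and standard deviation remain sensitive to the outliers themselves; a robust scale estimate would be a natural variant.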
-
Lack of quantitative predictive-skill evaluation and uncertainty calibration prevents assessing whether residuals reflect conditional outliers or simply model inadequacy; this is most acute for spin period and obliquity (Sec. 3.2.1–3.2.2). The manuscript relies heavily on qualitative plots and kernel parameters, while acknowledging weak/mean-regressing behavior for spin and pathological/overfit behavior for obliquity; in such regimes, “anomalies” can reduce to extremes of the marginal distribution (spin) or be suppressed by misestimated uncertainty (obliquity).
Recommendation: Augment Sec. 3.2 with standard out-of-sample metrics for each model–target pair (RMSE/MAE in physical units and/or log-units, $R^2$, correlation). Include simple baselines (e.g., ridge/linear regression, random forest) to contextualize whether GPR/MLP add value. For GPR, add calibration/coverage checks: fraction of truths within nominal $1\sigma/2\sigma$ intervals. Use these results in Sec. 3.3–3.4 to gate interpretation: if a target’s model has near-zero conditional skill, treat its “anomalies” as low-confidence and separate them from the main physical conclusions.
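The metrics and coverage checks requested above are simple to compute once held-out predictions and predictive standard deviations are available. A minimal sketch, assuming hypothetical inputs `y_true`, `y_pred`, and `y_std` (the last being the GP predictive standard deviation on held-out objects); the toy numbers are made up:

```python
import math


def skill_and_coverage(y_true, y_pred, y_std):
    """Out-of-sample skill metrics plus empirical 1-sigma / 2-sigma coverage."""
    n = len(y_true)
    resid = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in resid)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(r) for r in resid) / n,
        "r2": 1.0 - ss_res / ss_tot,
        # Fraction of truths inside the nominal predictive intervals.
        # For a calibrated Gaussian predictive these should be close to
        # 0.683 and 0.954 respectively.
        "coverage_1sigma": sum(abs(r) <= s for r, s in zip(resid, y_std)) / n,
        "coverage_2sigma": sum(abs(r) <= 2 * s for r, s in zip(resid, y_std)) / n,
    }


toy_metrics = skill_and_coverage(
    y_true=[1.0, 2.0, 3.0, 4.0],
    y_pred=[1.5, 2.0, 2.0, 4.5],
    y_std=[1.0, 1.0, 0.6, 1.0],
)
print(toy_metrics)
```

Coverage well above nominal would indicate inflated predictive uncertainty (suppressing anomalies), while coverage well below nominal would indicate overconfidence (inflating them).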
-
The fixed $|S|>3$ threshold is not statistically justified given (i) large sample sizes, (ii) heavy-tailed/heteroscedastic residuals, and (iii) multiple testing across 3 properties $\times$ 2 model classes (Sec. 2.4, 3.3.1). For NN scores, global $z$-scoring ignores heteroscedasticity (visible in residual structure for diameter), and outliers can inflate the residual std, changing the effective threshold.
Recommendation: In Sec. 3.3.1, empirically characterize standardized-residual distributions per model/target (histograms + QQ plots; tail behavior; heteroscedasticity vs predicted value). Provide a sensitivity analysis of anomaly counts and key Table 1 statistics versus threshold (e.g., 2.5/3/3.5/4). Consider a multiple-testing-aware framing (expected false positives under a null; FDR control), or at minimum report expected vs observed $>3\sigma$ exceedances under a standard-normal assumption. For NNs, consider heteroscedastic-aware alternatives: bin-wise residual scaling, a second-stage variance model, quantile regression, or conformal prediction intervals, so “$|S|>3$” has clearer meaning.
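For the expected-versus-observed comparison, the null expectation follows directly from the two-sided normal tail, $P(|Z|>t) = \mathrm{erfc}(t/\sqrt{2})$. The sketch below uses the approximate eligible counts quoted in this report; the function names are hypothetical, and the standard-normal null is itself an assumption that the QQ plots would need to support.

```python
import math


def two_sided_tail(threshold):
    """P(|Z| > threshold) for a standard normal Z."""
    return math.erfc(threshold / math.sqrt(2.0))


def expected_exceedances(n_scored, threshold=3.0):
    """Expected number of |S| > threshold flags if scores were standard normal."""
    return n_scored * two_sided_tail(threshold)


# Approximate eligible counts per target, as quoted in this report.
n_eligible = {"diameter": 10340, "spin": 6396, "obliquity": 1626}

for t in (2.5, 3.0, 3.5, 4.0):
    per_target = {k: expected_exceedances(n, t) for k, n in n_eligible.items()}
    summary = ", ".join(f"{k}: {v:.1f}" for k, v in per_target.items())
    print(f"threshold {t}: {summary}")
```

At $|S|>3$ this gives roughly 28 expected flags for the diameter sample alone under the null, which is the scale against which the observed counts per model/target would need to be judged.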
-
Selection effects, missing-not-at-random measurement processes, and heterogeneous uncertainties across catalogs are likely to dominate both training and the anomalous set, but are not analyzed systematically (Sec. 2.1–2.2, 3.1, 3.3.2, 3.4). The usable training sizes ($\sim 10$k diameter; $\sim 6$k spin; $\sim 1.6$k obliquity) are tiny relative to the $1.7$M master list; measured spin/obliquity and even taxonomic type are strongly biased toward larger/brighter/better-observed objects. The finding that anomalies are “much larger” may therefore largely reflect measurement availability and/or model extrapolation rather than a distinct physical population.
Recommendation: Add a dedicated subsection (Sec. 3.1 or new Sec. 3.1.1) quantifying completeness and selection: for each property (diameter/spin/obliquity/age/type), report fraction available in the master set; show distributions of $(a,e,i)$, $H$/magnitude proxy if available, and diameter for (i) master, (ii) eligible per task, and (iii) flagged anomalies. If possible, summarize typical measurement uncertainties and any quality flags per source, and test whether anomalies are overrepresented in specific catalogs/surveys or low-quality subsets (Sec. 3.3.2). Temper Sec. 3.4/Conclusions to explicitly state that anomalies are defined within biased, inhomogeneous measured subsets.
-
The obliquity target is not clearly defined, and the GP fit to it appears numerically pathological (Sec. 2.2–2.4; Figures 9 and 19; Sec. 3.2.1, 3.3.1). Plots suggest the obliquity target spans $\sim [-1,1]$, which is inconsistent with an angle in degrees or radians, and the optimized GP kernel shows an extremely small length scale and near-zero noise ($10^{-10}$), suggestive of overfitting and/or optimizer boundary behavior. This makes both predictions and anomaly scoring for obliquity unreliable.
Recommendation: First, define the obliquity target precisely in Sec. 2.2 (units, range; if it is $\cos(\epsilon)$ or a normalized quantity, rename it accordingly and update interpretation and formulas). Second, report GP hyperparameter bounds, optimizer restarts, and whether parameters hit bounds (Sec. 2.3.1, 3.2.1). Add regularization: enforce a noise floor, adjust bounds, consider Matérn kernels, and evaluate via cross-validation. Given the tiny sample ($1{,}626$; Sec. 3.1), consider presenting obliquity as exploratory/methodological only, or clearly flag obliquity anomalies as highly tentative (Sec. 3.3–3.4).
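One way the noise floor and the bound check could look is sketched below. It assumes scikit-learn, which may not be the library the authors used; the kernel form, the bound values, and the helper names are illustrative choices rather than a prescription, and the data are synthetic.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel


def fit_gp_with_noise_floor(X, y, noise_floor=1e-3, n_restarts=5, seed=0):
    """Fit a Matern GP whose white-noise level cannot fall below noise_floor."""
    kernel = (
        ConstantKernel(1.0, constant_value_bounds=(1e-2, 1e2))
        * Matern(length_scale=1.0, length_scale_bounds=(1e-1, 1e2), nu=1.5)
        + WhiteKernel(noise_level=1e-1, noise_level_bounds=(noise_floor, 1e1))
    )
    gpr = GaussianProcessRegressor(
        kernel=kernel, n_restarts_optimizer=n_restarts, random_state=seed
    )
    return gpr.fit(X, y)


def hyperparameters_at_bounds(gpr, tol=1e-3):
    """Names of fitted hyperparameters within tol (log space) of a bound."""
    fitted = gpr.kernel_
    names = [h.name for h in fitted.hyperparameters if not h.fixed]
    hits = []
    # kernel_.theta and kernel_.bounds are both log-transformed.
    for name, theta, (lo, hi) in zip(names, fitted.theta, fitted.bounds):
        if theta - lo < tol or hi - theta < tol:
            hits.append(name)
    return hits


# Synthetic stand-in data (not asteroid measurements).
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1.0, 1.0, size=(80, 2))
y_demo = np.sin(3.0 * X_demo[:, 0]) + 0.1 * rng.normal(size=80)

gpr_demo = fit_gp_with_noise_floor(X_demo, y_demo)
fitted_noise = gpr_demo.kernel_.k2.noise_level  # WhiteKernel is the second summand
bound_hits = hyperparameters_at_bounds(gpr_demo)
print("fitted kernel:", gpr_demo.kernel_)
print("noise level:", fitted_noise, "| at bounds:", bound_hits)
```

A non-empty `bound_hits` list is the kind of diagnostic that would be worth reporting in Sec. 3.2.1, since a parameter pinned at a bound suggests the optimum is set by the bound rather than by the data.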
-
Physical interpretation currently outpaces the demonstrated inference: claims that the anomalous set is a distinct/primordial population are plausible but not uniquely supported, and alternative explanations (model extrapolation at large diameters; known large bodies/family parents; catalog systematics) are not ruled out (Abstract; Sec. 3.3.2–3.4; Conclusions). In particular, predicting diameter from orbit/age has limited causal grounding; “diameter anomalies” may reflect dataset composition and extrapolation failures rather than unusual physics.
Recommendation: Reframe conclusions to clearly separate empirical statements (conditional model residual outliers; flagged objects being large with low $e$ and low $i$) from origin hypotheses (primordial survivors). Add targeted cross-checks in Sec. 3.3.2: (i) stratify or match-control by diameter (and/or brightness proxy) to see whether anomaly status adds information beyond being large; (ii) inspect model behavior at the large-diameter end (are these just underpredicted extremes due to training imbalance?); (iii) cross-match the top anomalies with known large asteroids/dwarf planets/family parent bodies and note whether flags are expected; (iv) provide a small table of exemplar anomalies (IDs, measured vs predicted, score, data sources) and validate a few against literature.
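Cross-check (i) could be implemented as a simple 1:1 matching in log-diameter. The sketch below is one possible greedy scheme under stated assumptions; the function name, the object IDs, and the diameters are hypothetical, and caliper limits or matching on additional covariates would be natural extensions.

```python
import math


def match_controls(flagged, pool):
    """Greedy 1:1 nearest-neighbour matching in log10(diameter), without replacement.

    flagged : dict mapping object ID -> diameter (km) for flagged objects
    pool    : dict mapping object ID -> diameter (km) for scored, unflagged objects
    """
    available = dict(pool)
    pairs = {}
    # Match the largest flagged objects first, since they have the fewest
    # comparable controls.
    for fid, diam in sorted(flagged.items(), key=lambda kv: -kv[1]):
        if not available:
            break
        best = min(
            available,
            key=lambda cid: abs(math.log10(available[cid]) - math.log10(diam)),
        )
        pairs[fid] = best
        del available[best]  # each control is used at most once
    return pairs


# Made-up IDs and diameters for illustration.
toy_flagged_diam = {"A": 100.0, "B": 10.0}
toy_pool_diam = {"c1": 90.0, "c2": 12.0, "c3": 1.0, "c4": 95.0}
toy_pairs = match_controls(toy_flagged_diam, toy_pool_diam)
print(toy_pairs)
```

If the orbital differences between flagged objects and their diameter-matched controls vanish, the Table 1 contrasts would be attributable to size alone.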
-
Family membership is excluded as a predictor due to high cardinality (Sec. 2.2), but family context is relevant for both “age” interpretation and for whether anomalies simply trace family-specific trends or parent bodies (Sec. 3.3.2, 3.4). Without quantitative post hoc analysis, it is unclear whether the flagged set is dominated by a few families/background populations.
Recommendation: In Sec. 3.3.2/3.4, quantify family membership in anomalies vs matched controls: over/under-representation of major families, fraction of “background,” and whether anomalies cluster in specific families or dynamical regions. If feasible, include a coarse family encoding (largest-$N$ families + ‘other’, or embedding/target encoding) as an ablation in Sec. 3.2 and report the impact on predictive skill and anomaly lists.
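Over-representation of a family among flagged objects can be quantified with an enrichment ratio and a hypergeometric upper-tail probability. The following is a minimal sketch with hypothetical names and toy labels; with many families tested, the resulting p-values would need a multiple-testing correction.

```python
import math
from collections import Counter


def hypergeom_upper_tail(k, n_draw, n_success, n_total):
    """P(X >= k) when n_draw objects are drawn without replacement from
    n_total objects, n_success of which belong to the family."""
    top = min(n_draw, n_success)
    total = math.comb(n_total, n_draw)
    hits = sum(
        math.comb(n_success, j) * math.comb(n_total - n_success, n_draw - j)
        for j in range(k, top + 1)
    )
    return hits / total


def family_enrichment(family_of, flagged_ids):
    """Per-family counts, enrichment ratio, and over-representation p-value.

    family_of   : dict mapping object ID -> family label, for all eligible objects
    flagged_ids : set of flagged IDs (a subset of family_of's keys)
    """
    if not flagged_ids:
        raise ValueError("no flagged objects")
    n_total, n_draw = len(family_of), len(flagged_ids)
    in_eligible = Counter(family_of.values())
    in_flagged = Counter(family_of[i] for i in flagged_ids)
    out = {}
    for fam, n_fam in in_eligible.items():
        k = in_flagged.get(fam, 0)
        out[fam] = {
            "n_eligible": n_fam,
            "n_flagged": k,
            # Ratio of the family's share among flagged to its share among eligible.
            "enrichment": (k / n_draw) / (n_fam / n_total),
            "p_over": hypergeom_upper_tail(k, n_draw, n_fam, n_total),
        }
    return out


# Toy labels: 4 objects in a hypothetical family, 6 background objects.
toy_family_of = {i: "fam_A" for i in range(4)}
toy_family_of.update({i: "background" for i in range(4, 10)})
toy_enrichment = family_enrichment(toy_family_of, flagged_ids={0, 1, 2})
print(toy_enrichment)
```

Running the same calculation on diameter-matched controls instead of the full eligible set would separate family clustering from the size effect discussed earlier.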