-
Internal inconsistencies in sample size and feature dimensionality undermine interpretability and reproducibility. Methods state a final analytical cohort of $N=31$ and MD extracted from 82 regions (Sec. 2.1–2.4.3; Sec. 2.2), whereas Results and figure captions refer to $N=33$ (Sec. 3; Fig. 1), and Sec. 3.2 reports modeling with MD from 24 regions. The sex distribution is also inconsistent between Methods and Fig. 1. These discrepancies directly affect model dimensionality ($p$), regularization behavior, and any claims about selected regions/interactions.
Recommendation: Provide a single, definitive accounting of: (i) $N$ per modality (behavior, DNAmAge, DTI), (ii) $N$ used for CAI/PCA, (iii) $N$ used for each model (static ElasticNet, static RF, interaction ElasticNet, PDP analyses), and (iv) exclusions with reasons (QC failures, missing files, aborted behavior), ideally as a table or CONSORT-style flow diagram in Sec. 2.4.3. Harmonize $N$ and sex counts/percentages across Sec. 2–3 and figure captions. Explicitly state the exact number of MD regions used in each analysis stage, and justify any reduction from 82 to 24 (QC, missingness, atlas subset, variance filtering), including a list of retained regions (Appendix is fine).
-
DTI acquisition, preprocessing, atlas registration, and MD extraction are under-specified, preventing evaluation of data quality and anatomical validity (Sec. 2.2). Key details are missing (scanner/field strength; TR/TE; voxel size; $b$-values; number of diffusion directions; distortion/motion/eddy correction; tensor fitting method; registration steps; QC; partial volume mitigation; hemisphere handling). Without these, it is difficult to interpret MD as “microstructural integrity” or to assess potential artifacts driving region effects.
Recommendation: Expand Sec. 2.2 substantially (or add an Appendix) to include full acquisition parameters and preprocessing steps (skull stripping, motion/eddy/susceptibility correction, tensor fitting algorithm/software, registration pipeline to atlas/template, QC criteria and exclusion counts). Describe how atlas labels were transformed to individual space, whether ROI erosion/thresholding was used to reduce boundary contamination/CSF partial volume, and whether hemispheres were averaged. Report MD units/scaling and any normalization across subjects.
-
CAI definition/interpretation is internally inconsistent and construct validity is not sufficiently justified. Methods state PC1 loadings were intended to be negative for all four metrics so higher CAI indicates “better adaptability” (Sec. 2.5.2), but Results show mixed-sign loadings and interpret CAI as a trade-off axis (high CAI = better long-term but worse short-term; Sec. 3.1; Table 1). Moreover, it is not established that a single scalar PC1 is the most appropriate outcome vs a 2D (PC1/PC2) characterization or direct modeling of the four metrics.
Recommendation: Rewrite Sec. 2.5.2 and Sec. 3.1 to match the actual CAI used: define CAI as the signed PC1 score (explicitly state whether PC1 was multiplied by $-1$) and whether it is a “trade-off/strategy axis” vs a monotone “adaptability” score. Report PCA details ($N$ used; eigenvalues; variance explained by PC1/PC2; full loading matrix at least for PC1–PC2; scree plot/biplot in supplement). Provide a brief construct-validity argument for why PC1 is the target outcome, and consider adding a robustness analysis: (i) predict PC1 and PC2 separately, or (ii) predict the four behavioral metrics via multivariate/multi-task regression, or (iii) define a priori composites for “short-term flexibility” (Phase 2) and “long-term memory/stability” (Phase 3) and show conclusions are consistent.
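To make the requested PCA reporting concrete, the following is a minimal sketch, assuming scikit-learn and synthetic data in place of the four behavioral metrics. The metric names, the reference metric used to fix the sign of PC1, and the sample size are hypothetical placeholders, not values taken from the manuscript.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)

# Hypothetical metric names; the manuscript's actual metrics should be used.
metrics = ["phase2_errors", "phase2_latency", "phase3_errors", "phase3_retention"]
behavior = rng.normal(size=(31, len(metrics)))  # synthetic stand-in data

# Standardize, then fit a full PCA so every component can be reported.
z = StandardScaler().fit_transform(behavior)
pca = PCA().fit(z)

eigenvalues = pca.explained_variance_            # one eigenvalue per component
variance_ratio = pca.explained_variance_ratio_   # fraction of variance per component
loadings = pca.components_.T.copy()              # rows = metrics, columns = PCs
scores = pca.transform(z)

# The sign of a principal component is arbitrary. State the convention:
# here (an assumption) PC1 is oriented so the reference metric loads >= 0.
ref = metrics.index("phase3_retention")
flip = 1.0 if loadings[ref, 0] >= 0 else -1.0
loadings[:, 0] *= flip
cai = flip * scores[:, 0]

# Print the quantities a reader needs to interpret CAI.
for k in range(2):
    print(f"PC{k + 1}: eigenvalue={eigenvalues[k]:.3f}, "
          f"variance explained={variance_ratio[k]:.1%}")
for name, row in zip(metrics, loadings):
    print(f"{name:>18s}  PC1={row[0]:+.3f}  PC2={row[1]:+.3f}")
```

Reporting the flip explicitly lets readers check whether mixed-sign loadings reflect a genuine trade-off axis or only an orientation choice.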
-
Cross-validation pipeline and potential information leakage are not sufficiently specified. It is unclear whether CAI PCA, feature standardization, and interaction-term construction were performed globally once or nested within the outer LOOCV folds (Sec. 2.5–2.7). If PCA/scaling were computed on the full dataset prior to CV, this can bias generalization estimates and feature selection, especially with $N\approx 31$ and high-dimensional predictors.
Recommendation: In Sec. 2.5–2.7, explicitly state the exact order of operations inside the outer LOOCV loop. Best practice is: within each outer training fold, fit PCA for CAI (if CAI is treated as derived from data), compute scaling parameters, construct DNAmAge$\times$MD interactions, tune hyperparameters via inner CV, fit the final model, then evaluate on the held-out subject. If CAI PCA was computed once on the full cohort, acknowledge this as a limitation and add a sensitivity analysis recomputing CAI within folds (or show empirically that results are unchanged). Confirm all preprocessing is recomputed within training folds to avoid leakage.
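The ordering requested above can be sketched as follows. This is an illustration under stated assumptions, not a reconstruction of the authors' pipeline: it uses scikit-learn, synthetic data, toy dimensions, and an arbitrary hyperparameter grid. The helper `add_interactions`, the sign convention for PC1, and the inner 5-fold split are all hypothetical choices.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, KFold, LeaveOneOut
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

rng = np.random.default_rng(0)
n, n_md = 31, 8                        # toy sizes, not the manuscript's
behavior = rng.normal(size=(n, 4))     # synthetic behavioral metrics
dnam_age = rng.normal(size=(n, 1))     # synthetic DNAmAge
md = rng.normal(size=(n, n_md))        # synthetic regional MD
X = np.hstack([dnam_age, md])          # column 0 = DNAmAge, the rest = MD


def add_interactions(z):
    """Append DNAmAge x MD products to already-scaled features."""
    return np.hstack([z, z[:, [0]] * z[:, 1:]])


def make_model():
    # Scaling and interaction construction live inside the pipeline, so they
    # are refit on every training split (outer and inner).
    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("interact", FunctionTransformer(add_interactions)),
        ("enet", ElasticNet(max_iter=10_000)),
    ])
    grid = {"enet__alpha": [0.1, 1.0], "enet__l1_ratio": [0.2, 0.8]}
    inner = KFold(n_splits=5, shuffle=True, random_state=0)
    return GridSearchCV(pipe, grid, cv=inner, scoring="neg_mean_squared_error")


y_true, y_pred = np.empty(n), np.empty(n)
for train, test in LeaveOneOut().split(X):
    # Derive CAI from training subjects only, then project the held-out one.
    b_scaler = StandardScaler().fit(behavior[train])
    pca = PCA(n_components=1).fit(b_scaler.transform(behavior[train]))
    # PC signs can differ between folds; orient PC1 by the first metric.
    sign = 1.0 if pca.components_[0, 0] >= 0 else -1.0
    cai_train = sign * pca.transform(b_scaler.transform(behavior[train]))[:, 0]
    cai_test = sign * pca.transform(b_scaler.transform(behavior[test]))[:, 0]

    model = make_model().fit(X[train], cai_train)  # inner CV tunes, then refits
    y_true[test] = cai_test
    y_pred[test] = model.predict(X[test])

# Pooled out-of-sample R^2 over all held-out subjects.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
loocv_r2 = 1.0 - ss_res / ss_tot
print(f"LOOCV R^2 = {loocv_r2:.3f}")
```

Comparing this estimate with one obtained from a globally computed CAI and global scaling would serve as the sensitivity analysis suggested above.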
-
Inference and evidential framing for DNAmAge$\times$MD interactions are currently not supportable as written. The interaction model is a $p\gg N$ search (e.g., demographics + 82 MD main effects + 82 DNAmAge$\times$MD interactions; Sec. 2.7.1), yet selected non-zero ElasticNet coefficients are described as “statistically significant” (Sec. 2.7.1; Sec. 3.3; Table 2) without a defined hypothesis-testing framework, multiple-comparisons control, or coefficient stability assessment. Additionally, the manuscript does not clearly report the cross-validated performance of the interaction-augmented model; if CV $R^2$ remains low/negative, strong mechanistic claims are not warranted.
Recommendation: Replace “statistically significant” with “selected/non-zero under ElasticNet” unless a formal inferential procedure is added. Report full out-of-sample performance for the interaction model (LOOCV $R^2$ plus MAE/RMSE) and provide uncertainty (e.g., bootstrap over subjects or repeated CV where feasible). Add at least one robustness/inference layer appropriate for high-dimensional selection: (i) permutation testing of the entire modeling pipeline to assess whether performance exceeds chance; (ii) stability selection / bootstrap inclusion frequencies for interactions (how often each DNAmAge$\times$MD term is selected across resamples); and/or (iii) an a priori ROI interaction test set with FDR correction. Temper Sec. 3.3 and Sec. 4 to present interaction findings as exploratory/hypothesis-generating pending replication.
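Options (i) and (ii) can be illustrated with a short sketch, assuming scikit-learn and synthetic data in which only the first feature carries signal. The fixed ElasticNet hyperparameters, the 5-fold split, and the resample counts are arbitrary assumptions. In a real analysis the permuted pipeline should include hyperparameter tuning and every other fitted step.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import KFold, permutation_test_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n, p = 31, 20                                       # toy sizes
X = rng.normal(size=(n, p))
y = 1.5 * X[:, 0] + rng.normal(scale=0.5, size=n)   # only feature 0 matters

# Hyperparameters are fixed here purely for brevity (an assumption).
model = make_pipeline(
    StandardScaler(),
    ElasticNet(alpha=0.2, l1_ratio=0.5, max_iter=10_000),
)

# (i) Permutation test: refit the whole pipeline on label-shuffled data to
# build a null distribution for the cross-validated score.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
score, perm_scores, p_value = permutation_test_score(
    model, X, y, cv=cv, scoring="neg_mean_squared_error",
    n_permutations=200, random_state=0,
)
print(f"CV score = {score:.3f}, permutation p = {p_value:.4f}")

# (ii) Bootstrap inclusion frequencies: how often each coefficient is
# non-zero when subjects are resampled with replacement.
n_boot = 200
selected = np.zeros(p)
for _ in range(n_boot):
    idx = rng.integers(0, n, size=n)
    fitted = clone(model).fit(X[idx], y[idx])
    selected += fitted.named_steps["elasticnet"].coef_ != 0
inclusion_freq = selected / n_boot
print("Inclusion frequency per feature:", np.round(inclusion_freq, 2))
```

Terms with low inclusion frequency across resamples are better described as unstable selections than as findings.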
-
Random Forest methodology and interpretation (especially age-stratified PDPs) are under-specified and may overstate evidence for moderation. RF hyperparameters beyond n_estimators/max_features are not reported (Sec. 2.6.2; Sec. 3.2–3.3), and the stability of permutation importance is unclear. Moreover, splitting bats into “younger/older” by median DNAmAge for PDP comparisons yields small groups ($\approx15$ each) and can create unstable apparent differences; PDPs also assume feature independence and can be misleading when predictors are correlated (Sec. 2.7.2; Sec. 3.3; Fig. 12).
Recommendation: Report the complete RF setup (tuned ranges and final values for max_depth, min_samples_leaf, min_samples_split, bootstrap, random_state, etc.) and provide stability checks for feature importance across seeds/resamples (Appendix acceptable). For moderation claims, clarify whether PDPs were computed from a single global model or separate models per age group; report group $N$s and data density across MD ranges; and soften language to “suggestive.” Consider adding ICE curves or uncertainty bands, or using ALE plots. If the key claim is interaction/moderation, prioritize interaction-capable models (with stability/inference) over visual PDP group contrasts.
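As one possible form for the stability check and ICE curves, here is a minimal sketch assuming scikit-learn and synthetic data. The RF settings, the number of seeds, and the grid size are illustrative assumptions. Importance is computed in-sample here only to keep the example short; held-out importance would be preferable. ICE curves are computed by hand so that the averaging behind a PDP is explicit.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(2)
n, p = 31, 6                                        # toy sizes
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=n)   # only feature 0 matters

# Stability of permutation importance across random seeds.
seeds = list(range(10))
importances = np.empty((len(seeds), p))
for i, seed in enumerate(seeds):
    rf = RandomForestRegressor(
        n_estimators=200, min_samples_leaf=2, random_state=seed
    ).fit(X, y)
    result = permutation_importance(rf, X, y, n_repeats=20, random_state=seed)
    importances[i] = result.importances_mean

imp_mean = importances.mean(axis=0)
imp_sd = importances.std(axis=0, ddof=1)
top_feature_per_seed = importances.argmax(axis=1)
print("Mean importance:", np.round(imp_mean, 3))
print("SD across seeds:", np.round(imp_sd, 3))
print("Top-ranked feature per seed:", top_feature_per_seed)

# ICE curves for feature 0 from the last fitted forest: one curve per
# subject, obtained by overwriting that feature with each grid value.
grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 15)
ice = np.empty((n, grid.size))
for j, value in enumerate(grid):
    X_mod = X.copy()
    X_mod[:, 0] = value
    ice[:, j] = rf.predict(X_mod)

pdp = ice.mean(axis=0)                 # the PDP is the average ICE curve
ice_spread = ice.std(axis=0, ddof=1)   # between-subject spread at each grid point
```

Plotting the individual curves, or the PDP with `ice_spread` as a band, shows whether an apparent group difference exceeds between-subject variability.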
-
Anatomical interpretability is currently blocked because regions are referenced only by atlas indices (Regions 9, 22, 23; also 3, 4, 19) with no mapping to anatomical structures or tissue types (Sec. 2.2; Sec. 3.2–3.3). This prevents readers from assessing biological plausibility (e.g., hippocampal/striatal/frontal involvement) and connecting to cognitive aging literature.
Recommendation: Provide an atlas index$\rightarrow$anatomical label table (main text or Appendix) and identify whether each ROI is gray/white matter. In Sec. 3.2–3.3 and figure captions, refer to both index and anatomical name (e.g., “Region 22 (hippocampal formation, if applicable)”). Add a figure visualizing the implicated ROIs on the atlas/template. In Sec. 4, discuss these structures in relation to re-learning, perseveration, and known age-related microstructural changes.
-
Confounding, representativeness, and limitations are not comprehensively addressed given the small $N$ and multi-modal integration. Potential confounds include scan/batch effects, time lag between behavior and imaging, health/body condition, housing/handling, and colony-specific differences; MD is non-specific and DNAmAge measurement error/clock calibration is not described in enough detail (Sec. 2.3; Sec. 4).
Recommendation: Add a dedicated limitations paragraph/subsection in Sec. 4 addressing: small sample size and selection instability; $p\gg N$ interaction search; MD’s biological non-specificity and partial volume sensitivity; DNAmAge uncertainties and (if available) correlation with chronological/minimum age; possible batch/time-between-measures confounds; and generalizability across colonies/sex. Where possible, report key correlations (DNAmAge with any chronological estimates; DNAmAge with colony/sex; time between scan and behavior) and consider adding batch/time covariates or sensitivity analyses.