-
The central framing in the Abstract, Introduction, Results summaries, and Conclusions (Sec. 1, 3.5.3, 4) sometimes implies biological insights into epigenetic aging, brain morphology, and cognition in bats, yet all neuroimaging and voxel-wise inferential results are derived from simulated MRI/GM data, with no real neuroanatomical measurements (Sec. 2.3–2.4, 3.3–3.5). Phrases such as "investigating the intricate relationships" or suggesting that null GLM results "confirm" the pipeline’s ability to prevent false positives risk giving the impression that neuroanatomical aging patterns and brain–cognition mechanisms have already been empirically characterized, which exceeds what is currently demonstrated.
Recommendation: Systematically revise wording throughout the Abstract, Introduction (Sec. 1), Results summaries (Sec. 3.4–3.5), and Conclusions (Sec. 4) to clearly and consistently frame the study as methodological pipeline development and *simulation-based* validation. Explicitly distinguish between real components ($\rm DNAmAge$ and behavioral descriptions; Sec. 3.1–3.2) and simulated neuroimaging/statistical components (Sec. 3.3–3.5). Soften or remove language implying achieved mechanistic understanding of neuroanatomical aging or brain–cognition relationships, and in Sec. 4.2–4.3 clearly state that biological inferences will only be possible once real MRI data have been acquired and analyzed with this pipeline.
-
The derivation and validation of the $\rm DNAmAge$ measure, which is central to the study, are largely treated as a black box. The manuscript does not describe tissue source, methylation assay, normalization, clock type (species-specific vs. pan-mammalian), training/validation strategy, or performance metrics for the epigenetic clock in Egyptian fruit bats (Sec. 1, 2.1, 3.1, 4.1). This limits interpretability, reproducibility, and confidence in $\rm DNAmAge$ as a biologically meaningful aging axis.
Recommendation: Add a dedicated subsection to Methods (e.g., Sec. 2.5) detailing $\rm DNAmAge$ derivation: sample source (e.g., blood or skin), profiling platform, preprocessing/normalization, the clock model used (including references if previously published), training data, cross-validation or external validation performance (e.g., correlation with chronological age, median absolute error), and units/calibration. If the clock is unpublished, provide sufficient methodological detail to enable replication and justify its suitability for this species. Briefly summarize key aspects again in Sec. 3.1 and Sec. 4.1 when discussing $\rm DNAmAge$ distributions and implications.
-
The design and parameterization of the simulated neuroimaging and mediation data are only qualitatively described (Sec. 2.3–2.4, 3.3–3.5). Details such as image dimensions, voxel size, spatial covariance and smoothness of noise, the exact size and location of the age-sensitive GM ROI, effect sizes (beta coefficients, correlations) for the age–GM relationship, and the generative model for the simulated mediator (e.g., "Mean_GM_in_Age_ROI") are not specified. Without this information, readers cannot judge how realistic or challenging the simulations are, nor interpret why an embedded effect failed to survive correction.
Recommendation: Introduce a clear simulation protocol section (either as a new Sec. 2.6 or by expanding Sec. 2.3–2.4) that fully specifies: (a) the generation of synthetic $b=0$-based anatomical and GM images (including voxel size, number of slices, spatial noise structure, smoothing), (b) the construction of the age-related ROI (mask definition, location, volume), and the mapping from $\rm DNAmAge$ to GM values within that ROI (effect size, variance), and (c) the generative model for the mediator and outcome variables used in mediation (Sec. 2.4.3, 3.5.1). In Sec. 3.3–3.5, report standardized effect sizes (e.g., partial $r$, Cohen’s $d$) and summarize how strong the simulated signal is relative to noise. This will clarify the realism and difficulty of the detection problem.
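One compact way to summarize how strong the simulated signal is relative to noise is to report the correlation implied by the generative parameters. The sketch below assumes a simple linear model for the ROI, GM = beta * DNAmAge + noise with noise independent of age; the manuscript does not confirm this model, and the function name and the numeric values are hypothetical.

```python
import math


def implied_correlation(beta, sd_age, sd_noise):
    """Pearson r implied by GM = beta * age + noise (noise independent of age)."""
    signal_sd = abs(beta) * sd_age                      # SD of the age-driven part
    total_sd = math.sqrt(signal_sd ** 2 + sd_noise ** 2)  # SD of GM overall
    return math.copysign(signal_sd / total_sd, beta)    # keep the sign of beta


# Hypothetical values, chosen only to illustrate the calculation.
r_example = implied_correlation(beta=-0.01, sd_age=2.0, sd_noise=0.05)
print(f"implied r = {r_example:.3f}")
```

Reporting this value next to the raw beta and noise variance lets readers compare the embedded effect against effect sizes they would expect in real data.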
-
The pipeline’s statistical power and detection limits are not quantitatively explored. Although an age–GM effect is intentionally embedded in the simulated ROI, voxel-wise tests do not yield TFCE-corrected significance at $N=33$ (Sec. 3.4.1), yet this is discussed mainly qualitatively. Without systematic evaluation of power across effect sizes and ROI characteristics, it is unclear whether the null findings reflect conservative thresholds, small sample size, weak imposed effects, or aspects of the implementation, which weakens the claim of validation.
Recommendation: Augment the simulation study with a sensitivity/power analysis (Sec. 2.4.4, 3.4–3.5). For example, generate multiple simulated datasets varying the magnitude and spatial extent of the age–GM effect and report the proportion of runs in which voxel-wise TFCE-corrected statistics reach significance at $N=33$. Similarly, for mediation (Sec. 3.5), vary the strengths of the $X$–$M$ and $M$–$Y$ paths and report power to detect the indirect effect using the chosen bootstrap/resampling scheme. Present these findings (e.g., as an additional figure or Appendix) and discuss in Sec. 3.4–3.5 and Sec. 4.2–4.3 what range of effect sizes the current pipeline can reasonably detect and how sample size or analysis parameters would need to change for future real-data studies.
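The requested power analysis could be prototyped along the following lines. This is a deliberately simplified sketch: it tests a single ROI-mean GM value against age with a permutation test on the Pearson correlation, not a voxel-wise TFCE analysis, and it assumes standard-normal age and noise. All function names, defaults, and the data-generating model are assumptions for illustration, not the authors' pipeline.

```python
import math
import random


def standardize(values):
    """Return z-scores (population SD) of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


def permutation_p(x, y, n_perm, rng):
    """Two-sided permutation p-value for the Pearson correlation of x and y."""
    zx, zy = standardize(x), standardize(y)
    n = len(zx)
    observed = abs(sum(a * b for a, b in zip(zx, zy)) / n)
    exceed = 0
    for _ in range(n_perm):
        shuffled = rng.sample(zy, n)  # random permutation of y
        r = abs(sum(a * b for a, b in zip(zx, shuffled)) / n)
        if r >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)


def estimate_power(true_r, n=33, n_sims=100, n_perm=199, alpha=0.05, seed=0):
    """Fraction of simulated datasets in which the permutation test rejects."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        age = [rng.gauss(0, 1) for _ in range(n)]
        # GM constructed so that its population correlation with age is true_r.
        gm = [true_r * a + math.sqrt(1 - true_r ** 2) * rng.gauss(0, 1) for a in age]
        if permutation_p(age, gm, n_perm, rng) <= alpha:
            hits += 1
    return hits / n_sims
```

Calling `estimate_power` over a grid of `true_r` values (and, in a fuller version, ROI sizes and sample sizes) would give the power curve asked for above; `true_r = 0` gives an empirical check of the false-positive rate.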
-
There are inconsistencies between the statistical methods described in Methods and what appears to have been implemented in Results for voxel-wise inference (Sec. 2.4 vs. Sec. 3.4). Sec. 2.4 states that FSL’s "randomise" with TFCE and $5000$ permutations was used, whereas Sec. 3.4 mentions "permuted_ols" with $100$ permutations and does not clearly state whether TFCE was applied. This discrepancy obscures which pipeline configuration was actually validated and limits reproducibility.
Recommendation: Harmonize Sec. 2.4 and Sec. 3.4 by clearly stating the *actual* software, permutation counts, and multiple-comparison correction methods used for each GLM. If demonstration runs used a different implementation (e.g., Python "permuted_ols" with $100$ permutations and cluster-based thresholds) than the intended final pipeline in FSL, explicitly acknowledge this, justify the choice (e.g., computational constraints), and clarify how closely it approximates the target pipeline. Where feasible, rerun key simulations with the intended settings (e.g., $\sim 5000$ permutations, TFCE) and summarize these configurations in a concise table listing design, covariates, permutations, and correction for each analysis (Sec. 2.4/Sec. 3.4).
-
Reproducibility of the overall framework is limited by missing lower-level implementation details and lack of accessible code (Sec. 2–3). While high-level steps are described (e.g., BET/FAST, ANTs template, behavioral metrics), important practical information is absent: specific software versions, key parameter values (e.g., BET thresholds, FAST priors, ANTs registration options, smoothing kernel justification relative to bat brain size), precise handling of missing behavioral data, and scripts for GLMs and mediation (Sec. 2.1–2.4, 3.2–3.5).
Recommendation: Expand Methods (Sec. 2.1–2.4) with implementation details sufficient for independent reproduction. List versions of FSL, ANTs, and the R/Python packages used; provide key parameter settings for BET, FAST, template construction, registration, modulation, and smoothing (including rationale for $\rm FWHM = 4~mm$ in the bat brain; Sec. 2.3.3). Explicitly describe how missing behavioral data were handled for each metric and analysis (Sec. 2.2.2, 3.2, 3.4–3.5), including final $N$s. For mediation (Sec. 2.4.3), specify software, function calls, bootstrap settings, and covariates. Ideally, release analysis scripts in a public repository and cite it in Sec. 2 and Sec. 4.1; at minimum, include pseudocode or a pipeline diagram summarizing key steps and decision points.
-
Despite having real $\rm DNAmAge$ and behavioral data for 33 bats, the Results provide only descriptive summaries and no inferential analyses of age–cognition relationships (Sec. 3.1–3.2). Given that one of the stated aims is to link epigenetic aging to cognitive performance, the absence of even basic statistical tests on the real data limits the immediate scientific contribution and leaves the empirical part of the study underutilized.
Recommendation: Include a set of straightforward inferential analyses in Sec. 3.2 (or a new Sec. 3.2.3) examining associations between $\rm DNAmAge$ and key cognitive metrics. For example, fit simple linear models or correlations between $\rm DNAmAge$ and a small, pre-specified subset of learning/STM/LTM outcomes, adjusting for sex and origin colony as appropriate and applying multiple-comparison control. Clearly label these as exploratory/illustrative, separate them from simulated imaging analyses, and discuss in Sec. 4.2 how these preliminary findings (even if null) inform expectations for future multimodal work.
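For the multiple-comparison control mentioned above, a Holm step-down adjustment is one simple option when only a few outcomes are pre-specified. The sketch below is illustrative only: the outcome labels and raw p-values are hypothetical placeholders, not results from the manuscript, and the authors may prefer a different procedure (e.g., FDR).

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values for three pre-specified cognitive outcomes.
raw_p = {"learning": 0.01, "stm": 0.04, "ltm": 0.03}
adjusted_p = dict(zip(raw_p, holm_adjust(list(raw_p.values()))))
print(adjusted_p)
```

Reporting both raw and adjusted p-values, together with effect sizes, keeps the exploratory nature of these tests transparent.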
-
Figure 1 lacks visible panel labels (A/B) despite caption references, omits group sample sizes, does not display statistical comparisons to support claims of comparable distributions, and suffers from low resolution and small typography.
Recommendation: Add clear (A)/(B) panel annotations matching the caption, display group sample sizes on the figure, overlay or annotate group comparison results (e.g., t-test or Mann–Whitney $U$ with effect size and $95\%$ CI), and export at higher resolution with larger fonts and line widths.
-
Figure 3's caption claims a mask overlay QC, but the displayed panels lack a visible mask overlay; the figure also has low resolution, missing orientation markers (L/R, A/P, S/I), and no legend for overlay colors.
Recommendation: Add a high-contrast brain-mask overlay, export at $\geq 300$ dpi with larger fonts, include orientation labels in each panel, and provide a concise legend for overlay colors in the caption.
-
Mediation algebra inconsistency: Table 3 reports $a = -5.3417$ and $b = 0.0505$ but also reports $a b = 0.2931$ (positive). The product of the reported $a$ and $b$ is negative ($\approx -0.27$), so at least one of $b$, $a b$, or the sign of $a$ is wrong. The positive value also conflicts with the sign expected for the indirect effect and its CI given $a < 0$ and $b > 0$.
Recommendation: Recompute and correct Table 3 so that the indirect effect equals the product of the reported $a$ and $b$ coefficients (with consistent sign), and ensure the reported CI corresponds to that same indirect effect. If $b$ is actually negative, correct its sign (and associated CI/p-value) in both Table 3 and Figure 6.
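The arithmetic behind this comment can be checked directly from the values quoted from Table 3. The snippet below is only a consistency check on the reported numbers; the rounding tolerance is an assumption about four-decimal reporting.

```python
import math

# Coefficients as reported in Table 3 of the manuscript.
a = -5.3417
b = 0.0505
reported_ab = 0.2931

product = a * b
print(round(product, 4))  # -0.2698

# Tolerance of 5e-4 allows for rounding of a, b, and ab to four decimals.
consistent = math.isclose(product, reported_ab, abs_tol=5e-4)
print(consistent)  # False
```

The product differs from the reported indirect effect in both sign and magnitude, so a sign flip alone would not reconcile the two values.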
-
Methods §2.3.1 (page 4) states that three $b=0$ volumes were extracted and averaged, but Results §3.3 item 1 (page 7) states that the first $b=0$ volume was extracted as the representative anatomical image (3 vs 1).
Recommendation: Clarify whether these statements refer to different pipeline stages (e.g., averaging for preprocessing vs selecting one for visualization/registration), or correct the text so the number of $b=0$ volumes used is consistent across Methods and Results.
-
Permutation testing count is inconsistent: Methods §2.4.1 (page 5) specifies $5000$ permutations, while Results §3.4 (page 8) reports using $100$ permutations.
Recommendation: Reconcile the permutation count by updating either the Methods or Results to reflect the actual analysis configuration; if multiple runs were performed (e.g., pilot vs final), label them explicitly.
-
Bootstrap resample count for mediation is inconsistent: Methods §2.4.3 (page 5) specifies $10{,}000$ resamples, while Results §3.5.1 (page 9) specifies $5000$ resamples.
Recommendation: Confirm the resample count actually used for the reported mediation results and revise the Methods/Results to match; if both were run, specify which count corresponds to Table 3.
-
Mediation Table 3 algebra is inconsistent: Results §3.5.1 (page 9), Table 3 reports indirect effect $a b = 0.2931$, but the product of the reported $a$ and $b$ coefficients is $-0.2698$ ($a = -5.3417$, $b = 0.0505$).
Recommendation: Recompute and correct the indirect effect (and any downstream quantities that depend on it) or explain why the reported indirect effect is not computed as the product of the displayed $a$ and $b$ (e.g., different scaling/standardization or different coefficient definitions).