-
Manuscript genre and audience are not consistently defined: the text alternates between an empirical aging/brain/cognition study and a methodological failure-case report, which blurs what readers should take away and what standards of evidence apply (Sec. 1, Sec. 4.1–4.4).
Recommendation: Decide and signal the genre explicitly in Sec. 1 and Sec. 4: if this is primarily a methods/failure case study, state up front that no biological inference is attempted due to invalid downstream data, and restructure the Introduction/Conclusions to emphasize (i) the intended scientific question only insofar as it motivates pipeline requirements and validation criteria, and (ii) the deliverable as documented failure modes + guardrails/checklists. Alternatively, if aiming for an empirical paper, the behavioral extraction must be fixed and results re-run; the current version should not present inferential modeling outputs as results.
-
Behavioral feature-extraction failure—the central failure mode—is described too generically to be reproducible or broadly instructive. Readers cannot see what the Excel files actually look like, what $\texttt{read\_excel}$ returned, or which concrete discrepancies caused the “0 entries parsed” outcome (Sec. 2.2, Sec. 3.2.1).
Recommendation: Augment Sec. 2.2 and Sec. 3.2.1 with concrete diagnostics: (1) a table comparing assumed vs. actual sheet names and column headers (including whitespace/case/merged-cell artifacts); (2) an anonymized snippet (header row + 5–10 rows) from 1–2 representative files; (3) the exact $\texttt{pandas.read\_excel()}$ parameters used (engine, header, skiprows, dtype) plus the resulting column-name list printed by the script; (4) a mapping from each discrepancy to the observed failure signature (KeyError, empty dataframe, downstream constants). Also include at least one manual verification (e.g., hand-count events in one file) demonstrating that “0 entries” is a parsing artifact rather than absence of behavioral events.
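As an illustration of item (1), the assumed-vs-actual header comparison could be generated by a small helper like the one below. This is a minimal sketch, not the authors' code: the function names and the `Box_ID` and `Action` headers are hypothetical, and the `actual` list stands in for whatever column-name list the parsing script prints.

```python
import re

def normalize(name):
    """Collapse whitespace, strip, and lowercase a header for loose matching."""
    return re.sub(r"\s+", " ", str(name)).strip().lower()

def compare_headers(expected, actual):
    """Classify each expected column as 'exact', 'normalized', or 'missing'."""
    actual_by_norm = {}
    for col in actual:
        # Keep the first actual header seen for each normalized form.
        actual_by_norm.setdefault(normalize(col), col)
    report = {}
    for col in expected:
        if col in actual:
            report[col] = ("exact", col)
        elif normalize(col) in actual_by_norm:
            report[col] = ("normalized", actual_by_norm[normalize(col)])
        else:
            report[col] = ("missing", None)
    return report

# Hypothetical example: what the script assumes vs. what the sheet contains.
expected = ["Absolute_Time", "Box_ID", "Action"]
actual = ["absolute_time ", "Box_ID", "Unnamed: 2"]
report = compare_headers(expected, actual)
for col, (status, found) in report.items():
    print(f"{col!r}: {status} -> {found!r}")
```

A column reported as "normalized" points to a whitespace or case artifact, whereas "missing" suggests a shifted or multi-row header; each status can then be tied to its failure signature in the requested table.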
-
Operational definitions of behavioral metrics and edge-case handling are under-specified, which weakens the methodological value even if parsing were fixed. In particular, the meaning of “Absolute\_Time” (wall-clock vs. elapsed) and the phase start references are ambiguous, and the hard-coded $10{,}800\,\mathrm{s}$ ($3\,\mathrm{h}$) latency implies censoring/right-truncation that is not acknowledged as a modeling issue (Sec. 2.2, including the latency definition).
Recommendation: In Sec. 2.2, define each metric with unambiguous computation rules, units, and time origin: specify whether latency is (time of first correct entry $-$ phase start time) and where phase start time is obtained. Explicitly describe handling of: no-correct-entry trials (right-censoring vs imputation), repeated rapid entries, missing box IDs, invalid action codes, and non-monotonic timestamps. If $10{,}800\,\mathrm{s}$ is a censoring threshold, label it as such and state how censoring would be modeled (or, if not modeled, justify and discuss bias). Provide a single summary table of all metrics (names, phases, definitions, units, edge-case rules) and ensure names are consistent throughout (Sec. 2.2, Sec. 3.2.1, Sec. 3.3).
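To make the requested latency definition concrete, one possible formulation is sketched below. It assumes that timestamps and the phase start share one clock in seconds and that the $3\,\mathrm{h}$ value is a right-censoring time; both are assumptions the manuscript would need to confirm. The `Latency` type and function name are hypothetical.

```python
from dataclasses import dataclass

CENSOR_S = 10_800.0  # 3 h ceiling from the manuscript, treated here as a censoring time

@dataclass
class Latency:
    seconds: float
    censored: bool  # True when no correct entry occurred within the window

def first_correct_latency(phase_start, events, censor_s=CENSOR_S):
    """Latency = time of first correct entry minus phase start time.

    events: sequence of (timestamp_s, is_correct) pairs in recorded order.
    """
    times = [t for t, _ in events]
    # Reject non-monotonic timestamps instead of silently sorting them.
    if any(later < earlier for earlier, later in zip(times, times[1:])):
        raise ValueError("non-monotonic timestamps")
    for t, is_correct in events:
        if is_correct and t >= phase_start:
            elapsed = t - phase_start
            if elapsed <= censor_s:
                return Latency(elapsed, False)
            break  # first correct entry falls outside the observation window
    # No correct entry observed in time: report the ceiling, flagged as censored.
    return Latency(censor_s, True)

print(first_correct_latency(100.0, [(130.0, False), (160.0, True)]))
print(first_correct_latency(100.0, []))
```

Keeping the `censored` flag next to the value lets a downstream model distinguish a true $3\,\mathrm{h}$ latency from a trial with no correct entry, which a bare constant cannot do.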
-
Neuroimaging/TBV quantification is inconsistently described and scientifically fragile as written. Methods assume 4D DTI with multiple $b=0$ volumes and prior skull-stripping, while Results indicate the inputs were 3D preprocessed images; TBV is computed by counting voxels with intensity $> 0$, which is sensitive to nonzero background and imperfect skull stripping. There is also an internal inconsistency about whether TBV extraction failed (empty plots) or succeeded but was unusable due to behavior (Sec. 2.3, Sec. 3.2.2; Fig. 4 caption vs text).
Recommendation: Rewrite Sec. 2.3 to clearly separate initial assumptions from the actual data encountered: report actual NIfTI dimensionality, voxel sizes, and what preprocessing had already been applied (motion/eddy, skull stripping, registration), including tools if known. Then justify and QC the TBV approach: show background intensity distributions/histograms, confirm that nonbrain voxels are truly zero (or adopt a more defensible brain mask–based volume), and include visual overlays for a subset. In Sec. 3.2.2, report how many bats yield valid TBV values after QC, plus mean/SD/range, and explicitly reconcile why any TBV-related plots are empty (e.g., merge/filtering bug, NaNs, plotting code) if TBV exists.
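One inexpensive QC check on the intensity $> 0$ rule is to recompute the volume at a slightly higher threshold and compare. The sketch below is dependency-free and operates on a flat list that stands in for a flattened image array; the function names, toy intensities, and voxel volume are hypothetical and not taken from the manuscript.

```python
def volume_mm3(intensities, voxel_volume_mm3, threshold=0.0):
    """Volume of all voxels whose intensity is strictly above the threshold."""
    n_voxels = sum(1 for value in intensities if value > threshold)
    return n_voxels * voxel_volume_mm3

def threshold_sensitivity(intensities, voxel_volume_mm3, thresholds):
    """Volume at each threshold; a large spread suggests nonzero background."""
    return {t: volume_mm3(intensities, voxel_volume_mm3, t) for t in thresholds}

# Toy data: 4 brain voxels, 3 faint nonzero background voxels, 3 true zeros.
voxels = [100.0, 120.0, 95.0, 110.0, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0]
result = threshold_sensitivity(voxels, voxel_volume_mm3=0.125, thresholds=[0.0, 1.0])
for threshold, volume in result.items():
    print(f"threshold {threshold}: {volume} mm^3")
```

If the estimate changes materially between the two thresholds, the nonbrain voxels are not truly zero and a mask-based volume would be more defensible.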
-
Methods vs Results are partially conflated, and Sec. 2.4 appears truncated/misplaced (including a stray fragment). The regression formula is also repeated in an odd location (Sec. 3.1), making it unclear what was planned vs what was executed, and on which dependent variables (Sec. 2.4, Sec. 3.1, Sec. 3.3).
Recommendation: Reorganize to cleanly separate (i) intended modeling plan (Sec. 2.4) from (ii) what was actually run and why it is invalid (Sec. 3.3). Fix Sec. 2.4 truncation and present the full model specification once (display equation), clearly listing intended outcomes and assumptions. In Sec. 3.3 add a compact table per dependent variable: $N$, missingness, variance/unique values, censoring fraction, whether the model was fit, and whether outputs were invalid (NaN, singular fit, $R^2 = -\infty$). Remove the formula from Sec. 3.1 unless explicitly needed for Results narrative.
-
Statistical modeling is demonstrably invalid given constant/degenerate outcomes, but the manuscript does not turn this into explicit, automated “fail-fast” guardrails; additionally, there is at least one internal inconsistency suggesting a model where $\mathrm{DNAmAge}_{\mathrm{scaled}}$ predicts itself (“predict $\mathrm{DNAmAge}_{\mathrm{scaled}}$ reported a significant effect of $\mathrm{DNAmAge}_{\mathrm{scaled}}$”) (Sec. 3.3 vs Sec. 2.4).
Recommendation: First, correct the dependent-variable naming/error in Sec. 3.3 to match the model in Sec. 2.4 (or explicitly provide the alternative model if DNAmAge was ever an outcome). Second, add explicit pre-model validation criteria in Sec. 3.3/Sec. 4.4 (and ideally in code): checks for variance $>0$, minimum unique values, plausible ranges, missingness thresholds, and censoring proportion; if violated, abort model fitting and emit a structured error. Explicitly state that any apparent significance under constant outcomes is a red-flag artifact and should not be reported as a result.
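The pre-model checks could take a form similar to the sketch below. The exception class, function name, outcome name `latency_s`, and threshold values are all hypothetical placeholders; the authors would need to choose and justify their own criteria. Requiring at least two unique values already implies variance $> 0$.

```python
import math

class DegenerateOutcomeError(ValueError):
    """Raised when an outcome fails pre-model validation."""

def validate_outcome(name, values, min_unique=3, max_missing=0.2):
    """Return a summary dict, or raise before any model is fit."""
    n = len(values)
    if n == 0:
        raise DegenerateOutcomeError(f"{name}: no observations")
    # Treat None and NaN as missing.
    present = [v for v in values if v is not None and not math.isnan(v)]
    missing_frac = 1 - len(present) / n
    if missing_frac > max_missing:
        raise DegenerateOutcomeError(
            f"{name}: missing fraction {missing_frac:.2f} exceeds {max_missing}"
        )
    n_unique = len(set(present))
    if n_unique < min_unique:
        raise DegenerateOutcomeError(
            f"{name}: only {n_unique} unique value(s), need at least {min_unique}"
        )
    return {"n": len(present), "missing_frac": missing_frac, "n_unique": n_unique}

# A constant outcome, as produced by the parsing failure, aborts the fit.
constant_latency = [10_800.0] * 33
try:
    validate_outcome("latency_s", constant_latency)
except DegenerateOutcomeError as err:
    print(f"model fit aborted: {err}")
```

Raising a dedicated exception, rather than logging a warning, ensures that no coefficient table or p-value is ever produced for a degenerate outcome.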
-
Reproducibility is not yet adequate for a paper whose core contribution is pipeline robustness/failure analysis: readers cannot inspect code, environment, logs, or a minimal reproducer of the Excel-format problem (Sec. 2.1–2.4, Sec. 3.2.1).
Recommendation: Provide (preferably in a public repository and/or Supplement): (1) scripts/notebooks with a commit hash, (2) computational environment (Python + package versions; OS), (3) representative log outputs showing the parsing failures and validation summaries, and (4) a small synthetic dataset that mimics the problematic Excel structure (e.g., multi-row headers, merged cells, shifted columns) so others can reproduce the failure and the fix without access to sensitive raw data. If raw data cannot be shared, state constraints explicitly and provide synthetic stand-ins + schema documentation.
-
The manuscript’s lessons are currently somewhat ad hoc and only loosely connected to established best practices in multimodal data organization/QC (e.g., BIDS-like conventions, schema validation, unit testing, standardized neuroimaging QC). This limits the “bigger picture” impact (Sec. 1, Sec. 3.4, Sec. 4.4).
Recommendation: Expand Sec. 1 and Sec. 4.4 to situate the case study in existing methodological frameworks: cite and briefly map observed failure modes (format assumptions, silent empty parses, dimension mismatches, degenerate outcomes) onto best practices such as schema validation, unit/integration tests, staged QC reports, and standardized data layouts (e.g., BIDS principles where applicable). Then translate that into a concrete checklist/table: Assumption $\rightarrow$ Quick validation $\rightarrow$ Failure signature $\rightarrow$ Automated guardrail $\rightarrow$ Remediation.
-
Internal inconsistencies in cohort summaries and reporting reduce credibility and make it harder to track what dataset underlies each stage: sex counts differ (18/15 vs 19/14) and the DNAmAge summary differs ($9.84\pm1.91$ vs $9.60\pm1.74$, with the same range). These are small but foundational bookkeeping errors for a pipeline-validation paper (Sec. 2.1 vs Sec. 3.1; Fig. 1/caption).
Recommendation: Recompute and reconcile all cohort descriptors from the final analytic dataframe used for harmonization (the $33$ bats). Report the definitive sex/origin counts and DNAmAge mean/SD/range once (and reuse consistently across Sec. 2.1, Sec. 3.1, and Fig. 1). If differences arise from rounding or different inclusion filters (e.g., before/after exclusions), label them explicitly and show a short included-vs-excluded comparison table.
-
Figures are currently not optimally serving the methodological narrative: several are low-resolution, some panels are redundant due to constants/emptiness, and some plots can be misleading (e.g., regression lines on invalid outcomes; empty TBV panels without clear cause) (Figs. 1–6; Sec. 3.2–3.3).
Recommendation: Export figures as vector (PDF/SVG) or at $\geq 300$ dpi, enlarge fonts, add panel labels, axis units, and sample-size annotations. Replace redundant constant/empty multi-panels with compact diagnostic tables (min = max, SD = 0, \% missing) and/or more informative pipeline diagnostics (e.g., printed column-name mismatches, post-merge missingness heatmaps). Remove/gray out fitted regression lines where modeling is invalid and annotate panels explicitly as “artifact of parsing failure.”