-
Corrupted/irrelevant content in the dataset demographics section undermines confidence in the manuscript. In Sec. 2.1.2, Table 1 (Participant Demographics Summary) contains unrelated text about literature search/retrieval toolkits/RAG systems and spurious citations, interleaved with partial demographic fragments. This makes it impossible to verify cohort composition and invalidates downstream demographic discussion and tests (Sec. 3.6, Sec. 4.2–4.3).
Recommendation: Rebuild Sec. 2.1.2 and replace Table 1 with a correct demographics table for this study only ($N=39$), including sex distribution, age summary (mean$\pm$SD and/or bins), and any recorded covariates used later (e.g., height/weight/BMI if analyzed). Remove all unrelated retrieval/RAG/toolkit content and spurious citations. Then audit Sec. 3.6 and Sec. 4.2–4.3 so every demographic count/percentage matches the corrected Table 1.
-
Label generation / target definition is corrupted and underspecified, making the learning problem irreproducible and potentially ill-posed. Sec. 2.1.3 is truncated and followed by a nonsensical “Table 3. Label Generation” containing repeated numeric strings. It is unclear how step annotations (timestamps) are aligned to accelerometer samples at $100~{\rm Hz}$ vs $25~{\rm Hz}$; whether labels are single-sample impulses vs intervals; what happens if timestamps fall between samples; how overlapping $2~{\rm s}$ windows are labeled; and how boundary cases (steps near window edges) are treated. Given the models predict per-sample probabilities and the paper’s central claim hinges on extreme sparsity/imbalance (Sec. 3.3), this missing detail is critical.
Recommendation: Rewrite Sec. 2.1.3 as a precise, step-by-step labeling specification: (1) annotation source and format (timestamps, heel-strike definition, per-foot vs per-stride); (2) synchronization/alignment to accelerometer indices for both sampling rates; (3) exact label target used for training (impulse at one index vs pulse width $\pm N$ samples vs Gaussian bump); (4) how labels are generated inside overlapping windows and how duplicates/overlaps are resolved; (5) handling of first/last partial windows and recording boundaries. Delete the corrupted “Table 3” and replace it with a clean schematic (timeline $+$ windowing) and/or a compact rule table.
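To illustrate the level of detail requested, the labeling rules could be written down as a short, executable specification. The sketch below is only an illustration and not the authors' procedure: the function names `make_labels` and `window_labels`, the nearest-sample rounding rule, the pulse half-width, the hop size, and the example timestamps are all assumptions.

```python
import numpy as np

def make_labels(step_times_s, n_samples, fs_hz, half_width=0):
    """Map step timestamps (seconds from recording start) to per-sample labels.

    Assumed rules (hypothetical): each timestamp is rounded to the nearest
    sample index; half_width=0 yields single-sample impulses, half_width=N
    yields a pulse of +/- N samples around each step.
    """
    labels = np.zeros(n_samples, dtype=np.float32)
    idx = np.rint(np.asarray(step_times_s, dtype=float) * fs_hz).astype(int)
    idx = idx[(idx >= 0) & (idx < n_samples)]  # drop steps outside the recording
    for i in idx:
        lo = max(i - half_width, 0)
        hi = min(i + half_width + 1, n_samples)
        labels[lo:hi] = 1.0
    return labels

def window_labels(labels, win_len, hop):
    """Slice recording-level labels into overlapping windows.

    Assumed rule: a trailing partial window is dropped, and every window
    simply inherits the recording-level labels that fall inside it.
    """
    starts = range(0, len(labels) - win_len + 1, hop)
    return np.stack([labels[s:s + win_len] for s in starts])

step_times = [0.52, 1.07, 1.63]            # made-up annotation times (s)
y100 = make_labels(step_times, n_samples=400, fs_hz=100)
y25 = make_labels(step_times, n_samples=100, fs_hz=25)
# 2 s windows at 100 Hz with an assumed 1 s hop
windows = window_labels(y100, win_len=200, hop=100)
```

Generating labels once at recording level and slicing afterwards means a step near a window edge receives the same label in every window that covers it, which is one simple way to resolve the overlap question raised above.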
-
The diagnosis “deep learning fails due to extreme class imbalance” is plausible but not yet demonstrated; alternative explanations (label misalignment, overly sparse targets, calibration/thresholding/post-processing choices) are not ruled out. The paper does not report basic imbalance statistics (positive prevalence per sample/window, steps per minute/hour, per-fold ratios), nor does it show diagnostics like probability histograms, confusion matrices, PR curves/PR-AUC, or example probability traces aligned with ground truth (Sec. 3.3; many figures show only loss). Without these, the observed “zero steps after post-processing” could stem from an overly strict fixed threshold (0.5), peak-finding settings, or attenuation from window-stitching rather than total lack of learned signal.
Recommendation: Augment Sec. 3.3 (and methods in Sec. 2.3.1–2.3.2) with decisive diagnostics: (i) report positive class prevalence (per-sample and per-window) for each condition and per fold; (ii) add sample-level PR curves and PR-AUC (preferred under extreme imbalance) and/or ROC-AUC for raw probabilities before peak detection; (iii) show confusion matrices at representative thresholds; (iv) provide representative reconstructed probability time series for hip and wrist with ground-truth step times overlaid; (v) compare trained models to a trivial “always no-step” predictor in terms of loss and PR-AUC. If PR-AUC is near random and probabilities concentrate near zero, the imbalance-collapse claim is strengthened; if PR-AUC is non-trivial, revisit post-processing/thresholding.
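As a minimal sketch of diagnostics (i), (ii), and (v), the snippet below computes per-sample prevalence, a tie-aware average precision (a step-wise estimate of PR-AUC), and the binary cross-entropy of a constant predictor. It uses NumPy only; the function names and the toy labels and scores are assumptions made for illustration and are not taken from the manuscript.

```python
import numpy as np

def prevalence(y_true):
    """Fraction of positive samples."""
    return float(np.mean(y_true))

def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve, one threshold per distinct score."""
    y_true = np.asarray(y_true, dtype=float)
    scores = np.asarray(scores, dtype=float)
    n_pos = y_true.sum()
    if n_pos == 0:
        return float("nan")
    order = np.argsort(-scores, kind="stable")
    y_sorted, s_sorted = y_true[order], scores[order]
    # last index of each run of tied scores
    last = np.r_[np.flatnonzero(np.diff(s_sorted)), len(s_sorted) - 1]
    tp = np.cumsum(y_sorted)[last]
    precision = tp / (last + 1)
    recall = tp / n_pos
    return float(np.sum(np.diff(np.r_[0.0, recall]) * precision))

def bce(y_true, p, eps=1e-7):
    """Mean binary cross-entropy with clipped probabilities."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

# Toy example: every score is below 0.5, yet the ranking is perfect.
y = np.array([0, 0, 1, 0, 0, 0, 0, 1, 0, 0])
scores = np.array([0.01, 0.02, 0.30, 0.02, 0.01, 0.01, 0.05, 0.20, 0.01, 0.01])
constant = np.full(len(y), prevalence(y))   # trivial, uninformative predictor

ap_model = average_precision(y, scores)
ap_trivial = average_precision(y, constant)  # equals the prevalence
n_above_half = int((scores >= 0.5).sum())
```

In this toy case a fixed threshold of 0.5 yields zero detections even though the scores separate the classes perfectly, which is exactly the situation the requested diagnostics would distinguish from true collapse.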
-
Deep-model post-processing appears fixed and may hard-code failure or mask partial learning. The pipeline (Sec. 2.3.1) averages overlapping window outputs (“stitching”), then applies $\texttt{find\_peaks}$ using a fixed probability threshold (reported as $0.5$) and fixed minimum peak distance ($25$ samples at $100~{\rm Hz}$, $6$ at $25~{\rm Hz}$). Under severe imbalance, probability calibration can be conservative, so meaningful peaks may exist below $0.5$; averaging across misaligned windows can further attenuate peaks. Additionally, it is unclear whether peak parameters were tuned per fold/condition (baseline is tuned, but DL post-processing seems fixed), raising fairness concerns.
Recommendation: In Sec. 2.3.1 and Sec. 3.3, (i) perform a threshold sweep (and optionally min-distance sweep) on the validation set within each training fold, report F1 vs threshold curves, and choose thresholds per condition in a leakage-free way; (ii) report sensitivity to stitching strategy (mean vs max vs median) and show whether averaging attenuates peaks; (iii) clearly state and justify the min-distance choice with respect to annotation definition (step vs stride) and plausible cadence. If no threshold yields non-trivial F1/PR-AUC, this supports true collapse; otherwise, update the conclusion from “no learning” to “learned signal but mis-calibrated/mis-postprocessed.”
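The attenuation concern in point (ii) can be checked with a few lines of code. The following is a hedged sketch, not the manuscript's stitching routine: the function `stitch`, the window length, the hop, and the toy window outputs are assumptions chosen to show how a one-sample misalignment between overlapping windows interacts with mean versus max aggregation.

```python
import numpy as np

def stitch(window_probs, hop, reduce="mean"):
    """Recombine overlapping per-window outputs into one per-sample trace.

    Assumes equal-length windows placed every `hop` samples with hop <= window length.
    """
    n_win, win_len = window_probs.shape
    total = (n_win - 1) * hop + win_len
    if reduce == "mean":
        acc = np.zeros(total)
        cnt = np.zeros(total)
        for k, w in enumerate(window_probs):
            acc[k * hop:k * hop + win_len] += w
            cnt[k * hop:k * hop + win_len] += 1
        return acc / cnt
    if reduce == "max":
        out = np.zeros(total)
        for k, w in enumerate(window_probs):
            seg = out[k * hop:k * hop + win_len]  # view into out
            np.maximum(seg, w, out=seg)
        return out
    raise ValueError(f"unknown reduce: {reduce}")

# Two overlapping windows that place the same step one sample apart.
w0 = np.zeros(8); w0[5] = 0.8   # global index 5
w1 = np.zeros(8); w1[2] = 0.8   # global index 4 + 2 = 6
window_probs = np.stack([w0, w1])

mean_trace = stitch(window_probs, hop=4, reduce="mean")
max_trace = stitch(window_probs, hop=4, reduce="max")
for thr in (0.3, 0.5):
    print(thr, bool((mean_trace >= thr).any()), bool((max_trace >= thr).any()))
```

Here the mean-stitched peak drops to 0.4 and falls below a 0.5 threshold, while the max-stitched peak stays at 0.8. Reporting this kind of comparison on real outputs would show whether stitching contributes to the zero-step result.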
-
Imbalance-mitigation exploration is too narrow to support broad claims about deep learning being unsuitable here. The CNNs are trained with only one focal-loss setting ($\gamma=2$, $\alpha=0.25$) (Sec. 2.2.4) and then declared failed. Because the imbalance here is likely far more extreme than in typical object detection, these $\alpha$ and $\gamma$ values may be inappropriate; moreover, standard alternative remedies (weighted BCE, balanced sampling, less sparse targets) are not tested.
Recommendation: Extend Sec. 2.2.4 and Sec. 3.3 with targeted ablations (even small-scale) that directly test the paper’s hypothesized failure mechanism: (1) weighted BCE with several positive-class weights; (2) focal loss sweep over $\alpha$ and $\gamma$; (3) class-balanced mini-batches or oversampling windows that contain steps; (4) label smoothing in time (pulse labels $\pm N$ samples or Gaussian bumps) to reduce “single-sample impulse” sparsity; (5) optionally a simpler formulation (window-level “contains step” classification or per-window count regression) as a sanity check that the network can extract gait periodicity. Report results in a compact table per condition (hip/wrist; $100/25~{\rm Hz}$).
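To make ablations (1) and (2) concrete, the sketch below evaluates weighted BCE and the standard focal loss, $-\alpha_t (1-p_t)^{\gamma} \log p_t$, on a collapsed predictor that outputs the prevalence everywhere. The 1% prevalence, the sweep values, and the function names are assumptions for illustration; the manuscript's actual prevalence is not reported.

```python
import numpy as np

def weighted_bce(y, p, pos_weight=1.0, eps=1e-7):
    """Mean BCE with the positive-class term scaled by pos_weight."""
    p = np.clip(p, eps, 1 - eps)
    return float(np.mean(-(pos_weight * y * np.log(p) + (1 - y) * np.log(1 - p))))

def focal_loss(y, p, alpha=0.25, gamma=2.0, eps=1e-7):
    """Mean binary focal loss; alpha weights the positive class, 1 - alpha the negative."""
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t)))

# Assumed 1% positive prevalence and a collapsed, near-zero predictor.
y = np.zeros(1000)
y[::100] = 1.0
p_collapsed = np.full_like(y, 0.01)

print("focal (alpha=0.25, gamma=2):", focal_loss(y, p_collapsed))
for alpha in (0.25, 0.75, 0.95):
    for gamma in (0.0, 1.0, 2.0):
        print(alpha, gamma, focal_loss(y, p_collapsed, alpha, gamma))
for w in (1.0, 10.0, 99.0):   # 99 is roughly (1 - prevalence) / prevalence here
    print(w, weighted_bce(y, p_collapsed, pos_weight=w))
```

Tabulating how cheaply each loss setting lets the model sit at the collapsed solution, alongside the trained results, would help show whether the single tested configuration is responsible for the failure.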
-
The manuscript's framing as a “comparative analysis” is not supported by the current deep-learning results. Because both CNNs are degenerate across all settings (Sec. 3.3), the paper cannot yet answer the stated trade-off question between lightweight DL and peak detection (accuracy vs efficiency). It also risks over-generalizing that lightweight CNNs are inherently unfit for step counting, rather than highlighting a specific pipeline/target/imbalance pitfall (Abstract, Sec. 1, Sec. 4.1–4.4).
Recommendation: Revise the framing in Abstract/Sec. 1/Sec. 4 to match what is actually established. Choose one: (i) recast the work as a negative-result diagnostic paper centered on why collapse happens and which fixes do or do not work (preferred if deep models remain weak), or (ii) demonstrate at least one non-trivial DL configuration (via the ablations above) and then present a true comparison vs the baseline, including accuracy/efficiency trade-offs. In either case, bound conclusions to the tested conditions and explicitly avoid broad statements about deep learning for step counting.
-
Core reproducibility details are missing or vague across data processing, training, evaluation aggregation, and efficiency measurement, and some manuscript metadata appears placeholder. Examples: preprocessing for DL inputs (normalization, gravity removal, axis alignment), batch size and optimizer schedule, number of windows per fold, shuffling/augmentation, initialization, regularization, and exact stitching/boundary handling are not fully specified (Sec. 2.1–2.3). Inference time reporting is ambiguous about what is included (windowing $+$ stitching $+$ peak detection vs forward pass only) and lacks a comparable runtime for the baseline (Sec. 3.4). The unstructured report also notes placeholder author/affiliation text, which is a serious presentation issue.
Recommendation: Add a detailed reproducibility block in Sec. 2 (or Appendix) covering: (i) exact preprocessing for CNN inputs (normalization strategy, filtering if any, gravity handling); (ii) training hyperparameters (batch size, optimizer, LR schedule, epochs/early stopping, seeds, regularization); (iii) number of windows/samples per fold and how validation is drawn within training folds; (iv) exact stitching algorithm and handling of first/last windows; (v) metric aggregation procedure (per recording vs per subject vs per fold) (Sec. 2.3.2–2.3.3); (vi) inference-time protocol including all steps and hardware/software versions, plus baseline runtime on the same setup (Sec. 3.4). Also replace any placeholder author/affiliation text and run an end-to-end manuscript assembly audit to prevent table/section corruption.
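For item (vi), one possible timing protocol is sketched below: the same harness wraps the complete pipeline (windowing, forward pass, stitching, peak detection) for the CNNs and the complete baseline, with warm-up runs excluded and the median reported. The function `time_pipeline`, the warm-up and repeat counts, and the stand-in workload are assumptions; they are not the manuscript's measurement setup.

```python
import time
import statistics

def time_pipeline(fn, *args, warmup=3, repeats=20):
    """Time fn(*args) end to end; returns median and minimum wall-clock seconds."""
    for _ in range(warmup):          # warm-up runs are not recorded
        fn(*args)
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - t0)
    return {
        "median_s": statistics.median(samples),
        "min_s": min(samples),
        "n": repeats,
    }

# Stand-in workload; in practice pass the full DL pipeline and the full baseline
# with the same input recording, on the same hardware.
result = time_pipeline(sum, list(range(10_000)))
print(result)
```

Reporting both methods through one harness removes the ambiguity about which stages are included in the inference-time figures.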