-
Data provenance, protocol, and ground-truth step annotation/synchronization are under-specified, limiting reproducibility and external validity (Sec. 2.1, Sec. 2.2.2, Sec. 3.1). It is unclear what study the $39$ participants come from, what activities/environments were included (e.g., treadmill vs free-living; walking only vs mixed tasks), how step events were obtained (manual video, foot sensors, device algorithm), annotation resolution/quality control, and how hip and wrist streams (and $100$ vs $25~\mathrm{Hz}$ streams) were time-aligned and drift-corrected. These details are critical because labels are derived from step timestamps.
Recommendation: Expand Sec. 2.1 and Sec. 2.2.2 to document: (i) dataset/study origin (new vs reused; cite source if reused), setting and activity protocol, recording durations, inclusion/exclusion; (ii) step annotation method, definition of a “step event,” temporal resolution, QC/validation; (iii) synchronization/alignment procedure across sensors and sampling rates (timestamping, drift handling). Explicitly describe how ambiguous segments (turns, transitions, stairs, shuffling, arm-only motion) were handled, and discuss likely label noise implications in Sec. 3.6/4.
-
Unit of analysis and dependence: AUC is computed once per feature/condition on pooled, overlapping windows across all participants (Sec. 2.4.1, Sec. 3.1–3.4). With $50\%$ overlap and strong within-subject temporal correlation, treating $\sim$hundreds of thousands of windows as independent can inflate effective sample size, obscure between-participant heterogeneity, and overstate “robustness”/generalization to unseen individuals.
Recommendation: Revise Sec. 2.4.1 to adopt a participant-wise evaluation: report per-participant AUCs (or leave-one-subject-out / participant-wise $k$-fold CV) and summarize distributions (mean$\pm$SD or median[IQR]) for each feature and condition in Sec. 3.2–3.4. If you keep a pooled AUC, also report participant-level results as the primary robustness evidence and use pooled results only as a descriptive complement.
-
No uncertainty quantification or formal comparisons for AUC differences (Sec. 2.4.1–2.4.2, Sec. 3.2–3.4, Sec. 3.6). Claims about “minimal degradation,” “robustness,” or location/sampling effects rely on point estimates without confidence intervals/tests, making it unclear which differences are statistically/practically meaningful.
Recommendation: Add uncertainty estimates using a clustering-respecting method: e.g., bootstrap resampling by participant (not by window) to obtain $95\%$ CIs for AUCs and key contrasts (hip vs wrist; $100$ vs $25$; top features vs others). If using participant-level AUCs, perform paired comparisons across participants (with multiple-comparison control where appropriate). Update Sec. 3.6 and Sec. 4 to ground robustness statements in these intervals.
-
Task/label definition creates a mismatch with “step counting” claims: windows are labeled “Step” if $\geq1$ step occurs in a $2$ s window (Sec. 2.2.2), which conflates step presence with step density and “walking-like” movement. Many features (SD/IQR/spectral energy) will scale with the number of steps and intensity within the window; windows with $1$ step vs $4$ steps are treated identically, and mixed-activity windows are possible. This limits how directly conclusions translate to event-level step detection/counting (Sec. 1, Sec. 3.6, Sec. 4).
Recommendation: Clarify throughout (Abstract, Sec. 1, Sec. 3.6, Sec. 4) that the evaluated task is window-level discrimination of step-containing windows (a proxy for walking/movement bouts), not end-to-end step counting. In Sec. 2.2.2 and Sec. 3.1, report the distribution of steps-per-positive-window and discuss mixed-window prevalence and boundary effects. If feasible, add a small sensitivity analysis: alternative labeling rules (e.g., $\geq2$ steps; or majority-of-samples walking), and/or a regression to predict step count per window, to test whether the main feature rankings and “$25~\mathrm{Hz}$ sufficiency” claims hold.
-
Feature definitions and signal-processing details are incomplete, and some are internally inconsistent across sampling rates (Sec. 2.2.1–2.3.2). In particular: (i) ENMO is written with gravity subtraction inside the square root (dimensionally inconsistent) (Intro p.2; Sec. 2.2.1 p.2–3); (ii) it is unclear whether ENMO is truncated at $0$ (common ENMO definition); (iii) peak detection is under-specified (min distance, edge handling, filtering/smoothing) and is likely sampling-rate sensitive; (iv) FFT features lack details (demeaning/DC handling, window function, scaling/normalization, bin selection for $0.5\text{--}3~\mathrm{Hz}$ at $25$ vs $100~\mathrm{Hz}$); (v) spectral energy wording conflicts (“power spectrum” vs “squared magnitudes,” potentially implying $|\mathrm{FFT}|^4$) (Sec. 2.3.2).
Recommendation: Fix and standardize definitions in Sec. 2.2.1–2.3.2: (a) correct ENMO with explicit parentheses and units, e.g., $\mathrm{ENMO} = \max(0, \|\mathbf{a}\| - 1g)$ if truncation is intended; use the same expression everywhere; (b) specify whether $25~\mathrm{Hz}$ is native or downsampled, and if downsampled, detail anti-alias filtering/decimation (Sec. 2.2.2); (c) fully specify peak detection algorithm and parameters (library/function, prominence, min peak distance in seconds, preprocessing, boundary treatment) (Sec. 2.3.1); (d) fully specify FFT pipeline (demean/detrend, DC exclusion, window type, zero padding, PSD normalization) and give explicit formulas for dominant frequency, spectral energy (as a band power/integral), and spectral entropy (including probability normalization and log base) (Sec. 2.3.2). Ensure frequency-band implementation is comparable across $25/100~\mathrm{Hz}$ (bin mapping).
-
The demographic subgroup analysis failed due to a metadata merge/parsing error, but the failure mode and data-integrity implications are not sufficiently diagnosed (Sec. 2.4.3, Sec. 3.1, Sec. 3.5). Without clear verification, readers cannot be sure only demographics were impacted (and not participant IDs, file-to-subject mapping, or label alignment).
Recommendation: In Sec. 2.4.3 and Sec. 3.5, provide a concrete diagnosis (which keys failed, how IDs differed, what rows were missing/duplicated). Explicitly document integrity checks that confirm the main feature–label dataset is correct (participant-level counts, file hashes/checksums, sanity checks on time ranges). If any participants/sessions were excluded, list them. If feasible, fix the merge and include at least a minimal subgroup summary; otherwise, strengthen generalization caveats in Sec. 4.
-
Internal inconsistency in reported step-event totals: $48,721$ (Methods Sec. 2.1) vs $62,904$ (Results Sec. 3.1). This undermines trust in dataset accounting and downstream results.
Recommendation: Reconcile these numbers by explicitly stating scopes/filters for each count (e.g., before/after exclusions, per-condition vs combined, annotation revisions). Correct the manuscript so the same quantity is reported consistently, and ensure any derived percentages in Sec. 3.1 match the corrected totals.
-
Conclusions about robustness and the sufficiency of $25~\mathrm{Hz}$ are broader/stronger than warranted given the described cohort ($39$ healthy adults), unclear activity context, missing subgroup analyses, and the current pooled-window evaluation (Sec. 3.6, Sec. 4).
Recommendation: In Sec. 3.6 and Sec. 4, restrict claims to the study conditions once fully specified (protocol, population, annotation method). Rephrase statements implying broad deployment across “diverse wearable applications” to note that validation is still needed in older adults, clinical/pathological gait, and multi-day free-living settings. Tie any “$25~\mathrm{Hz}$ is sufficient” claim to specific features and to participant-level uncertainty estimates (after addressing the evaluation issues above).