-
Ambiguity and possible invalidity in participant-level step reconstruction under overlapping windows. The method uses $2$ s windows with a $1$ s stride ($50\%$ overlap) and states that participant total steps are obtained by summing window-level predictions (Sec. 2.2.1–2.2.2, Sec. 2.4.2, Sec. 3.1). With overlapping windows, naive summation can double-count contributions unless the labeling/aggregation is defined to be overlap-consistent (e.g., each true step assigned to exactly one window, or overlap-corrected weighting); at $50\%$ overlap every interior sample falls in two windows, so a step counted in each window that contains it would contribute roughly twice to the sum. As written, it is unclear whether each ground-truth step can appear in multiple window labels; if so, both training targets and reconstructed totals scale with the chosen overlap, and cross-configuration comparisons may be biased.
Recommendation: In Sec. 2.2–2.4, explicitly define (i) how step timestamps are mapped to windows under overlap (including boundary steps), (ii) whether a step can contribute to multiple window labels, and (iii) the exact aggregation used to recover participant totals from overlapping predictions. Ensure the aggregation is mathematically consistent (e.g., use non-overlapping windows for total reconstruction; or assign each step to a unique window; or apply a principled weighting such that summing yields an unbiased total). Add a small schematic/pseudocode example (appendix is fine) and explicitly confirm that Hip $100$ Hz LOSO totals are computed only from held-out predictions using the same policy.
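To make the concern concrete, the following is a minimal sketch of two labeling policies under $2$ s windows with a $1$ s stride. It is not the authors' pipeline: the function names, the half-open window convention, and the rule "assign each step to the window whose first stride contains it" are all assumptions chosen for illustration.

```python
import math

def window_starts(duration_s, win_s=2.0, stride_s=1.0):
    """Start times of all full windows that fit in the recording."""
    n = int(math.floor((duration_s - win_s) / stride_s)) + 1
    return [i * stride_s for i in range(max(n, 0))]

def naive_labels(step_times, starts, win_s=2.0):
    """Count a step in every window [s, s + win_s) that contains it."""
    return [sum(1 for t in step_times if s <= t < s + win_s) for s in starts]

def unique_labels(step_times, starts, stride_s=1.0, win_s=2.0):
    """Assign each step to exactly one window (assumed policy):
    the window whose first stride covers the step; steps in the tail
    of the recording go to the last window, which still contains them."""
    labels = [0] * len(starts)
    last = len(starts) - 1
    for t in step_times:
        if t < 0 or t >= starts[last] + win_s:
            continue  # step lies outside every window
        idx = min(int(math.floor(t / stride_s)), last)
        labels[idx] += 1
    return labels

# Hypothetical 10 s recording with five annotated steps.
steps = [0.5, 1.5, 2.5, 3.5, 9.5]
starts = window_starts(10.0)
naive_total = sum(naive_labels(steps, starts))
unique_total = sum(unique_labels(steps, starts))
```

In this toy case the naive policy sums to 8 while the unique-assignment policy recovers the 5 annotated steps, which is the kind of discrepancy the manuscript should rule out explicitly.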
-
Framing as a “transfer learning framework” is not matched by the implemented methods: the paper evaluates only zero-shot transfer of a single LightGBM model trained on Hip $100$ Hz (Sec. 1, Sec. 2.3–2.4, Sec. 4). This is still a useful robustness evaluation, but the current narrative and title may overstate methodological novelty and generality.
Recommendation: Either (A) reframe throughout (title, abstract, Sec. 1, Sec. 4) as a systematic evaluation/benchmark of zero-shot cross-configuration robustness for step counting, or (B) add at least one concrete adaptation baseline consistent with “transfer learning” (e.g., simple target calibration, CORAL/feature alignment, per-configuration normalization, small labeled fine-tuning, or multi-configuration training) and compare against zero-shot transfer.
-
Insufficient specification of window labels, regression target, and output post-processing. It is unclear whether zero-step windows are included, how negative predictions are handled (regression can output negatives), whether predictions are rounded/clipped before aggregation, and how boundary cases are treated (Sec. 2.2.1–2.2.2, Sec. 2.4.2, Sec. 3.1). These choices directly affect systematic underestimation and the participant-level totals.
Recommendation: In Sec. 2.2–2.4, document: inclusion/exclusion of zero-step windows; whether targets are integers or real-valued; whether model outputs are constrained (clip to $\geq 0$, rounding strategy); and how boundary steps are counted. Report an ablation or sensitivity check showing how these choices affect MAE/MAPE and bias (especially underestimation) across configurations.
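A small sketch of why these post-processing choices matter for participant totals. The prediction values and the `aggregate` helper are hypothetical; the three policies shown (no constraint, clip at zero, clip plus rounding per window or on the total) are examples of what the manuscript could report, not a description of what it does.

```python
def aggregate(preds, clip=True, rounding="none"):
    """Sum window-level predictions into a participant total.

    clip:     if True, negative window predictions are set to 0.
    rounding: "none", "window" (round each window), or "total"
              (round only the final sum). Python's round() uses
              round-half-to-even.
    """
    total = 0.0
    for p in preds:
        if clip:
            p = max(p, 0.0)
        if rounding == "window":
            p = round(p)
        total += p
    if rounding == "total":
        total = round(total)
    return total

# Hypothetical regression outputs for five windows of one participant.
window_preds = [-0.3, 0.4, 2.6, 3.4, 0.4]

totals = {
    "raw": aggregate(window_preds, clip=False),
    "clipped": aggregate(window_preds, clip=True),
    "clipped_round_window": aggregate(window_preds, rounding="window"),
    "clipped_round_total": aggregate(window_preds, rounding="total"),
}
```

On these five values the total ranges from 6 (per-window rounding) to 7 (rounding the sum), with the unrounded totals at about 6.5 and 6.8. Per-window rounding in particular pushes small positive predictions to zero, which could contribute to systematic underestimation.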
-
Missing key modeling/training details hinder reproducibility and interpretation of failure modes. LightGBM hyperparameters, objective, number of trees/iterations, regularization, early stopping, feature handling (scaling/normalization), random seeds, and any hyperparameter tuning protocol are not adequately reported (Sec. 2.2.2, Sec. 2.3.1, Sec. 3.1). It is also ambiguous whether cross-configuration results use one final model trained on all Hip $100$ Hz data or fold-specific LOSO models reused for target inference.
Recommendation: Add a dedicated subsection (Sec. 2.3) listing all LightGBM settings (e.g., boosting type, objective, $\text{learning\_rate}$, $\text{n\_estimators}$, $\text{num\_leaves}$, $\text{max\_depth}$, $\text{min\_data\_in\_leaf}$, row subsampling and $\text{feature\_fraction}$, L1/L2 regularization), tuning method (defaults vs. search; nested CV or not), early stopping, and random seed(s). Clearly state whether target inference uses a single final Hip $100$ Hz model or the ensemble of LOSO-trained folds; ensure the policy is consistent and justified.
-
Sampling-rate handling ($25$ Hz) and frequency-feature definitions are under-specified, threatening comparability across configurations. It is unclear whether $25$ Hz is natively recorded or downsampled from $100$ Hz, and if downsampled whether anti-alias filtering was applied (Sec. 2.1, Sec. 2.2.2). FFT-based features (dominant frequency, spectral energy) depend on sampling rate, window length, FFT normalization, and one-sided vs two-sided spectra; without precise definitions, feature scale may change purely due to sampling-rate differences, confounding “transfer failure” conclusions (Sec. 2.2.2, Sec. 3.2–3.3).
Recommendation: In Sec. 2.1 and Sec. 2.2.2, specify the acquisition path for $25$ Hz (native vs resampled) and the exact resampling pipeline (filter type/order/cutoff). Precisely define FFT computation and spectral energy (normalization, one- vs two-sided spectrum, scaling with $N$). Consider normalizing spectral features to be sampling-rate comparable (e.g., power spectral density or per-Hz normalization) and report whether such normalization reduces Hip $25$ Hz degradation.
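The sampling-rate dependence can be illustrated with a short sketch. It assumes, purely for illustration, that "spectral energy" means the unnormalized sum of squared DFT magnitudes over the one-sided bins; the manuscript does not define it, so this may not match the authors' feature. The same $2$ Hz sine is sampled ideally at $100$ Hz and $25$ Hz over a $2$ s window (no decimation or anti-alias filter is modeled), and a naive DFT is used to keep the example dependency-free.

```python
import cmath
import math

def one_sided_power(x):
    """|X_k|^2 for k = 0..N//2 using a naive O(N^2) DFT."""
    n = len(x)
    power = []
    for k in range(n // 2 + 1):
        xk = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        power.append(abs(xk) ** 2)
    return power

def raw_spectral_energy(x):
    """Assumed feature definition: unnormalized sum of squared magnitudes."""
    return sum(one_sided_power(x))

def psd_total_power(x, fs):
    """Integrate a one-sided periodogram PSD over frequency.
    By Parseval's theorem this equals the mean squared signal value."""
    n = len(x)
    psd = [p / (fs * n) for p in one_sided_power(x)]
    for k in range(1, len(psd)):
        is_nyquist = (n % 2 == 0 and k == n // 2)
        if not is_nyquist:
            psd[k] *= 2.0  # fold negative frequencies into the one-sided PSD
    return sum(psd) * (fs / n)  # bin width is fs / n

def sine_window(fs, freq_hz=2.0, duration_s=2.0):
    n = int(fs * duration_s)
    return [math.sin(2 * math.pi * freq_hz * t / fs) for t in range(n)]

x100 = sine_window(100)
x25 = sine_window(25)
energy_ratio = raw_spectral_energy(x100) / raw_spectral_energy(x25)
power_100 = psd_total_power(x100, 100)
power_25 = psd_total_power(x25, 25)
```

Under these assumptions the raw energy differs by a factor of about 16 between the two rates (it grows with $N^2$ for a pure tone), while the PSD-integrated power is about 0.5 in both cases. A tree model trained on the unnormalized feature at $100$ Hz would therefore see out-of-range values at $25$ Hz for an identical physical signal.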
-
Lack of critical baselines limits interpretability of the observed degradation. Without (i) in-domain models trained/evaluated within each target configuration and (ii) a simple signal-processing/heuristic step counter, it is unclear whether target settings are intrinsically harder or whether the degradation is primarily cross-configuration mismatch (Sec. 2.3–2.4, Sec. 3.2–3.3, Sec. 4). Considering only one model family (LightGBM) and only one source domain (Hip $100$ Hz) further limits generality.
Recommendation: Add: (1) LOSO within-configuration baselines for Hip $25$ Hz, Wrist $100$ Hz, Wrist $25$ Hz (same features/model) to quantify “best achievable” within each configuration; (2) at least one simple heuristic baseline (e.g., band-pass filtering $+$ peak detection on the signal vector magnitude, SVM) per configuration; and ideally (3) a second model class (e.g., random forest or linear model) or an alternate source configuration (e.g., Wrist $100$ Hz) to test whether failure modes are model/source-specific. If infeasible, narrow claims in Sec. 4 to this specific LightGBM/Hip $100$ Hz setup and elevate missing baselines as a primary limitation.
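As an indication of how little is needed for the heuristic baseline in (2), here is a rough sketch. It substitutes a crude moving-average detrend and smooth for a proper band-pass filter, and the threshold, minimum step interval, and function names are illustrative assumptions that would need tuning per configuration.

```python
import math

def moving_average(x, k):
    """Centered moving average of odd length k; edges use available samples."""
    half = k // 2
    out = []
    for i in range(len(x)):
        lo, hi = max(0, i - half), min(len(x), i + half + 1)
        out.append(sum(x[lo:hi]) / (hi - lo))
    return out

def count_steps(ax, ay, az, fs, threshold=0.05, min_step_interval_s=0.25):
    """Heuristic step counter on the signal vector magnitude.
    Stand-in for band-pass + peak detection (assumed parameters)."""
    mag = [math.sqrt(x * x + y * y + z * z) for x, y, z in zip(ax, ay, az)]
    trend = moving_average(mag, int(fs) | 1)            # ~1 s: removes gravity/drift
    detrended = [m - t for m, t in zip(mag, trend)]
    smooth = moving_average(detrended, int(0.1 * fs) | 1)  # ~0.1 s: removes jitter
    min_gap = int(min_step_interval_s * fs)
    count, last_peak = 0, -min_gap
    for i in range(1, len(smooth) - 1):
        is_peak = smooth[i] > smooth[i - 1] and smooth[i] >= smooth[i + 1]
        if is_peak and smooth[i] > threshold and i - last_peak >= min_gap:
            count += 1
            last_peak = i
    return count

# Synthetic check: 10 s at 100 Hz, 2 steps per second on the vertical axis.
fs = 100
n = 10 * fs
az = [1.0 + 0.3 * math.sin(2 * math.pi * 2.0 * t / fs) for t in range(n)]
ax = [0.0] * n
ay = [0.0] * n
detected = count_steps(ax, ay, az, fs)
```

On this clean synthetic signal the sketch detects the 20 simulated steps; real hip and wrist data would of course be harder, which is exactly what such a baseline would quantify per configuration.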
-
Error analysis is not deep enough to support some of the broader interpretations (including demographic conclusions). The paper emphasizes MAE/MAPE/Std_AE and qualitative underestimation claims but does not report signed bias, predicted-vs-true correlation, or error dependence on total steps/activity composition (Sec. 3.3–3.5). Demographic analysis (sex, coarse age bins) is underpowered at $n=39$ and “non-significant” results are described in a way that may be read as “equitable/independent,” which is stronger than supported (Sec. 2.6.2, Sec. 3.4, Sec. 4).
Recommendation: In Sec. 3.2–3.5, add: mean signed error (bias) and its CI; predicted vs. true scatter plots with Pearson/Spearman and/or $R^2$ per configuration; error vs. true total steps plots; and (if available) bout-level or intensity-stratified analysis to localize failure (missed bouts vs within-bout undercounting). For demographics, report effect sizes (e.g., rank-biserial correlation, eta-squared) and confidence intervals, and rephrase conclusions as “no evidence detected” given limited power; optionally use regression/ANCOVA controlling for total steps or other covariates if available.
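The signed-bias summary requested above is inexpensive to compute. The sketch below uses a percentile bootstrap over participants; the participant totals are made-up numbers, and the helper names, resample count, and seed are assumptions for illustration.

```python
import random

def mean_signed_error(pred, true):
    """Mean of (predicted - true); negative values indicate underestimation."""
    return sum(p - t for p, t in zip(pred, true)) / len(true)

def bootstrap_ci(pred, true, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean signed error,
    resampling participants with replacement."""
    rng = random.Random(seed)
    errors = [p - t for p, t in zip(pred, true)]
    n = len(errors)
    means = sorted(sum(rng.choices(errors, k=n)) / n for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical participant-level totals (not data from the manuscript).
true_totals = [1000, 1500, 2000, 2500, 3000]
pred_totals = [900, 1400, 1850, 2300, 2900]

bias = mean_signed_error(pred_totals, true_totals)
ci_low, ci_high = bootstrap_ci(pred_totals, true_totals)
```

Reporting `bias` with its interval alongside MAE makes the direction of error explicit: here the bias is $-130$ steps and the whole interval lies below zero, which would support an underestimation claim in a way that MAE alone cannot.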
-
Dataset/protocol description is insufficient to judge generalizability. Critical context is missing: recording duration (per participant, range), number of sessions, free-living vs scripted activities, activity mix, device make/model and dynamic range, placement/orientation details, non-wear handling, and annotation procedure quality (synchronization, number of annotators, reliability) (Sec. 2.1, Sec. 2.1.1).
Recommendation: Expand Sec. 2.1–2.1.1 with: device specs (make/model, range), placement/orientation (hip side, wrist side), duration and sessions statistics, activity contexts and mix, missing-data/non-wear criteria, and ground-truth annotation workflow (synchronization method, annotators, inter-rater agreement if available). In Sec. 4, explicitly scope generalization claims to similar populations/devices/contexts.
-
Related work and positioning are not sufficiently focused on cross-placement/cross-device step counting and wearable domain adaptation; some citations appear tangential (Sec. 1, Sec. 4). This weakens the “bigger picture” justification and makes it harder to see what is new beyond the empirical finding that naïve transfer fails.
Recommendation: Add a structured Related Work section (Sec. 1 or new Sec. 2.x) covering: (a) traditional pedometer/step-counting on hip vs wrist, (b) prior cross-placement generalization studies in step counting/activity recognition, and (c) wearable domain adaptation/transfer learning methods. Replace/relocate tangential references unless directly connected to the paper’s methods or hypotheses. Clearly state the paper’s novelty (dataset setup $+$ evaluation protocol $+$ quantified degradation) and limit claims accordingly.