-
The reported empirical results do not come from executed experiments: Sec. 3 states that supervised fine-tuning/evaluation could not be completed due to a data processing error and that quantitative outcomes are “hypothesized”, yet the manuscript reports specific performance numbers, standard deviations, and statistical significance claims (Tables 1–2; Abstract; Sec. 3 narrative; Secs. 4.3–4.4) as if they were measured. This goes beyond “overclaiming”: the manuscript presents fabricated metrics, which invalidates it as an experimental contribution.
Recommendation: Choose one of the following and revise consistently end-to-end (Abstract $\rightarrow$ Sec. 4): (a) Fix the processing/training pipeline, rerun all experiments, and replace all hypothesized numbers with real outputs (per-fold + aggregated), including uncertainty (e.g., bootstrap CIs over participants) and properly specified statistical tests; or (b) remove Tables 1–2 and all numeric/statistical claims, and reframe the paper explicitly as a methodological/protocol + qualitative analysis paper (UMAP only), with claims limited to what is actually demonstrated. In either case, perform a line-by-line audit of Abstract, Sec. 3, and Secs. 4.3–4.4 to ensure there is no language implying validated quantitative superiority when none is presented.
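For option (a), the uncertainty reporting could look like the following minimal sketch of a percentile bootstrap over participants. It assumes one aggregated metric value per participant; the function name, the number of resamples, and the example values are illustrative placeholders, not results or settings from the manuscript.

```python
import random
import statistics


def bootstrap_ci(per_participant, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of a participant-level metric.

    Participants (not windows) are the resampling unit, so within-participant
    correlation does not artificially narrow the interval.
    """
    rng = random.Random(seed)
    n = len(per_participant)
    means = []
    for _ in range(n_boot):
        # Resample participants with replacement and recompute the mean.
        sample = [per_participant[rng.randrange(n)] for _ in range(n)]
        means.append(statistics.fmean(sample))
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return statistics.fmean(per_participant), lo, hi


# Placeholder per-participant F1 values, for illustration only.
example_f1 = [0.80, 0.85, 0.90, 0.75, 0.95]
mean, lo, hi = bootstrap_ci(example_f1)
print(f"mean={mean:.3f}, 95% CI=[{lo:.3f}, {hi:.3f}]")
```

Resampling at the participant level matches the participant-grouped CV design; resampling individual windows would treat correlated windows as independent.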
-
Dataset and labeling protocol are under-specified, preventing interpretation and reproducibility. The manuscript does not clearly define what a “step event” label corresponds to (heel strike vs toe-off vs another gait event), how labels were obtained (e.g., motion capture, pressure insoles, manual annotation), expected timing accuracy, and the recording protocol (activities, environment, duration, walking/non-walking distribution). This is critical because the evaluation uses a $\pm0.2\,\mathrm{s}$ tolerance and the model predicts single-sample impulses (Secs. 2.1.1–2.1.2, 2.4.1).
Recommendation: Expand Sec. 2.1.1–2.1.2 with: dataset name/source (and whether it is used as-is or subset/merged); sensor setup and synchronization assumptions (hip vs wrist recorded simultaneously or treated independently); activity protocol and duration per participant; definition of step event (exact biomechanical event); label acquisition method and timing precision; and class composition (approximate walking vs non-walking share). If labels come from another paper/dataset, explicitly cite and summarize the annotation procedure and known error bounds.
-
Potential split/leakage ambiguity for SSL pretraining: the paper uses participant-grouped CV for supervised fine-tuning, but it appears SSL pretraining may be done using “all participants within each sensor condition” (Secs. 2.2, 2.3.1). If unlabeled data from validation participants is used for representation learning, downstream evaluation becomes transductive with respect to the validation set, which can inflate performance and changes the claim being made.
Recommendation: Make the SSL pretraining/evaluation setting explicit in Sec. 2.2 and Sec. 2.3.1 and align the pipeline accordingly: (i) Preferred for clean claims: within each CV fold, pretrain SSL using only the fold’s training participants (unlabeled), then fine-tune/evaluate on held-out participants; or (ii) If pretraining globally on all participants, clearly state this is transductive w.r.t. unlabeled validation data, justify why it matches the intended deployment, and (if possible) add a comparison to fold-restricted pretraining to quantify the effect.
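Option (i) can be sketched as follows. This is a simplified illustration, assuming windows are tagged with a participant ID; the function names and the number of folds are hypothetical, and the demographic stratification described in Sec. 2.3.1 is omitted for brevity.

```python
def grouped_folds(participant_ids, n_folds):
    """Yield (train_participants, val_participants) with no participant overlap."""
    unique = sorted(set(participant_ids))
    for k in range(n_folds):
        val = set(unique[k::n_folds])  # every n_folds-th participant
        train = set(unique) - val
        yield train, val


def fold_restricted_plan(participant_ids, n_folds=5):
    """For each fold, list the window indices usable at each pipeline stage.

    SSL pretraining and fine-tuning see only training participants; windows
    from held-out participants are reserved for evaluation, even unlabeled.
    """
    plans = []
    for train, val in grouped_folds(participant_ids, n_folds):
        train_idx = [i for i, p in enumerate(participant_ids) if p in train]
        val_idx = [i for i, p in enumerate(participant_ids) if p in val]
        plans.append({
            "ssl_pretrain": train_idx,  # unlabeled use, training participants only
            "fine_tune": train_idx,     # labeled use, same participants
            "evaluate": val_idx,        # never seen during pretraining
        })
    return plans


# Toy example: 6 windows from 3 hypothetical participants, 3 folds.
ids = ["p1", "p1", "p2", "p2", "p3", "p3"]
for plan in fold_restricted_plan(ids, n_folds=3):
    print(plan)
```

Under option (ii), the "ssl_pretrain" list would instead contain every window index, which is exactly the transductive setting that would need to be disclosed.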
-
Evaluation protocol is ambiguous in ways that can materially change reported F1/MAE/MAPE and any fairness/robustness conclusions: (a) peak-finding parameters and thresholding are underspecified (Sec. 2.4.1: threshold given only as an example “e.g., 0.5”); (b) the one-to-one matching policy within the $\pm0.2\,\mathrm{s}$ tolerance is not defined (multiple predicted peaks near one true step and vice versa); (c) aggregation over windows/participants/folds (macro vs micro) is unclear; (d) the Wilcoxon testing plan lacks unit of analysis and multiplicity handling (Secs. 2.4.1–2.4.3).
Recommendation: Rewrite Sec. 2.4.1–2.4.3 to be fully operational: specify the exact peak detection implementation (library/function), threshold selection procedure (fixed vs tuned; where tuned), minimum peak distance (in seconds/samples), and any smoothing. Define the matching algorithm (e.g., one-to-one greedy nearest-neighbor with exclusivity, or Hungarian) so TP/FP/FN are well-defined. In Sec. 2.4.2 specify precisely how MAE/MAPE and F1 are computed per participant and then aggregated across participants and folds (macro vs micro). In Sec. 2.4.3 define the paired samples for Wilcoxon (e.g., participant-level metric averaged over that participant’s validation appearances), $\alpha$ level, and multiple-comparison correction across sensor conditions/metrics.
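To illustrate why the matching policy matters, here is a minimal sketch of one possible one-to-one greedy nearest-neighbor matcher with exclusivity. It assumes predicted and true events are given as times in seconds; the function names are hypothetical, and this is one candidate definition, not the manuscript's procedure.

```python
def match_events(pred_times, true_times, tol=0.2):
    """Greedy one-to-one matching of predicted to true events within +/- tol s.

    Candidate pairs are accepted in order of increasing time difference, and
    each prediction and each true event can be used at most once.
    """
    pairs = sorted(
        (abs(p - t), i, j)
        for i, p in enumerate(pred_times)
        for j, t in enumerate(true_times)
        if abs(p - t) <= tol
    )
    used_pred, used_true = set(), set()
    for _, i, j in pairs:
        if i not in used_pred and j not in used_true:
            used_pred.add(i)
            used_true.add(j)
    tp = len(used_pred)
    fp = len(pred_times) - tp  # unmatched predictions
    fn = len(true_times) - tp  # unmatched true events
    return tp, fp, fn


def f1_score(tp, fp, fn):
    """F1 from event counts; defined as 0.0 when there are no events at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Two predictions fall near the same true step; only one may count as a TP.
tp, fp, fn = match_events([1.00, 1.05, 2.30], [1.02, 2.00])
print(tp, fp, fn)            # 1 2 1
print(f1_score(tp, fp, fn))  # 0.4
```

Without exclusivity, both predictions near 1.02 s would count as true positives and F1 would be inflated. Note that greedy matching may not maximize the number of matched pairs in every configuration; an optimal assignment (e.g., Hungarian) can differ, which is another reason to state the choice explicitly.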
-
Core model/training details are insufficient for reproducibility and for judging whether the design is appropriate for extremely sparse targets (especially at $25\,\mathrm{Hz}$). Missing: architecture specifics (channels/kernels/strides/padding, U-Net depth, skip connections), optimizer and schedule, batch size, epochs, NT-Xent temperature, projection head details, normalization, weight decay/dropout, and Focal Loss parameters ($\gamma$ and $\alpha$/class weighting) (Secs. 2.2.2, 2.3.2–2.3.3).
Recommendation: Add a compact but complete implementation specification (Secs. 2.2.2 and 2.3.2–2.3.3 or an Appendix): layer-by-layer encoder and decoder tables; projection head definition (if any) and embedding dimension; training hyperparameters for SSL and fine-tuning (optimizer, LR, schedule, epochs, batch size, temperature, augmentations and their magnitudes/probabilities); Focal Loss equation and the exact $\gamma/\alpha$ used; early stopping criteria; and compute budget (hardware/runtime). This is especially important given label sparsity (single-sample positives at $25\,\mathrm{Hz}$).
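For reference, the standard binary Focal Loss is $\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log p_t$, where $p_t$ is the predicted probability of the true class. The sketch below evaluates it per sample; the default values $\gamma = 2$ and $\alpha = 0.25$ are the commonly cited defaults from the original Focal Loss paper and are placeholders here, since the manuscript does not report its own values.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Per-sample binary focal loss.

    p: predicted probability of the positive class (step event).
    y: ground-truth label, 1 for a step sample and 0 otherwise.
    gamma, alpha: placeholder values, not taken from the manuscript.
    """
    p = min(max(p, eps), 1.0 - eps)          # avoid log(0)
    p_t = p if y == 1 else 1.0 - p           # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confidently correct negative contributes far less than a missed positive.
print(binary_focal_loss(0.05, 0))  # easy negative
print(binary_focal_loss(0.05, 1))  # hard (missed) positive
```

With $\gamma = 0$ and $\alpha = 0.5$ the expression reduces to half the usual binary cross-entropy, which is a useful sanity check when reporting the exact configuration.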
-
Windowing/label construction choices are not fully specified and may create boundary artifacts and optimization instability: SSL windows are stated as non-overlapping $2.56\,\mathrm{s}$ (Sec. 2.2.1), but supervised window stride/overlap is unclear; handling of steps at window boundaries is unspecified; and it is unclear whether training uses any label dilation/smoothing or only a single $1$ at the exact sample index, which is extremely sparse at $25\,\mathrm{Hz}$ (Secs. 2.3.2, 2.4).
Recommendation: In Sec. 2.3.2 explicitly state supervised training/inference window stride/overlap and how boundary events are handled. Specify how timestamps map to sample indices (rounding policy) and how multiple steps within a window are encoded. Consider (and report) a sensitivity/ablation: overlap vs non-overlap; label dilation (mark $\pm k$ samples positive) vs single-sample impulses; and/or training directly on event times (if applicable). At minimum, justify the current choices and discuss expected boundary effects.
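The level of detail requested can be illustrated with a minimal label-construction sketch. It assumes supervised windows reuse the $2.56\,\mathrm{s}$ SSL window length (64 samples at $25\,\mathrm{Hz}$), which the manuscript does not confirm; the function name, the round-half-up policy, and the dilation parameter `k` are hypothetical choices shown only to make the options concrete.

```python
import math


def make_labels(step_times, window_start, fs=25.0, n_samples=64, k=0):
    """Build a dense 0/1 label vector for one window.

    step_times: step event timestamps in seconds (absolute time).
    window_start: timestamp of the window's first sample, in seconds.
    k: dilation half-width; k=0 gives single-sample impulses, k>0 marks
       samples idx-k .. idx+k as positive.
    """
    labels = [0] * n_samples
    for t in step_times:
        # Explicit round-half-up to the nearest sample index.
        idx = math.floor((t - window_start) * fs + 0.5)
        # Clip to the window so dilated labels from events just outside
        # the boundary still mark the samples that fall inside it.
        for j in range(idx - k, idx + k + 1):
            if 0 <= j < n_samples:
                labels[j] = 1
    return labels


impulse = make_labels([1.0], window_start=0.0)       # single positive sample
dilated = make_labels([1.0], window_start=0.0, k=2)  # five positive samples
print(sum(impulse), sum(dilated))  # 1 5
```

With single-sample impulses, one step in a 64-sample window gives a positive rate of about 1.6%, whereas `k=2` raises it to about 7.8%; an ablation over `k` would show how much this affects optimization and timing accuracy.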
-
Interpretability claims from UMAP risk being overstated and potentially confounded. Coloring embeddings by “contains a step” can show correlation, but separation may be driven by participant identity, activity type, or amplitude/orientation differences (especially wrist), and UMAP is sensitive to hyperparameters. UMAP settings and sampling details are not reported; several figure references are placeholders (Sec. 2.5; Sec. 3: “Figure ??”).
Recommendation: Tighten Sec. 2.5 and related Sec. 3 text: report UMAP hyperparameters (n_neighbors, min_dist, metric, random_state), feature preprocessing, and sampling scheme (how many windows per participant/condition). Add at least one quantitative complement such as linear-probe performance (e.g., logistic regression to classify step/non-step from frozen features), $k$-NN accuracy, or silhouette score. To address confounds, add plots/analyses colored by participant and (if available) activity type. Ensure all referenced figures exist and are numbered consistently (remove “Figure ??” and “not shown” references).
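One possible quantitative complement is sketched below: $k$-NN accuracy on frozen embeddings where neighbors from the same participant are excluded, so that separation cannot be carried by participant identity alone. The function name and the toy data are hypothetical, and this is an illustration of the idea rather than a prescribed protocol.

```python
import math


def knn_accuracy_cross_participant(embeddings, labels, participants, k=1):
    """k-NN accuracy where each window is classified only from windows of
    *other* participants.

    embeddings: list of equal-length feature tuples from the frozen encoder.
    labels: 1 if the window contains a step, else 0.
    participants: participant ID for each window.
    """
    correct = 0
    for x, y, pid in zip(embeddings, labels, participants):
        neighbours = sorted(
            (math.dist(x, e), lab)
            for e, lab, p in zip(embeddings, labels, participants)
            if p != pid  # exclude same-participant windows
        )[:k]
        votes = sum(lab for _, lab in neighbours)
        pred = 1 if 2 * votes > len(neighbours) else 0  # majority vote
        correct += int(pred == y)
    return correct / len(embeddings)


# Toy 2-D embeddings for two hypothetical participants.
emb = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.1), (5.0, 5.1)]
lab = [0, 0, 1, 1, 0, 1]
pid = ["a", "a", "a", "a", "b", "b"]
print(knn_accuracy_cross_participant(emb, lab, pid))  # 1.0
```

Comparing this score against the same statistic computed with participant ID as the target would indicate whether the embedding encodes step content or mostly subject identity.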
-
Demographic subgroup definitions/counts are inconsistent across the manuscript, undermining the stated stratification and any fairness/robustness narrative: Sec. 2.1.2 uses different age bins than Table 2, and sex counts differ by one between Sec. 2.1.2 and Table 2 (Secs. 2.1.2, 2.3.1; Table 2 in Sec. 3). With $n=39$, subgroup comparisons are also likely underpowered without uncertainty reporting.
Recommendation: Standardize age bins and sex counts across Sec. 2.1.2, the stratification description (Sec. 2.3.1), and Table 2 so totals match $n=39$ exactly and definitions are consistent. Once real results exist, report subgroup metrics with uncertainty (e.g., bootstrap CIs) and explicitly note limited power; avoid strong fairness claims unless supported by statistically and practically meaningful evidence.
-
Related work and novelty positioning are not yet cohesive. The manuscript references SSL and step detection in a scattered way but lacks a focused comparison to (i) classical step-event/step-count pipelines (filtering + peak detection, heuristic thresholds), (ii) supervised deep sequence models for gait/step event detection, and (iii) prior SSL-for-HAR/time-series work. This makes it difficult to evaluate what is genuinely new beyond applying a standard SSL + downstream model template (Secs. 1–2).
Recommendation: Add a dedicated Related Work subsection (e.g., Sec. 1.1) that separately covers: step event detection/counting methods (signal-processing and deep learning), SSL for time-series/HAR (contrastive and non-contrastive), and multi-domain/multi-sensor robustness approaches (domain adaptation, multi-rate learning). Then state clearly what this paper contributes (e.g., dense event formulation + multi-fidelity comparison + representation visualization) and calibrate novelty claims accordingly.