-
Step counting from overlapping window predictions is under-specified and may be ill-posed as described, which undermines interpretation of $\text{MAE}/\text{MAPE}$ (Sec. 2.4.2, Sec. 3.3–3.3.2). The paper states that window-level step/non-step outputs are “aggregated” into trial-level step counts, but does not define the aggregation algorithm. With 50% overlap and a “positive if annotation is in the central 25% of the window” labeling rule (Sec. 2.2), naive summation of positive windows will not generally equal the number of steps; different debouncing/peak-picking rules can change $\text{MAE}/\text{MAPE}$ substantially.
Recommendation: In Sec. 2.4.2, provide an explicit, reproducible algorithm (or mathematical definition) mapping window probabilities/labels to step-event counts per trial: threshold choice (fixed vs tuned on validation), how contiguous positives are merged, any refractory period, peak-finding/non-maximum suppression, and how window indices map to time. Define clearly how $\text{MAE}$ and $\text{MAPE}$ are computed/averaged (per-trial then averaged vs global sums; macro vs micro averaging; handling of small denominators). Consider adding an event-based evaluation (step-event precision/recall with a tolerance window) that more directly matches the “central 25%” event-labeling scheme, and/or a regression formulation (steps-per-window / step-rate) as a sensitivity analysis.
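To illustrate the level of detail requested, the following is a minimal sketch of one possible aggregation rule and of the two MAPE averaging conventions. It is not the authors' method. The function names, the threshold of 0.5, the hop of 0.5 s, and the refractory period of 0.3 s are all hypothetical choices made only for this example.

```python
from typing import List, Sequence, Tuple


def count_steps(probs: Sequence[float], threshold: float = 0.5,
                hop_s: float = 0.5, refractory_s: float = 0.3) -> int:
    """Count step events from per-window step probabilities of one trial.

    Assumed rule (hypothetical): contiguous above-threshold windows form one
    run, each run yields one event at its highest-probability window, and an
    event is dropped if it falls within `refractory_s` of the previous one.
    """
    event_times: List[float] = []
    i, n = 0, len(probs)
    while i < n:
        if probs[i] < threshold:
            i += 1
            continue
        j = i
        while j + 1 < n and probs[j + 1] >= threshold:
            j += 1  # extend the run of positive windows
        peak = max(range(i, j + 1), key=lambda k: probs[k])
        t = peak * hop_s  # window index mapped to time via the hop size
        if not event_times or t - event_times[-1] >= refractory_s:
            event_times.append(t)
        i = j + 1
    return len(event_times)


def mae_and_mape(true_counts: Sequence[int],
                 pred_counts: Sequence[int]) -> Tuple[float, float, float]:
    """Return (MAE, macro MAPE in %, micro MAPE in %) over trials.

    Macro: per-trial percentage errors averaged (trials with zero true
    steps are skipped). Micro: summed absolute error over summed true steps.
    """
    abs_err = [abs(p - t) for t, p in zip(true_counts, pred_counts)]
    mae = sum(abs_err) / len(abs_err)
    ratios = [e / t for e, t in zip(abs_err, true_counts) if t > 0]
    macro_mape = 100.0 * sum(ratios) / len(ratios)
    micro_mape = 100.0 * sum(abs_err) / sum(true_counts)
    return mae, macro_mape, micro_mape
```

For the window probabilities `[0.1, 0.8, 0.9, 0.2, 0.7, 0.1]`, naive summation of positive windows gives 3, whereas this rule gives 2. For true counts `[10, 100]` and predictions `[5, 100]`, macro MAPE is 25% while micro MAPE is about 4.5%. Both gaps show why the manuscript needs to state its choices explicitly.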
-
Insufficient architectural and training details prevent replication and make the negative results hard to interpret (Sec. 2.3.1–2.3.2). Critical missing information includes the full 1D-CNN layout (layers, kernel sizes/strides/pooling, activations, normalization, dropout), embedding dimensionality, adversarial head architectures/capacities, optimizer and hyperparameters (learning rate, schedule, batch size, weight decay), number of epochs/steps, early stopping/checkpoint selection (epoch 20 is mentioned, but not how it was selected), and the exact loss formulation/weighting (task vs location/frequency adversaries; GRL $\lambda$ and any $\alpha/\beta$ weights).
Recommendation: Expand Sec. 2.3 with a precise specification of (i) encoder and head architectures (ideally a table of layers/parameters) and (ii) training protocol. In Sec. 2.3.2, write the full objective explicitly (e.g., $L = L_{\rm task} + \alpha\,L_{\rm loc} + \beta\,L_{\rm freq}$ with GRL $\lambda$ and whether $\lambda$ is shared or per-adversary), and state all hyperparameters and selection criteria. Add an Appendix with a full config and/or pseudocode sufficient for exact reproduction.
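As an illustration of what "full objective" means here, one common way to write a gradient-reversal objective with two adversaries is sketched below. The parameter symbols $\theta_f$ (encoder), $\theta_y$ (task head), $\theta_{\rm loc}$ and $\theta_{\rm freq}$ (adversary heads) are notation introduced only for this example, and a single shared $\lambda$ is assumed; the authors' actual formulation may differ.

$$
L = L_{\rm task}(\theta_f,\theta_y) + \alpha\,L_{\rm loc}(\theta_f,\theta_{\rm loc}) + \beta\,L_{\rm freq}(\theta_f,\theta_{\rm freq})
$$

The heads $\theta_y$, $\theta_{\rm loc}$, $\theta_{\rm freq}$ are updated with the ordinary gradients of $L$, while the GRL flips and scales the adversary gradients reaching the encoder, so the encoder update direction is

$$
\frac{\partial L_{\rm task}}{\partial \theta_f} - \lambda\left(\alpha\,\frac{\partial L_{\rm loc}}{\partial \theta_f} + \beta\,\frac{\partial L_{\rm freq}}{\partial \theta_f}\right).
$$

Written this way, it is clear that $\alpha$ and $\beta$ scale both the adversary-head and encoder updates, whereas $\lambda$ affects only the encoder. The manuscript should therefore report all three, along with any schedule applied to $\lambda$.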
-
No baselines and limited ablations make it unclear whether the observed failures are due to adversarial training, model capacity, optimization choices, or dataset difficulty (Sec. 3.3–3.4, Sec. 3.6). The manuscript does not report: (i) a task-only version of the same CNN, (ii) single-adversary variants (location-only, frequency-only), (iii) sensor-specific models (hip-only, wrist-only), or (iv) simple classical/heuristic baselines. Without these, it is hard to attribute performance/invariance outcomes to the GRL strategy or to judge whether 44% $\text{MAPE}$ is expected/competitive.
Recommendation: Add at least: (1) task-only CNN baseline, (2) frequency-adversary-only and location-adversary-only ablations, and (3) a simple baseline (e.g., a classical peak-based step counter or shallow classifier on handcrafted features), plus optionally hip-only/wrist-only specialized models. Report both downstream metrics (Sec. 3.3) and probe metrics (Sec. 3.4) for all baselines. If compute is a constraint, prioritize task-only and single-adversary ablations and clearly qualify conclusions about adversarial learning.
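As a reference point for the "simple baseline" suggested above, a classical peak-based counter can be very small. The sketch below is one hypothetical variant, not a validated pedometer algorithm. It assumes the acceleration magnitude has already been computed per sample, uses the signal mean as the peak threshold, and enforces a minimum inter-step interval of 0.25 s; real recordings would likely also need low-pass filtering first.

```python
import math
from typing import Sequence


def count_steps_peaks(magnitude: Sequence[float], fs_hz: float,
                      min_interval_s: float = 0.25) -> int:
    """Count local maxima of the acceleration magnitude as steps.

    A sample is a candidate peak if it rises above its left neighbour, is
    not exceeded by its right neighbour, and lies above the signal mean.
    Candidates closer than `min_interval_s` to the last accepted peak are
    ignored.
    """
    if len(magnitude) < 3:
        return 0
    threshold = sum(magnitude) / len(magnitude)
    min_gap = int(min_interval_s * fs_hz)  # minimum spacing in samples
    last_peak = None
    count = 0
    for i in range(1, len(magnitude) - 1):
        is_peak = magnitude[i - 1] < magnitude[i] >= magnitude[i + 1]
        if is_peak and magnitude[i] > threshold:
            if last_peak is None or i - last_peak >= min_gap:
                count += 1
                last_peak = i
    return count


# Synthetic check: 4 s of a 2.5 Hz oscillation sampled at 100 Hz.
synthetic = [1.0 + 0.5 * math.sin(math.pi * i / 20) for i in range(400)]
n_steps = count_steps_peaks(synthetic, fs_hz=100)
```

Because the same function accepts any sampling rate through `fs_hz`, it can be run unchanged on the 25 Hz and 100 Hz conditions, which gives a direct, non-learned comparison point for the per-condition results.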
-
Demographic/fairness conclusions are fragile given the very small test set and lack of uncertainty quantification (Sec. 3.1, Sec. 3.5–3.5.2). The paper reports large disparities (e.g., older adults much worse), but with only 6 test participants, subgroup metrics can be dominated by one participant/trial. The manuscript does not report subgroup sample sizes in the test set (participants and trials), participant-level distributions, or confidence intervals.
Recommendation: In Sec. 3.1 and Sec. 3.5, report exact counts: the number of test participants (and trials/windows) per sex and per age bin. Add participant-level plots or summaries (e.g., per-participant $\text{MAE}/\text{MAPE}$) and bootstrap confidence intervals computed by resampling participants rather than windows or trials. If feasible, repeat experiments over multiple participant-level random splits (e.g., 10–50 repeats) or use leave-one-subject-out / group $k$-fold CV, reporting variability for overall, invariance, and subgroup metrics. Temper the fairness claims in Sec. 3.6 and the Conclusions to reflect this uncertainty and the current cohort size.
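A participant-level bootstrap of the kind requested above can be sketched in a few lines. The function name, the percentile method, and the choice to average per-participant means are assumptions made for illustration; the key property is that whole participants are resampled, so correlated trials from one person are never treated as independent.

```python
import random
from typing import Dict, List, Tuple


def participant_bootstrap_ci(errors_by_participant: Dict[str, List[float]],
                             n_boot: int = 2000, alpha: float = 0.05,
                             seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI for the mean of per-participant mean errors.

    `errors_by_participant` maps a participant id to that participant's
    per-trial absolute errors. Participants are resampled with replacement.
    """
    rng = random.Random(seed)
    ids = list(errors_by_participant)
    # Per-participant mean error, computed once.
    p_mean = {pid: sum(v) / len(v) for pid, v in errors_by_participant.items()}
    stats = []
    for _ in range(n_boot):
        sample = [rng.choice(ids) for _ in ids]
        stats.append(sum(p_mean[pid] for pid in sample) / len(sample))
    stats.sort()
    lo_idx = int(n_boot * alpha / 2)
    hi_idx = n_boot - 1 - lo_idx
    return stats[lo_idx], stats[hi_idx]
```

With only six test participants, intervals produced this way are expected to be wide, and wider still within age or sex subgroups. That width is the quantity the fairness discussion should report alongside the point estimates.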
-
Claims about sampling-frequency invariance are not statistically supported and lack appropriate baselines/metrics (Sec. 3.4, Sec. 3.6, Conclusions). A frequency probe accuracy of $\sim 59\%$ against a stated chance level of 50% may or may not be meaningful, depending on class balance and variance. Without a non-adversarial comparison, it is also unclear how much the adversary reduced frequency information.
Recommendation: In Sec. 3.4, report class proportions for the probe datasets and use balanced accuracy and/or AUC in addition to raw accuracy; define the chance baseline explicitly (e.g., majority-class accuracy). Provide uncertainty (bootstrap CIs). Add probe results for a task-only model to quantify how much frequency information is removed by the adversarial objective. Rephrase claims in Sec. 3.6/Conclusions to “partial attenuation” unless the reduction is clearly demonstrated and statistically meaningful.
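The dependence on class balance can be shown with a small example. The helper names below are hypothetical, and the 60/40 class split is invented purely for illustration; the manuscript's actual probe class proportions are not reported.

```python
from collections import Counter
from typing import Hashable, Sequence


def balanced_accuracy(y_true: Sequence[Hashable],
                      y_pred: Sequence[Hashable]) -> float:
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for c in sorted(set(y_true), key=str):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def majority_baseline(y_true: Sequence[Hashable]) -> float:
    """Accuracy obtained by always predicting the most frequent class."""
    return Counter(y_true).most_common(1)[0][1] / len(y_true)


# Invented 60/40 split and a degenerate probe that always says "100Hz".
y_true = ["100Hz"] * 60 + ["25Hz"] * 40
y_pred = ["100Hz"] * 100
raw_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this example the raw accuracy is 0.6, identical to the majority-class baseline, while balanced accuracy is 0.5. A reported probe accuracy near 59% therefore cannot be interpreted until the class proportions are known.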
-
Resampling/upsampling design may confound the “frequency invariance” question (Sec. 2.2, Sec. 3.4). Upsampling 25 Hz windows to 200 samples via interpolation enforces a common input size but also changes signal smoothness/bandwidth and may create artifacts that either hide or introduce frequency cues. This makes “invariance to sampling frequency” hard to disentangle from “invariance to interpolation-induced differences.”
Recommendation: Add a sensitivity analysis comparing alternative standardizations: (i) downsample 100 Hz to 25 Hz, (ii) resample both to a common intermediate rate, and/or (iii) architectures that avoid fixed-length resampling (variable-length models, time encodings). Report how downstream performance and frequency-probe separability change across these preprocessing choices (Sec. 3.3–3.4).
-
Probe methodology details are incomplete, and potential leakage/overfitting in probe training is not ruled out (Sec. 2.4.2, Sec. 3.4). The paper says probes are trained on frozen embeddings, but does not fully specify which splits are used for probe training/validation/testing, whether hyperparameters are tuned, and whether evaluation is strictly on held-out participants. Given a large number of overlapping windows, probes can also be sensitive to subtle artifacts; reporting only accuracy can be misleading under imbalance.
Recommendation: In Sec. 2.4.2, specify the probe protocol precisely: which participant splits are used, whether probes are trained on train (or train+val) embeddings and evaluated on test embeddings, any cross-validation, regularization, and hyperparameter tuning procedure. Ensure the probe is not trained and evaluated on the same embeddings. In Sec. 3.4, report balanced accuracy/AUC and class proportions, and consider reducing dependence among overlapping windows by subsampling windows per trial or aggregating embeddings per step/trial for the probe as a robustness check.
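The two safeguards requested above, participant-disjoint probe splits and thinning of overlapping windows, can be sketched as follows. Both helpers and their default parameters are hypothetical illustrations. If the manuscript's existing train/test participant split is reused for the probe, the first helper is unnecessary, but the split actually used should then be stated.

```python
import random
from typing import Hashable, List, Sequence, Tuple


def split_by_participant(participant_ids: Sequence[Hashable],
                         test_fraction: float = 0.3,
                         seed: int = 0) -> Tuple[List[int], List[int]]:
    """Return (train_idx, test_idx) such that no participant is in both.

    `participant_ids[i]` is the participant who produced embedding i.
    """
    rng = random.Random(seed)
    unique = sorted(set(participant_ids), key=str)
    rng.shuffle(unique)
    n_test = max(1, round(test_fraction * len(unique)))
    test_participants = set(unique[:n_test])
    train_idx = [i for i, p in enumerate(participant_ids)
                 if p not in test_participants]
    test_idx = [i for i, p in enumerate(participant_ids)
                if p in test_participants]
    return train_idx, test_idx


def thin_windows(window_indices: Sequence[int], keep_every: int = 2) -> List[int]:
    """Keep every `keep_every`-th window of one trial.

    With 50% overlap, keeping every second window leaves windows that no
    longer share samples.
    """
    return list(window_indices)[::keep_every]
```

Reporting probe results both with and without thinning would indicate whether the probe accuracies depend on near-duplicate windows.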
-
Table 3 is misformatted and appears to mix step-counting metrics with probe-related columns, making core results difficult to read and potentially error-prone (Sec. 3.3.2–3.4). This is a high-impact presentation issue because it affects the interpretability of the main per-condition results.
Recommendation: Rebuild the tables so that per-condition step-count metrics (hip/wrist $\times$ 25/100 Hz; $\text{MAE}/\text{MAPE}$) are presented cleanly in Table 3 (Sec. 3.3.2), and move probe tasks/accuracies/interpretations into a separate Table 4 (Sec. 3.4). Cross-check that all values quoted in the text match the corrected tables and remove any LaTeX/OCR artifacts causing column mixing.