-
Uncertainty/interval construction is ambiguous and internally inconsistent under overlap, and calibration is not assessed (Sec. 2.4.1–2.4.3; Results Sec. 3.2). The text states an overlap-corrected total prediction $\hat{Y} = (\sum \lambda_i)/2$ (Sec. 2.4.1) but then constructs the Poisson total rate as $\Lambda = \sum \lambda_i$ (Sec. 2.4.3), which would center the interval on a different mean than the reported point estimate. Additionally, 50% overlapping windows violate the independence assumptions typically used to justify summing Poisson variables, and the reported mean 95% interval widths (Table 1 / Sec. 3.2) appear surprisingly narrow given the scale of total steps, raising concern that the implemented formula may not match the written description (or that a heuristic is being used without validation). Finally, the manuscript calls these “confidence intervals” in places, but what is computed appears to be a predictive interval conditional on $\Lambda$.
Recommendation: Make the uncertainty pipeline explicit and consistent in both the mathematics and the implementation: (i) define precisely what $\lambda_i$ represents (expected steps per full window vs. per non-overlapping segment) and how overlap is handled; (ii) ensure the same overlap correction is applied to both the point estimate and the Poisson mean used for intervals (e.g., if $\hat{Y} = \sum\lambda_i/2$, then use $\Lambda = \sum\lambda_i/2$ for the Poisson-based predictive interval), and state how boundary windows are treated; (iii) explicitly acknowledge dependence induced by overlap and label the Poisson-sum interval as an approximation unless you provide a derivation under stated assumptions; (iv) rename to “(95%) prediction interval” unless parameter uncertainty is being estimated; and (v) add an uncertainty calibration analysis in Sec. 3.2/3.4: empirical coverage of nominal 50/80/95% intervals by configuration, plus a check that interval width correlates with absolute error. If coverage is off, consider non-overlapping windows for interval construction, a dispersion/variance inflation factor, or a post-hoc calibration scaling of $\Lambda$, and report coverage and width before and after the adjustment.
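To make points (ii) and (v) concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that the interval is an equal-tailed Poisson prediction interval centered on the same overlap-corrected mean as the point estimate, and that each recording is available as a list of window rates plus a true total. The function names and the `overlap_factor` parameter are hypothetical.

```python
import math


def poisson_quantile(mean, p):
    """Smallest integer k with P(X <= k) >= p for X ~ Poisson(mean)."""
    if mean <= 0:
        return 0
    # Generous upper bound so the loop always terminates.
    k_max = int(mean + 20.0 * math.sqrt(mean) + 50)
    cdf = 0.0
    for k in range(k_max + 1):
        # The pmf is evaluated in log space to avoid underflow for large means.
        cdf += math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))
        if cdf >= p:
            return k
    return k_max


def prediction_interval(window_rates, level=0.95, overlap_factor=2.0):
    """Point estimate and equal-tailed Poisson interval sharing one mean.

    overlap_factor=2.0 mirrors the manuscript's division by 2 for 50% overlap
    (an assumption here; boundary windows are not treated specially).
    """
    total = sum(window_rates) / overlap_factor
    tail = (1.0 - level) / 2.0
    return total, poisson_quantile(total, tail), poisson_quantile(total, 1.0 - tail)


def empirical_coverage(recordings, level):
    """Fraction of recordings whose true total falls inside the interval.

    recordings: list of (window_rates, true_total) pairs.
    """
    hits = 0
    for window_rates, true_total in recordings:
        _, lower, upper = prediction_interval(window_rates, level)
        hits += lower <= true_total <= upper
    return hits / len(recordings)
```

Calling `empirical_coverage` at nominal levels 0.50, 0.80, and 0.95 for each configuration would give the coverage table requested in (v); values well below nominal would indicate that the Poisson variance is too small for these data.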
-
FP/FN definitions are difficult to interpret and likely inconsistent with overlap-corrected aggregation and continuous Poisson-rate outputs (Sec. 2.4.4; referenced in Sec. 3.1–3.3). FP is described as “sum of predicted steps over windows with zero true steps,” and FN as $\sum \max(0, y_i - \hat{\lambda}_i)$. With overlapping windows, the same underlying time/steps contribute to multiple windows, so window-level sums can double-count. Moreover, using $\hat{\lambda}_i$ (a real-valued mean) yields fractional “FP/FN,” which are not event-level false detections/misses in the conventional sense, making cross-configuration comparisons and interpretation (especially wrist vs hip) potentially misleading.
Recommendation: Rework Sec. 2.4.4 to provide physically interpretable and overlap-consistent error decompositions. Options: (a) compute FP/FN on a non-overlapping timeline (e.g., per-sample, per-second, or per non-overlapping window) by reconstructing predicted counts onto unique time bins, then aggregating; (b) if you keep window-level measures, explicitly correct for overlap (and justify the correction) and rename them to avoid event-detection connotations (e.g., “overcount mass in zero-step windows” and “undercount mass” / “count shortfall”). In either case, give explicit formulas with window indices ($\hat{\lambda}_i$, $y_i$) and state whether you interpret these as expected counts. Recompute and update Sec. 3.1–3.3 and Table 1 accordingly, and add a brief note explaining how these quantities relate to bias and absolute error to prevent apparent contradictions (e.g., similar MAE but different over/under counting patterns).
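Option (a) could look roughly like the Python sketch below. It rests on assumptions that the authors would need to confirm or replace: windows overlap by exactly 50%, window $i$ covers half-window bins $i$ and $i+1$, each window's expected count is spread evenly over its two bins, and true step counts per bin are available from the annotations. All names are hypothetical.

```python
def rates_to_bins(window_rates):
    """Map rates of 50%-overlapping windows onto non-overlapping half-window bins.

    Each window contributes half of its rate to each of the two bins it covers;
    a bin's rate is the average over the one or two windows that cover it.
    """
    n = len(window_rates)
    if n == 0:
        return []
    sums = [0.0] * (n + 1)
    cover = [0] * (n + 1)
    for i, lam in enumerate(window_rates):
        for b in (i, i + 1):
            sums[b] += lam / 2.0
            cover[b] += 1
    return [s / c for s, c in zip(sums, cover)]


def over_under_mass(bin_rates, bin_true):
    """Overlap-free count decomposition on unique time bins.

    Returns (overcount mass in zero-step bins, count shortfall, count excess).
    """
    if len(bin_rates) != len(bin_true):
        raise ValueError("bin_rates and bin_true must have the same length")
    pairs = list(zip(bin_rates, bin_true))
    over_zero = sum(r for r, y in pairs if y == 0)
    shortfall = sum(max(0.0, y - r) for r, y in pairs)
    excess = sum(max(0.0, r - y) for r, y in pairs)
    return over_zero, shortfall, excess
```

On this timeline the link to the other metrics is explicit: summed bias equals excess minus shortfall, and summed absolute error equals excess plus shortfall. Two configurations can therefore share a similar MAE while differing in their over- and undercounting pattern.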
-
Table 1 (core quantitative results) is corrupted/incomplete and internally inconsistent with surrounding text (Sec. 3.1). The Hip_100Hz row contains narrative text in a numeric cell; FN/CI-width entries appear missing or misaligned for hip conditions; and at least one FP value conflicts with the text below the table (Hip_25Hz vs Hip_100Hz FP attribution). This prevents reliable verification of key claims (best configuration, CI width comparisons, FP/FN differences).
Recommendation: Rebuild Table 1 directly from the stored per-fold/per-subject logs, ensuring each configuration (Hip/Wrist $\times$ 100/25 Hz) has complete, correctly aligned mean $\pm$ SD entries for MAE, MAPE, bias, FP, FN, and interval width. Remove prose from table cells and place clarifications in the caption or main text. Then audit Sec. 3.1–3.2 and Sec. 4 to ensure every numeric claim (e.g., FP values, CI widths, “more than double”) matches the corrected table.
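One way to guard against misaligned or missing cells is to generate the table programmatically from the per-subject logs, as in the Python sketch below. The log layout (one dictionary per held-out subject with a `config` key and one key per metric) and the metric key names are assumptions made for illustration, not the authors' actual format.

```python
from statistics import mean, stdev

# Hypothetical metric keys; adapt to the names used in the stored logs.
METRICS = ("MAE", "MAPE", "bias", "FP", "FN", "interval_width")


def build_table(rows):
    """Aggregate per-subject rows into {config: {metric: (mean, sd)}}.

    Raises ValueError when any metric is missing for a subject, so that an
    incomplete cell cannot silently enter the table.
    """
    by_config = {}
    for row in rows:
        by_config.setdefault(row["config"], []).append(row)

    table = {}
    for config, group in by_config.items():
        cells = {}
        for metric in METRICS:
            values = [r[metric] for r in group if r.get(metric) is not None]
            if len(values) != len(group):
                raise ValueError(f"missing {metric} values for {config}")
            sd = stdev(values) if len(values) > 1 else 0.0
            cells[metric] = (mean(values), sd)
        table[config] = cells
    return table
```

Every numeric claim in Sec. 3.1–3.2 and Sec. 4 could then be checked against the returned dictionary rather than against a hand-edited table.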
-
Observation model choice (Poisson) is under-justified given overlap, zero-inflation, and likely overdispersion, which directly affects uncertainty claims (Sec. 2.2.2; hinted in Sec. 3.1–3.2). Step counts can be overdispersed relative to Poisson (variance $>$ mean) due to heterogeneous activities, cadence variability, and label noise; and overlapping windows induce strong temporal dependence. Using a Poisson model may understate predictive variance and give misleadingly narrow intervals, especially if the interval is interpreted as “model confidence.”
Recommendation: In Sec. 2.2.2 and Sec. 4 (Limitations), add a concrete diagnostic: report mean/variance of $y_i$ per configuration (and proportion of $y_i = 0$) to assess overdispersion/zero inflation. If overdispersion is present, add at least a sensitivity analysis: Negative Binomial likelihood, quasi-Poisson (variance inflation factor estimated from residuals), or a calibrated dispersion term used when constructing predictive intervals. Explicitly discuss how overlap-induced dependence affects the generative interpretation, and frame the Poisson approach as a pragmatic approximation unless a more principled non-overlapping or point-process formulation is adopted.
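The requested diagnostic and a simple quasi-Poisson adjustment can be sketched in Python as follows. This is an illustrative sketch only: it assumes window-level true counts and strictly positive predicted rates on held-out data (so no degrees-of-freedom correction is applied), and it widens the interval with a normal approximation rather than an exact count distribution. The function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance


def count_diagnostics(y):
    """Mean, sample variance, variance-to-mean ratio, and share of zeros of y."""
    m = mean(y)
    v = variance(y)
    return {
        "mean": m,
        "variance": v,
        "dispersion_index": v / m if m > 0 else math.nan,  # > 1 suggests overdispersion
        "zero_fraction": sum(1 for yi in y if yi == 0) / len(y),
    }


def pearson_dispersion(y, rates):
    """Quasi-Poisson dispersion factor: mean squared Pearson residual.

    Assumes every predicted rate is strictly positive.
    """
    chi2 = sum((yi - li) ** 2 / li for yi, li in zip(y, rates))
    return chi2 / len(y)


def inflated_interval(total_rate, dispersion, level=0.95):
    """Normal-approximation interval with variance dispersion * total_rate."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * math.sqrt(dispersion * total_rate)
    return max(0.0, total_rate - half_width), total_rate + half_width
```

A dispersion factor near 1 would support the Poisson assumption at the window level; a markedly larger value would argue for the Negative Binomial or quasi-Poisson sensitivity analysis. Note that this window-level factor does not by itself account for dependence between overlapping windows.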
-
Missing baseline comparisons make it hard to quantify the incremental benefit of the proposed probabilistic CNN (Sec. 2–3). The experiments compare only variants of the same Poisson-CNN across placements/frequencies. Without (i) a deterministic deep regressor baseline and (ii) a simple classical/heuristic step counter, the reader cannot tell whether the main gain is accuracy, uncertainty quantification, or simply model capacity—and whether uncertainty adds value beyond, e.g., residual-based intervals.
Recommendation: Add at least two baselines under the same LOSO protocol and preprocessing: (i) a deterministic CNN (or LSTM/TCN) trained with MAE/MSE, with uncertainty via residual quantiles or conformal prediction; and (ii) a standard threshold/peak-based step counter (hip/wrist tuned identically across folds if possible). Report the same metrics (MAE/MAPE/bias and your revised FP/FN-style measures) plus Bland–Altman. For uncertainty, compare empirical coverage/width to your Poisson-based intervals. Summarize in Sec. 3.1–3.4 and update Sec. 4 to state clearly what is improved by the probabilistic formulation.
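For the deterministic baseline, a split-conformal interval is one low-cost option. The Python sketch below assumes that absolute errors of total step counts are available for calibration recordings from training subjects that were not used to fit the model within the fold, and that recordings can be treated as exchangeable. The function names are hypothetical.

```python
import math


def conformal_halfwidth(calib_abs_errors, level=0.95):
    """Split-conformal half-width from held-out absolute errors.

    Uses the ceil((n + 1) * level)-th smallest calibration error; returns
    infinity when the calibration set is too small for the requested level.
    """
    n = len(calib_abs_errors)
    rank = math.ceil((n + 1) * level)
    if rank > n:
        return math.inf
    return sorted(calib_abs_errors)[rank - 1]


def conformal_interval(point_estimate, halfwidth):
    """Symmetric interval around a point estimate, truncated at zero counts."""
    return max(0.0, point_estimate - halfwidth), point_estimate + halfwidth
```

Reporting the empirical coverage and mean width of these intervals next to the Poisson-based ones, under the same LOSO folds, would show whether the probabilistic formulation adds value beyond residual-based uncertainty.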
-
Methodological details are insufficient for reproducibility and for evaluating overfitting/leakage risk under LOSO (Sec. 2.1–2.3). Key missing items include: full CNN layer-by-layer specification; preprocessing (normalization, gravity removal, filtering); downsampling/anti-aliasing and time alignment of annotations at 25 Hz; training hyperparameters (batch size, optimizer settings, learning rate schedule, epochs); early stopping criterion/patience; random seeds; and the exact within-fold train/validation split (and confirmation that the held-out subject is not used for validation).
Recommendation: Expand Sec. 2.1–2.3 with an implementation-ready description: (i) device/hardware details and accelerometer range; (ii) filtering and downsampling procedure (explicitly state any anti-alias filtering), and how step annotations are mapped to samples/windows after downsampling; (iii) preprocessing per axis (standardization, gravity removal, coordinate handling); (iv) exact CNN architecture (filters, kernel sizes, strides/padding, pooling, dense units, dropout/BN); (v) training details (optimizer, LR, batch size, epochs, early stopping metric/patience); (vi) within-fold validation scheme (window-level split across training subjects vs. subject-level split), guaranteeing no leakage from the test subject; and (vii) a code/data availability statement (Sec. 2.5 or Sec. 4) together with the random seeds used, so that results can be reproduced.
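Point (vi) can be stated unambiguously with a short fold generator such as the Python sketch below. It illustrates a subject-level validation split, which is one of the two schemes named above and not necessarily what the authors used; `n_val_subjects` and `seed` are hypothetical parameters.

```python
import random


def loso_folds(subject_ids, n_val_subjects=4, seed=0):
    """Yield (test_subject, train_subjects, val_subjects) for each LOSO fold.

    Validation subjects are drawn only from the non-test subjects, so the
    held-out subject never influences early stopping or model selection.
    """
    subjects = sorted(set(subject_ids))
    rng = random.Random(seed)  # fixed seed makes the splits reproducible
    for test_subject in subjects:
        remaining = [s for s in subjects if s != test_subject]
        rng.shuffle(remaining)
        val_subjects = sorted(remaining[:n_val_subjects])
        train_subjects = sorted(remaining[n_val_subjects:])
        yield test_subject, train_subjects, val_subjects
```

Listing the resulting subject assignment per fold in the supplement would let readers verify directly that no test-subject windows enter training or validation.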
-
Statistical testing/reporting is incomplete given the number of comparisons and the manuscript’s reliance on p-values for conclusions (Sec. 2.5; Sec. 3.3; Abstract/Conclusion). Tables 2–3 reportedly include test statistics but omit p-values/directions; effect sizes and confidence intervals are not provided; and no multiple-comparison correction is described despite many hypothesis tests across multiple metrics/configurations/subgroups. Additionally, the Abstract/Introduction/Conclusion cite specific p-values and mention age-related Kruskal–Wallis tests even though the age analysis is not completed (Sec. 3.5.2).
Recommendation: Revise Sec. 2.5 and Sec. 3.3–3.5 to (i) report effect sizes (e.g., rank-biserial correlation or Cliff’s delta) and median paired differences with confidence intervals for key comparisons; (ii) add p-values and the direction of effects to Tables 2–3; and (iii) apply and disclose a multiple-testing correction (e.g., Holm) within coherent families (by metric or by research question), or clearly state that results are exploratory with unadjusted p-values and temper claims accordingly. Ensure every p-value mentioned in the Abstract/Introduction/Conclusion is traceable to a specific test in Sec. 3, and remove/relocate any age-related inferential claims unless the analysis is actually performed.
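Items (i) and (iii) require little extra code. The Python sketch below implements Holm-adjusted p-values and Cliff's delta using only the standard library; the function names are hypothetical. Cliff's delta as written treats the two samples as independent groups, so for paired LOSO comparisons a matched-pairs effect size would be the more appropriate choice.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is scaled by (m - k + 1).
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity of adjusted p-values along the sorted order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


def cliffs_delta(a, b):
    """Cliff's delta: P(a > b) - P(a < b) over all cross-group pairs."""
    greater = sum(1 for x in a for y in b if x > y)
    less = sum(1 for x in a for y in b if x < y)
    return (greater - less) / (len(a) * len(b))
```

Applying `holm_adjust` separately within each family of tests (for example, all pairwise configuration comparisons for one metric) and reporting both raw and adjusted p-values in Tables 2–3 would make the correction transparent.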
-
Age-related subgroup analysis is inconsistent/unfinished but still appears in the methods narrative and figures, risking reader confusion and over-claiming (Sec. 2.1.3; 2.5.2; 3.5.2; Fig. 6). The manuscript describes age stratification and cites a figure outlining age analyses, yet later states age metadata were incorrect and analysis could not be completed.
Recommendation: Choose one of two clean resolutions: (a) repair the age metadata pipeline and rerun the planned age-stratified analyses (including explicit group sizes and tests), updating Fig. 6 with real results; or (b) remove age-analysis claims from the Abstract/Introduction/Conclusion and reposition Fig. 6 as a schematic in Methods/Supplement clearly labeled as “planned future analysis.” In Sec. 3.5.2, describe the metadata issue precisely and confirm which covariates (e.g., sex) are unaffected.
-
Scope/generalizability claims are stronger than the data support (Introduction; Sec. 4). The dataset is moderate in size (39 participants), with recordings of roughly 58 minutes, and may not reflect true multi-day free-living variability, device heterogeneity, or broader population diversity. Given the approximations in the overlap handling and the Poisson-based uncertainty, claims that the framework provides robust free-living guidance should be tempered.
Recommendation: Add a dedicated Limitations paragraph (Sec. 4) that explicitly covers dataset size, recording duration, controlled vs truly free-living conditions, single-device/single-cohort constraints, and the modeling approximations (Poisson, overlap dependence). Rephrase any strong prescriptive statements (“critical insights,” “25 Hz largely sufficient”) to be conditional on this dataset and evaluation setting, and highlight needed external validation.