-
Outcome (misreporting) measurement validity and reproducibility are currently the largest threat to the paper’s contribution. The dependent variable appears to rely on self-admitted concealment/exaggeration, but the exact item wording, response options, and classification rules into the three mutually exclusive categories (Transparent vs Concealment vs Performative) are under-specified (Sec. 2.2.1; Sec. 3.2–3.3). This also creates a conceptual ambiguity: “Transparent Reporting” may effectively mean “does not admit misreporting,” not necessarily accurate disclosure. Social desirability and introspective limits may bias both prevalence and estimated associations, and could affect the interpretation of coefficients as ‘drivers’ of misreporting rather than correlates of willingness to admit misreporting (Abstract; Introduction; Sec. 3.3–3.6).
Recommendation: In Sec. 2.2.1, fully document the DV: exact question(s), response scale/anchors, whether items were single- or multi-item, and the deterministic rules/thresholds used to map responses into the three categories (including how ties/ambiguous responses were handled). Add an appendix table listing DV items and coding logic. In Sec. 4 (limitations), explicitly discuss (i) social desirability/common-method concerns, (ii) the possibility that the baseline category captures ‘non-admission’ rather than truth, and (iii) the likely direction(s) of bias. If feasible with existing data, add robustness checks aligned with the conceptual structure: e.g., a two-stage/nested formulation (misreport vs not; then conceal vs perform) or alternative codings (e.g., collapsing to misreport vs transparent; or treating intensity/ordinality if the response scale supports it), and report whether key conclusions hold.
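The two-stage alternative coding suggested above can be sketched as follows. This is only an illustration of the recoding step, not the authors' procedure: the category labels and the function name `two_stage_recode` are hypothetical, and the sketch assumes each respondent has already been assigned exactly one of the three categories.

```python
# Hypothetical labels for the three mutually exclusive DV categories.
TRANSPARENT = "transparent"
CONCEALMENT = "concealment"
PERFORMATIVE = "performative"


def two_stage_recode(outcomes):
    """Split a three-category outcome into two binary outcomes.

    Stage 1: misreport (1) vs transparent (0), defined for every respondent.
    Stage 2: performative (1) vs concealment (0), defined only for
             misreporters; transparent respondents get None and would be
             dropped from the stage-2 model.
    """
    stage1, stage2 = [], []
    for y in outcomes:
        if y not in (TRANSPARENT, CONCEALMENT, PERFORMATIVE):
            raise ValueError(f"unexpected category: {y!r}")
        misreport = int(y != TRANSPARENT)
        stage1.append(misreport)
        stage2.append(int(y == PERFORMATIVE) if misreport else None)
    return stage1, stage2


sample = ["transparent", "concealment", "performative", "transparent", "concealment"]
s1, s2 = two_stage_recode(sample)
print(s1)  # [0, 1, 1, 0, 1]
print(s2)  # [None, 0, 1, None, 0]
```

Fitting a binary logit to each stage and comparing signs and magnitudes with the multinomial estimates would show whether the key conclusions depend on the three-category coding.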
-
Key predictor measurement/coding is insufficiently detailed for replication and interpretability, especially for central theoretical constructs: Perceived AI Error Frequency, Job Security Confidence, Agentic AI Deployment, and Intrinsic/Extrinsic Rewards (Sec. 2.2–2.2.2; Sec. 3.3–3.5). Several are described only as numeric/categorical/binary without item wording, scale ranges, anchors, reference categories, or examples—yet the paper’s story depends on how respondents interpreted terms like “agentic AI,” “intrinsic rewards,” and “extrinsic rewards” (Sec. 1; Sec. 3.4–3.5).
Recommendation: Expand Sec. 2.2–2.2.2 with a measurement/codebook-style description for every predictor: item wording, response options (including anchors), coding (including reference categories for categorical variables), and any transformations (standardization/centering). For complex constructs (agentic AI; intrinsic/extrinsic rewards), include brief definitions and concrete examples exactly as presented to respondents; clarify whether items refer to team vs organization vs tool-level deployment. Report prevalence rates (percent “yes”) for binary items in Sec. 3.1 or at the start of Sec. 3.4–3.5 to contextualize interactions and null findings.
-
Causal/mechanistic language and policy claims are too strong for a cross-sectional, same-source survey with modest explained variance (pseudo-$R^2 \approx .078$–$.086$) and plausible endogeneity/common-method bias (Abstract; Introduction; Sec. 3.3–3.6; Sec. 4). For example, claims that Foundational Support ‘mitigates’ misreporting or is ‘more effective’ than culture may be confounded by unobserved organizational maturity, leadership quality, compliance posture, AI governance, role mix, and individual traits (conscientiousness/anxiety) that influence both perceived support and reporting behavior.
Recommendation: Across Abstract/Introduction/Results/Conclusion, systematically replace causal phrasing (“drives,” “mitigates,” “amplifies,” “effective tool,” “most powerful”) with associational language (“is associated with,” “predicts,” “is consistent with”). Add a clearly labeled limitations subsection in Sec. 4 that foregrounds endogeneity, common-method bias, and omitted variables. If feasible with the dataset, strengthen identification/robustness by (i) adding richer controls (industry, job family, tenure, job level/manager status, AI tool type, AI maturity/governance proxies), (ii) clustering standard errors at the organization level (and/or including organization fixed effects if multiple respondents per firm and firm identifiers exist), and (iii) including a simple sensitivity analysis (e.g., Oster-style bounds/E-values) to quantify how strong unobserved confounding would need to be to eliminate the key Foundational Support association.
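For the sensitivity analysis in (iii), an E-value is simple to compute. The sketch below uses the standard formula $E = RR + \sqrt{RR\,(RR-1)}$ for a risk ratio $RR \ge 1$ (with $RR$ inverted first when it is below 1). Treating an exponentiated multinomial-logit coefficient as an odds ratio, and converting it with the square-root approximation for common outcomes, are assumptions made here for illustration; the function names are hypothetical.

```python
import math


def e_value(risk_ratio):
    """E-value for a risk ratio: the minimum strength of association an
    unmeasured confounder would need with both exposure and outcome to
    fully explain away the observed association."""
    if risk_ratio <= 0:
        raise ValueError("risk ratio must be positive")
    rr = risk_ratio if risk_ratio >= 1 else 1.0 / risk_ratio
    return rr + math.sqrt(rr * (rr - 1.0))


def e_value_from_odds_ratio(odds_ratio, rare_outcome=False):
    """Approximate E-value from an odds ratio.

    Assumption: for a common outcome the risk ratio is approximated by
    sqrt(OR); for a rare outcome the OR is used directly.
    """
    rr = odds_ratio if rare_outcome else math.sqrt(odds_ratio)
    return e_value(rr)


print(round(e_value(2.0), 2))  # 3.41
```

Reporting the E-value for both the point estimate and the confidence limit closest to the null would let readers judge how fragile the Foundational Support association is.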
-
Construction and conceptual separation of ‘Foundational Support’ (structural) vs cultural indices (Learning Safety; Candid Communication) is under-specified and may not cleanly support the “structure beats culture” conclusion (Sec. 2.2.2; Sec. 3.2; Sec. 3.6; Sec. 4). PCA is used after CFA non-convergence, but the manuscript does not provide full item lists, loadings, cross-loadings, dimensionality checks, or reliability statistics for all indices. The first component explaining $\sim 31\%$ variance suggests potential multidimensionality; overlap between structural and cultural items could also attenuate or mask cultural effects (including the null for learning safety).
Recommendation: In Sec. 2.2.2 and Sec. 3.2, report (preferably in an appendix) the full item lists, response scales, factor/PCA loadings (and cross-loadings if applicable), communalities, and Cronbach’s $\alpha$ and McDonald’s $\omega$ for Foundational Support, Learning Safety, and Candid Communication. Justify PCA vs EFA vs averaged-scale construction given the theoretical intent (latent construct vs index). To substantiate the structure/culture distinction, run an EFA (or factor analysis) including all relevant items together and report whether distinct factors emerge; if they do not, qualify the structural-vs-cultural conclusion in Sec. 3.6 and Sec. 4 as contingent on these operationalizations.
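As a reference point for the reliability statistics requested above, Cronbach's $\alpha$ for a $k$-item index is $\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_j s_j^2}{s_T^2}\right)$, where $s_j^2$ is the variance of item $j$ and $s_T^2$ the variance of the summed score. A minimal sketch, assuming complete responses stored as one list per item (the function name and data layout are hypothetical):

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a list of k items, each a list of scores
    with one entry per respondent (no missing values)."""
    k = len(items)
    if k < 2:
        raise ValueError("alpha needs at least two items")
    n = len(items[0])
    if any(len(col) != n for col in items):
        raise ValueError("all items must have the same number of respondents")
    # Summed scale score for each respondent.
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(variance(col) for col in items)
    return (k / (k - 1)) * (1.0 - item_variance_sum / variance(totals))
```

Because $\alpha$ assumes equal loadings, reporting $\omega$ from the fitted factor model alongside it would be more informative for indices built from PCA.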
-
Model specification reporting is incomplete, limiting evaluation and replication (Sec. 2.3–2.4; Sec. 3.3–3.6.1). The staged models are described narratively without a compact specification table; reference categories for categorical predictors are not always explicit; and readers do not have full coefficient/SE output for the multinomial models. Standardization is described as “fully standardized,” but the procedure is not defined for multinomial logit (Secs. 2.4 and 3.6; Figure 3).
Recommendation: Add (i) a model-specification table in Sec. 2.3–2.4 enumerating each stage’s predictors, coding, and reference categories, and (ii) full regression tables (main text or appendix) for every multinomial model: coefficients, SEs, $z$, $p$, $N$, log-likelihoods, pseudo-$R^2$, and outcome baseline category (Transparent). Define “fully standardized” precisely (which variables were $z$-scored; how binary/categorical predictors were handled; whether centering was applied; any rescaling of coefficients) and provide the formula used for Figure 3’s standardized coefficients.
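For concreteness, the kind of statement requested could take the following form. This is one common convention, offered as an assumption rather than a description of what the authors did. With Transparent as the baseline outcome, the multinomial logit for respondent $i$ is

$$
\log \frac{\Pr(Y_i = k \mid \mathbf{x}_i)}{\Pr(Y_i = \text{Transparent} \mid \mathbf{x}_i)} = \alpha_k + \mathbf{x}_i^{\top} \boldsymbol{\beta}_k, \qquad k \in \{\text{Concealment}, \text{Performative}\},
$$

and an $x$-standardized coefficient for continuous predictor $j$ is

$$
\beta^{*}_{jk} = \beta_{jk}\, s_{x_j},
$$

where $s_{x_j}$ is the sample standard deviation of $x_j$. Under this convention $\beta^{*}_{jk}$ is the change in the log-odds of category $k$ versus Transparent per one-standard-deviation increase in $x_j$. If the authors instead also standardized on a latent outcome scale, the variance used for that scale should be stated explicitly, since it is not uniquely defined in the multinomial case.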
-
The multinomial logit IIA assumption and the reported Hausman–McFadden result (“HM Stat = 0.0, $p = 1.0$”) are not adequately explained given the conceptual similarity of concealment and performative disclosure as related ‘misreporting’ choices (Sec. 2.4; Sec. 3.6.1). The HM statistic can be numerically unstable/uninformative; without details (which alternative omitted, any warnings), it is hard to assess robustness.
Recommendation: In Sec. 3.6.1 (or Sec. 2.4), explicitly document the IIA test procedure: which category was dropped, the exact test statistic format ($\chi^2(df)$), and whether any numerical issues occurred. Add at least one robustness check that relaxes IIA and/or matches the paper’s conceptual structure: a nested logit (stage 1: misreport vs transparent; stage 2: conceal vs perform), multinomial probit, or a two-equation approach. If infeasible, acknowledge IIA as a limitation and argue substantively why modest violations are unlikely to change the headline findings.
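To make the requested documentation concrete, the Hausman–McFadden statistic compares coefficients from the full model with those from a model re-estimated after dropping one alternative: $HM = (\mathbf{b}_r - \mathbf{b}_f)^{\top} (\mathbf{V}_r - \mathbf{V}_f)^{-1} (\mathbf{b}_r - \mathbf{b}_f)$, referred to a $\chi^2$ distribution with degrees of freedom equal to the number of compared coefficients. The sketch below is a plain-Python illustration under the assumption that the coefficient vectors and covariance matrices have already been extracted and aligned; the function names are hypothetical.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance difference is (near) singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def hausman_mcfadden(b_restricted, v_restricted, b_full, v_full):
    """Return (statistic, df) for the Hausman-McFadden IIA test.

    b_* are coefficient lists for the alternatives present in both models;
    v_* are the matching covariance matrices (lists of rows).
    """
    diff = [r - f for r, f in zip(b_restricted, b_full)]
    v_diff = [[vr - vf for vr, vf in zip(row_r, row_f)]
              for row_r, row_f in zip(v_restricted, v_full)]
    weighted = solve_linear(v_diff, diff)
    stat = sum(d * w for d, w in zip(diff, weighted))
    return stat, len(diff)
```

Because $\mathbf{V}_r - \mathbf{V}_f$ need not be positive definite in finite samples, the statistic can come out negative or degenerate. A reported value of exactly 0.0 with $p = 1.0$ is consistent with that situation, which is why the raw statistic, the degrees of freedom, and any solver warnings should be reported.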
-
The ‘agentic AI deployment’ interaction is central to the narrative (Agentic Shift), but construct validity and interpretation need strengthening (Sec. 1; Sec. 3.4). A binary indicator with unclear scope (team vs organization) and respondent knowledge may be misclassified; agentic deployment likely correlates with AI maturity, governance, and role types—plausible confounds for concealment and job insecurity.
Recommendation: Provide the exact definition and item wording for agentic deployment (Sec. 2.2.2), including examples of “agentic” vs non-agentic tools and the organizational level referenced. In Sec. 3.4, report subgroup sizes/prevalence and present marginal effects or predicted probabilities (with confidence bands) for the interaction (error frequency $\times$ agentic) to make the substantive size interpretable. Add robustness checks with additional controls (industry/job family/job level/AI maturity proxies) and/or organization-level clustering/FE (if available) to show the interaction is not an artifact of correlated organizational characteristics.
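The predicted probabilities requested for Sec. 3.4 follow directly from the fitted coefficients. A minimal sketch, assuming a model with an intercept, error frequency, the agentic indicator, and their product, with Transparent as the baseline; the function names are hypothetical and any coefficient values passed in are supplied by the user, not taken from the manuscript:

```python
import math


def design_row(error_freq, agentic):
    """Covariate row: intercept, error frequency, agentic (0/1), interaction."""
    return [1.0, float(error_freq), float(agentic), float(error_freq) * agentic]


def predicted_probs(x, coefs, baseline="transparent"):
    """Multinomial-logit probabilities for one covariate row.

    coefs maps each non-baseline category to its coefficient list (same
    order as x); the baseline's linear predictor is fixed at zero.
    """
    scores = {baseline: 0.0}
    for category, beta in coefs.items():
        scores[category] = sum(b * xi for b, xi in zip(beta, x))
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}
```

Evaluating `predicted_probs(design_row(e, a), coefs)` over a grid of error-frequency values for `a = 0` and `a = 1` gives the two curves to plot; confidence bands would additionally require the delta method or a bootstrap over the estimated coefficients, which this sketch does not attempt.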
-
The incentives result (intrinsic rewards associated with performative disclosure; extrinsic not significant) is intriguing but currently under-identified and risks over-interpretation as “innovation theater” (Sec. 3.5; Sec. 4). Intrinsic rewards may proxy AI discourse salience or innovation-oriented teams where overclaiming is more visible/admitted; the null result for extrinsic rewards may reflect low prevalence or coarse (binary) measurement; reverse causality is also plausible (people who overclaim may perceive/endorse intrinsic rewards).
Recommendation: In Sec. 3.5, report prevalence for intrinsic/extrinsic reward measures and provide effect sizes as predicted probabilities (not just log-odds). Temper mechanism claims in Sec. 4 to “consistent with” rather than definitive theater/strategic gaming. If feasible, (i) test interactions (e.g., intrinsic rewards $\times$ job security; intrinsic $\times$ cultural climate) to assess boundary conditions; (ii) add controls for organizational AI maturity and role type; and (iii) discuss reverse-causality possibilities explicitly in limitations.