-
Temporal availability / leakage risk: the manuscript repeatedly frames features as “known at admission,” but the diagnoses used come from discharge abstracts and are typically coded over the entire stay. Even with POA indicators, the *presence and count* of recorded diagnoses (including up to 24 secondary diagnoses) can reflect post-admission events, complications, documentation intensity, and length of stay itself. In particular, NUM_POA_Y_DIAGNOSES and the inclusion of all secondary diagnoses may encode severity/coding intensity that is not truly available at admission, which may inflate performance or shift the interpretation from early prediction to retrospective risk adjustment (Sec. 1, Sec. 2.3.2, Sec. 2.6, Sec. 4).
Recommendation: Clarify in Sec. 1 and Sec. 2.2 what “available at admission” means in the context of PUDF discharge abstracts, and explicitly position the use case as either (a) early risk prediction or (b) retrospective risk adjustment. Add sensitivity analyses in Sec. 3.4 that restrict diagnosis inputs to plausibly admission-known subsets (e.g., principal diagnosis only; principal + POA=$'Y'$ only; or excluding diagnoses with POA$\neq'Y'$ or missing POA) and report how mortality/PLOS performance changes. Discuss coding/documentation processes and how they could bias NUM_POA_Y_DIAGNOSES (Sec. 4).
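To make the requested sensitivity analysis concrete, the following is a minimal sketch of how the diagnosis inputs could be restricted per variant. The record layout (`principal`, `secondary` as `(code, poa)` pairs) and the variant names are hypothetical and are not taken from the manuscript or the PUDF schema.

```python
from typing import Dict, List


def restrict_diagnoses(record: Dict[str, object], variant: str) -> List[str]:
    """Return the diagnosis codes allowed under one sensitivity variant.

    `record` is a hypothetical layout: {"principal": str,
    "secondary": [(code, poa_flag), ...]}. It is an assumption for
    illustration, not the actual PUDF field structure.
    """
    principal = record["principal"]
    secondary = record["secondary"]
    if variant == "principal_only":
        return [principal]
    if variant == "principal_plus_poa_y":
        # Drop any secondary diagnosis whose POA flag is not exactly 'Y'
        # (this also drops missing/blank flags).
        return [principal] + [code for code, poa in secondary if poa == "Y"]
    if variant == "all":
        # Reference setting: every coded diagnosis, regardless of POA.
        return [principal] + [code for code, _ in secondary]
    raise ValueError(f"unknown variant: {variant}")


example = {
    "principal": "I50",
    "secondary": [("E11", "Y"), ("J18", "N"), ("N17", ""), ("I10", "Y")],
}
for name in ("principal_only", "principal_plus_poa_y", "all"):
    print(name, restrict_diagnoses(example, name))
```

Re-running the downstream models once per variant and tabulating AUC-ROC/PR-AUC side by side would show how much performance depends on diagnoses that may not be known at admission.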
-
The discrepancy in downstream mortality performance strongly suggests a pipeline/implementation problem: the Transformer proxy model reports moderate PR-AUC on validation ($\sim 0.37$ in Sec. 3.3), yet downstream “Attention” models show extremely low test PR-AUC ($\approx 0.036$–$0.056$ in Sec. 3.4), worse than the non-diagnosis baseline. Such a large drop is atypical if the embeddings preserve useful signal, and it raises concerns about misaligned embeddings/labels, split contamination, masking/padding effects in pooling, feature scaling/concatenation issues, or accidental train/test mix-ups (Sec. 2.4–2.7, Sec. 3.3–3.4).
Recommendation: Add explicit sanity checks and diagnostics: (i) train a simple classifier using *only* the frozen patient embedding to predict mortality on the same split and compare its AUC-ROC/PR-AUC to the proxy head (Sec. 3.3–3.4); (ii) verify embedding–label alignment after joins/merges (hash/ID checks) and confirm embeddings are generated separately for train/val/test using the Transformer fit only on the training split (Sec. 2.7); (iii) report embedding norms, fraction of near-constant dimensions, and whether PAD tokens are masked in attention and pooling (Sec. 2.4.3); (iv) ensure downstream preprocessing (scaling/encoding) is fit on training only and applied to val/test. If a bug is found, re-run Sec. 3.4 and update conclusions accordingly.
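Check (ii) could be as simple as the sketch below, which compares row order between the embedding table and the label table and tests the splits for shared record IDs. All function names are hypothetical, and the approach assumes each record carries a stable identifier through every join.

```python
import hashlib
from typing import Dict, Hashable, List, Sequence, Set


def id_digest(ids: Sequence[Hashable]) -> str:
    """Order-sensitive fingerprint of an ID column, for logging before/after joins."""
    h = hashlib.sha256()
    for record_id in ids:
        h.update(str(record_id).encode("utf-8"))
        h.update(b"\x00")  # separator so ("1", "23") differs from ("12", "3")
    return h.hexdigest()


def misaligned_rows(embedding_ids: Sequence[Hashable],
                    label_ids: Sequence[Hashable]) -> List[int]:
    """Return row positions where the embedding ID and the label ID differ."""
    if len(embedding_ids) != len(label_ids):
        raise ValueError("embedding and label tables have different lengths")
    return [i for i, (a, b) in enumerate(zip(embedding_ids, label_ids)) if a != b]


def split_overlap(train_ids, val_ids, test_ids) -> Dict[str, Set[Hashable]]:
    """Return record IDs shared between any two splits (all sets should be empty)."""
    train, val, test = set(train_ids), set(val_ids), set(test_ids)
    return {
        "train_val": train & val,
        "train_test": train & test,
        "val_test": val & test,
    }


print(misaligned_rows([101, 102, 103], [101, 103, 102]))
print(split_overlap([1, 2], [3], [2, 4]))
```

Logging `id_digest` for the embedding and label ID columns immediately before model fitting gives a cheap, auditable confirmation that the two were not silently reordered.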
-
Data flow and splitting strategy are ambiguous across proxy training and downstream tasks, making it difficult to assess leakage and fairness. Sec. 3.1 mentions a 70/15/15 split on the 1% subsample, but it is unclear whether this split is used for (a) Transformer proxy training, (b) downstream mortality/PLOS training, and (c) baseline feature selection (top-$N$ diagnoses, target encoding). Without a single consolidated description, readers cannot verify that (i) the Transformer never “sees” validation/test labels or examples during representation learning, and (ii) feature engineering (top-$N$ selection, target encoding) is fit only on the training portion (Sec. 2.4–2.7, Sec. 3.1–3.4).
Recommendation: Add a single schematic/table (Sec. 2.7) specifying for each stage (Transformer proxy, embedding generation, downstream LR/XGBoost, Baseline 2 code selection, target encoding): exact split sizes, stratification, and what is fit on train vs applied to val/test. Explicitly state that (1) embeddings for val/test are generated by a Transformer trained only on the training subset; (2) top-$N$ code lists are derived from training only (Sec. 2.6.2); and (3) target encoding maps (PAT_COUNTY, ZIP3) are learned on training only, with smoothing, and then applied to val/test.
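For point (3), a leakage-safe target encoder might look like the following sketch. The additive smoothing formula and the default `smoothing` value are assumptions chosen for illustration; the manuscript does not state which scheme it uses.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def fit_target_encoding(categories: Sequence[Hashable],
                        labels: Sequence[int],
                        smoothing: float = 20.0) -> Tuple[Dict[Hashable, float], float]:
    """Learn a smoothed category -> mean-label map from TRAINING rows only.

    encoded(c) = (sum_c + smoothing * global_mean) / (n_c + smoothing),
    so rare categories shrink toward the global training mean.
    """
    global_mean = sum(labels) / len(labels)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, labels):
        sums[cat] += y
        counts[cat] += 1
    mapping = {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }
    return mapping, global_mean


def apply_target_encoding(categories: Sequence[Hashable],
                          mapping: Dict[Hashable, float],
                          global_mean: float) -> List[float]:
    """Encode val/test rows; categories unseen in training get the global mean."""
    return [mapping.get(cat, global_mean) for cat in categories]


# Toy example: fit on training rows, then apply to held-out rows.
train_county = ["A", "A", "B", "B"]
train_died = [1, 0, 0, 0]
mapping, prior = fit_target_encoding(train_county, train_died, smoothing=2.0)
print(apply_target_encoding(["A", "B", "C"], mapping, prior))
```

The key property to document in Sec. 2.7 is that `fit_target_encoding` never receives validation or test labels.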
-
Severe subsampling ($1\%$ / $31,\!102$ records) is a central experimental constraint, yet conclusions sometimes read as general statements about attention models vs feature engineering rather than about this resource-constrained regime. Given Transformer capacity, training on $31$k examples for 3 epochs may be underpowered; conversely, strong tabular baselines may improve substantially when trained on the full $3.1$M records. Without any scaling experiment, it is unclear whether the negative result reflects the architecture, the training regime, or sample size (Sec. 2.1, Sec. 3.3–3.4, Sec. 4).
Recommendation: Make the limited-scale nature explicit in Sec. 2.1 and Sec. 4 and temper general claims accordingly. If feasible, train Baseline 1 and Baseline 2 on the full dataset to provide a realistic reference point. Add at least a small scaling study for the Transformer (e.g., $1\%$ vs $2\%$ vs $5\%$ subsamples, or a smaller $d_\text{model}$ / fewer layers enabling more data) with downstream results in Sec. 3.4, or include a quantitative discussion (parameter counts, training steps, expected sample complexity) if additional experiments are infeasible.
-
Transformer modeling choices are under-specified and, in places, internally ambiguous: ordering of diagnoses (set vs sequence), positional encodings, padding masks in attention and mean pooling, and the dimensionality path from concatenated diagnosis+POA embeddings to the Transformer $d_\text{model}$ and final $160$-D patient embedding are not fully defined. These details are crucial both for reproducibility and for interpreting why a Transformer would help on essentially set-like inputs (Sec. 2.3.2, Sec. 2.4.1–2.4.3).
Recommendation: In Sec. 2.4.1–2.4.3, fully specify: (i) the token ordering rule (e.g., principal first, then other diagnoses in file order OTH1..OTH24) and whether this order has semantics; (ii) whether positional embeddings are used and of what type; (iii) attention masking and masked-mean pooling over non-PAD tokens; and (iv) the exact dimensions ($D_\text{dx}$, $D_\text{poa}$, concatenation dimension, any projection layer, $d_\text{model}$, and the mapping to the $160$-D pooled embedding). Consider adding a permutation-invariant baseline (e.g., DeepSets / Set Transformer / attention pooling) or justify why sequence modeling is appropriate here.
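The effect of item (iii) is easy to demonstrate. The sketch below uses plain Python lists rather than the authors' framework, and the function names are hypothetical; it contrasts masked-mean pooling with a naive mean that also averages over PAD positions.

```python
from typing import List, Sequence


def masked_mean_pool(token_vectors: Sequence[Sequence[float]],
                     is_pad: Sequence[bool]) -> List[float]:
    """Average token vectors over non-PAD positions only."""
    if len(token_vectors) != len(is_pad):
        raise ValueError("one mask entry is required per token")
    kept = [vec for vec, pad in zip(token_vectors, is_pad) if not pad]
    if not kept:
        raise ValueError("sequence contains only PAD tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def naive_mean_pool(token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average over ALL positions, PAD included (shown only for contrast)."""
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]


# Two real diagnosis tokens followed by two PAD tokens.
tokens = [[2.0, 4.0], [4.0, 8.0], [0.0, 0.0], [0.0, 0.0]]
pad = [False, False, True, True]
print(masked_mean_pool(tokens, pad))  # [3.0, 6.0]
print(naive_mean_pool(tokens))        # [1.5, 3.0]
```

With unmasked pooling, the patient embedding scales with the number of recorded diagnoses, so patients with few codes are pulled toward the PAD vector. This is one plausible mechanism for degraded downstream signal and is worth ruling out explicitly.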
-
Proxy-task design is not well justified and may be mismatched to downstream evaluation: the encoder is trained only on mortality (proxy) and then used for both mortality and PLOS, but the paper does not test PLOS-trained or multi-task encoders, nor does it compare frozen embeddings vs end-to-end fine-tuning. For mortality, the two-stage “pretrain on mortality then train another model for mortality” pipeline is not clearly motivated relative to a single end-to-end model (Sec. 2.4, Sec. 3.3–3.4).
Recommendation: Add experiments (as feasible) training (i) a PLOS-proxy encoder, (ii) a multi-task encoder (mortality+PLOS heads), and (iii) at least one end-to-end fine-tuning setup where the encoder is updated for the downstream task (Sec. 3.4). If compute prohibits this, expand Sec. 4 to explicitly acknowledge that the negative result may be specific to mortality-proxy + frozen-embedding transfer and may not generalize to end-to-end or task-aligned training.
-
Baseline 2 is central to the conclusions but is under-specified, and fairness of the comparison is unclear: the value of $N$ is inconsistent/vague (“$200$–$300$” vs fixed $N=200$), selection criteria may leak information if computed on all data, and it is unclear whether principal and POA=$'Y'$ top-$N$ lists are separate, how overlaps are handled, and whether principal diagnoses are included in the POA=$'Y'$ multi-hot (Sec. 2.6.2, Sec. 3.4).
Recommendation: Rewrite Sec. 2.6.2 to precisely define Baseline 2: fixed $N$ (or a small tuned set), separate vs joint vocabularies for principal and POA=$'Y'$, whether principal is included in POA=$'Y'$, handling of overlaps/ties, and explicit statement that ranking/frequency counts use training data only. Consider adding a slightly stronger but still “simple” baseline that uses dx3$\times$POA cross-features (e.g., hashing) to more directly mirror what the Transformer could represent, and report its performance in Sec. 3.4.
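A precise definition could be accompanied by something like the following sketch, in which the top-$N$ vocabulary is ranked on training records only and ties are broken deterministically. The ranking criterion (number of training records containing the code) and the alphabetical tie-break are assumptions, since Sec. 2.6.2 does not currently specify them.

```python
from collections import Counter
from typing import Iterable, List, Sequence


def fit_top_n_codes(train_code_lists: Iterable[Sequence[str]], n: int) -> List[str]:
    """Rank codes by the number of TRAINING records containing them; keep the top n.

    Ties are broken alphabetically so the vocabulary is reproducible.
    """
    counts = Counter(code for codes in train_code_lists for code in set(codes))
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [code for code, _ in ranked[:n]]


def multi_hot(codes: Sequence[str], vocabulary: Sequence[str]) -> List[int]:
    """Encode one record against a fixed vocabulary; out-of-vocabulary codes are ignored."""
    present = set(codes)
    return [1 if code in present else 0 for code in vocabulary]


train_records = [["I10", "E11"], ["I10"], ["E11", "J18"], ["I10", "N17"]]
vocab = fit_top_n_codes(train_records, n=3)
print(vocab)                            # ['I10', 'E11', 'J18']
print(multi_hot(["N17", "I10"], vocab))  # [1, 0, 0]
```

The same vocabulary, fit once on training data, would then be applied unchanged to validation and test records; separate calls would produce the separate principal and POA=$'Y'$ vocabularies if that is the intended design.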
-
Evaluation lacks uncertainty, calibration, and operating-point reporting, which limits the strength and practical meaning of the model comparisons, especially where differences are modest (as for PLOS). The manuscript also lists many metrics (F1, specificity, Brier score) but does not systematically report them (Sec. 2.7, Sec. 3.4).
Recommendation: In Sec. 3.4, add bootstrap $95\%$ CIs for AUC-ROC and PR-AUC (and optionally for PR-AUC differences) and indicate which pairwise differences are statistically meaningful. Add calibration evaluation (e.g., reliability plot + Brier score or ECE) at least for the best baseline vs attention model. If claiming admission-time triage utility, report clinically interpretable operating points (e.g., precision at fixed recall) alongside PR-AUC. Align Sec. 2.7 with what is actually reported, or move the full metric table to an appendix.
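A percentile bootstrap over test-set rows is sufficient for the requested intervals. The sketch below is stdlib-only and shows AUC-ROC; the same `bootstrap_ci` wrapper would accept a PR-AUC function in place of `auc_roc`. The resample count, seed, and function names are illustrative assumptions.

```python
import random
from typing import Callable, List, Sequence, Tuple


def auc_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC-ROC via the Mann-Whitney rank statistic, with average ranks for ties."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC-ROC needs both classes")
    pos_rank_sum = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def bootstrap_ci(labels: Sequence[int], scores: Sequence[float],
                 metric: Callable[[List[int], List[float]], float],
                 n_boot: int = 1000, alpha: float = 0.05,
                 seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI; resamples containing a single class are redrawn."""
    if len(set(labels)) < 2:
        raise ValueError("need both classes in the evaluation set")
    rng = random.Random(seed)
    n = len(labels)
    stats: List[float] = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue
        stats.append(metric(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(auc_roc(y_true, y_score))  # 0.75
print(bootstrap_ci(y_true, y_score, auc_roc, n_boot=200))
```

For comparing two models, resampling the same row indices for both and bootstrapping the metric difference gives a paired interval, which is usually tighter than comparing two separate intervals.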