-
External validity is limited by a single, very small empirical design ($N=10$, $1{,}000$ days, $60$-day rolling window) and one bespoke two-factor specification (PC1 market $+$ a hand-built tech long–short factor) (Sec. 2.1, Sec. 2.2.1, Sec. 3.1–3.2). With $N=10$ the setting is not “high-dimensional,” and factor models typically show their main benefits in larger universes and/or richer factor sets; conversely, the extreme condition numbers reported for the factor model may be idiosyncratic to this universe, window length, and factor definition. As written, the conclusions in Sec. 4 can read as broadly ruling out structural factor covariance models under heteroskedasticity, which is stronger than the current evidence supports.
Recommendation: Add robustness checks in Sec. 3 that vary at least: (i) window length (e.g., $40/60/120$ days), (ii) asset universe size/composition (e.g., expand to $30$–$50$ equities; try a different sector mix or market), and (iii) factor specification complexity (market-only; alternative sector split; optionally a standard style factor if available). Report how realized risk, condition numbers, and turnover move across these variants. If expansion is infeasible, explicitly narrow the claim in Sec. 4 to the studied small-universe/short-window setting and discuss why results might differ for larger universes where factor structure is typically stabilizing.
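For concreteness, the requested robustness grid could be organized as in the following minimal sketch, which uses synthetic one-factor data, a plain sample-covariance/minimum-variance pipeline, and hypothetical helper names rather than the paper's actual estimators; only the loop structure over (window, universe size) is the point:

```python
import numpy as np

rng = np.random.default_rng(0)

def min_var_weights(S):
    # Unconstrained minimum-variance weights w ∝ S^{-1} 1
    # (the paper's long-only constraint is omitted here for brevity)
    w = np.linalg.solve(S, np.ones(S.shape[0]))
    return w / w.sum()

def realized_risk(returns, window):
    # Rolling sample-covariance backtest: average next-day squared portfolio return
    T, N = returns.shape
    sq = []
    for t in range(window, T - 1):
        S = np.cov(returns[t - window:t].T) + 1e-10 * np.eye(N)  # tiny ridge
        w = min_var_weights(S)
        sq.append((w @ returns[t + 1]) ** 2)
    return float(np.mean(sq))

# Synthetic one-factor returns as a stand-in for the paper's data
T, N = 400, 10
f = rng.normal(0.0, 0.01, T)
returns = np.outer(f, rng.normal(1.0, 0.3, N)) + rng.normal(0.0, 0.01, (T, N))

grid = {(win, n): realized_risk(returns[:, :n], win)
        for win in (40, 60, 120) for n in (5, 10)}
for key in sorted(grid):
    print(key, f"{grid[key]:.3e}")
```

Reporting the same table for condition numbers and turnover (with the paper's actual estimators substituted for `realized_risk`) would make the sensitivity of the conclusions to window and universe explicit.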
-
Key methodological details are missing, limiting reproducibility and making it difficult to assess whether the factor model’s instability is intrinsic or implementation-induced (Sec. 2.1–2.2.2). Missing/unclear items include: the exact GARCH(1,1) mean specification and innovation distribution (Gaussian vs $t$), parameter constraints, estimation method, and whether GARCH parameters are re-estimated each window (Sec. 2.1); the exact construction of the technology subset and long–short factor (constituents, weights, normalization/standardization, time-invariance) and any subsequent scaling after Gram–Schmidt (Sec. 2.2.1); PCA preprocessing (demeaning, correlation vs covariance matrix); and which Ledoit–Wolf constant-correlation variant/target/intensity formula and software implementation are used (Sec. 2.2.2).
Recommendation: Expand Sec. 2.1–2.2 (or add an implementation appendix) specifying: (a) the full GARCH model (mean, distribution, estimation routine, re-estimation frequency, convergence handling); (b) factor definitions with an explicit ticker list for the tech leg, long/short weighting scheme, normalization (e.g., dollar-neutral and unit-variance), and whether factors/loadings are re-scaled after orthogonalization; (c) PCA computation details (demeaning, matrix choice, sign convention handling); and (d) the exact Ledoit–Wolf reference/variant and how $\delta_t$ and the constant-correlation target $F_t$ are computed, including library/code used. This will make the pipeline auditable and help interpret the source of numerical problems.
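To illustrate why item (d) matters, here is a minimal sketch of a constant-correlation shrinkage target $F_t$ and the convex combination with the sample covariance; the intensity $\delta$ is fixed at an arbitrary value here, whereas the paper should state the exact Ledoit–Wolf intensity formula and implementation it uses:

```python
import numpy as np

def constant_correlation_target(S):
    # Constant-correlation target F: sample variances on the diagonal,
    # the average off-diagonal sample correlation applied everywhere else
    s = np.sqrt(np.diag(S))
    corr = S / np.outer(s, s)
    n = S.shape[0]
    rbar = (corr.sum() - n) / (n * (n - 1))  # mean off-diagonal correlation
    F = rbar * np.outer(s, s)
    np.fill_diagonal(F, np.diag(S))
    return F

def shrink(S, delta):
    # Convex combination delta * F + (1 - delta) * S, delta in [0, 1]
    return delta * constant_correlation_target(S) + (1 - delta) * S

rng = np.random.default_rng(1)
X = 0.01 * rng.normal(size=(60, 10))  # 60-day window, 10 assets
S = np.cov(X.T)
Sigma = shrink(S, 0.5)                # delta = 0.5 is illustrative only
print(np.linalg.cond(S), np.linalg.cond(Sigma))
```

Note that the target preserves the sample variances on the diagonal; ambiguity about whether shrinkage is applied in innovation space or return space is exactly the kind of detail the appendix should pin down.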
-
There is a material inconsistency/ambiguity in the factor-model fit ($R^2$): the text reports $R^2$ values “mostly between $0.4$ and $0.6$” (Sec. 3.2), but Fig. 2 (as described in the unstructured report and noted in the structured report) appears to show values in the $0.83$–$1.00$ range, with spikes to exactly $1.0$. This matters because Secs. 3.2–4 use the level and stability of $R^2$ to argue that the factor span is adequate and that instability instead comes from $\Psi_t$ and the rescaling step. If $R^2$ is miscomputed, aggregated differently than stated, or affected by leakage/look-ahead, the causal narrative becomes unreliable.
Recommendation: Audit and reconcile the $R^2$ definition and plotting in Sec. 3.2 / Fig. 2: state precisely whether Fig. 2 shows (i) cross-sectional average of per-asset OLS $R^2$, (ii) a variance-explained ratio from PCA, or (iii) something else; clarify whether $R^2$ is in-sample within the $60$-day window or evaluated out-of-sample; and report summary stats (mean/median/IQR/min/max) across time and assets. Plot $R^2$ on the full $[0,1]$ y-axis (optionally with an inset) and investigate spikes to exactly $1.0$ (potential degenerate windows, near-collinearity, or implementation errors). If any look-ahead is present, correct it and update the conclusions in Sec. 4 accordingly.
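Definitions (i) and (ii) generally yield different numbers, which is why Fig. 2 must say which one it plots. A small synthetic illustration (hypothetical variable names, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(2)
T, N, K = 60, 10, 2
F = rng.normal(0.0, 0.01, (T, K))            # factor returns in the window
B = rng.normal(0.0, 1.0, (N, K))             # loadings
R = F @ B.T + rng.normal(0.0, 0.01, (T, N))  # asset returns

# (i) cross-sectional average of per-asset in-sample OLS R^2
X = np.column_stack([np.ones(T), F])
r2 = []
for i in range(N):
    beta, *_ = np.linalg.lstsq(X, R[:, i], rcond=None)
    resid = R[:, i] - X @ beta
    r2.append(1.0 - resid.var() / R[:, i].var())
avg_ols_r2 = float(np.mean(r2))

# (ii) PCA variance-explained ratio of the top K components
evals = np.linalg.eigvalsh(np.cov(R.T))[::-1]
pca_ratio = float(evals[:K].sum() / evals.sum())

print(avg_ols_r2, pca_ratio)  # two distinct quantities in general
```

Either definition lives in $[0,1]$, but they weight assets and components differently, so a caption that conflates them can easily explain a $0.4$–$0.6$ vs $0.83$–$1.00$ discrepancy.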
-
Time-indexing around GARCH standardization and rescaling is ambiguous/inconsistent (Sec. 2.1–2.3; Eq. (1) vs narrative; Eqs. (3) and (5)). The manuscript describes one-step-ahead forecasts (for day $t+1$) but standardizes as $z_{i,t}=r_{i,t}/\hat\sigma_{i,t}$ (Eq. (1)) and rescales using $\mathrm{diag}(\hat\sigma_t)$ (Eqs. (3)/(5)). Without explicit timing, it is unclear whether $\Sigma_t$ used for weights targets $\mathrm{Cov}(r_{t+1}|F_t)$ or $\mathrm{Cov}(r_t|F_{t-1})$, and mismatched indices could also contribute to apparent instability.
Recommendation: Make timing explicit and consistent throughout Sec. 2: define whether $\hat\sigma_{i,t}$ denotes the conditional s.d. for $r_{i,t}$ given information at $t-1$, or the forecast for $r_{i,t+1}$ given information at $t$. Then update Eq. (1) and the rescaling in Eqs. (3)/(5) to use matching indices (e.g., use $\mathrm{diag}(\hat\sigma_{t+1|t})$ if $\Sigma_t$ is meant to forecast next-day return covariance). Add one sentence in Sec. 2.3 clarifying which $\Sigma$ is optimized to generate $w_t$ and which realized return $r_{t+1}$ evaluates it.
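A minimal sketch of the matched-index convention, assuming a plain Gaussian GARCH(1,1) filter with fixed (not estimated) parameters; the key point is that `sig2[t]` conditions only on information through $t-1$, so `sig2[T]` is the legitimate one-step-ahead forecast:

```python
import numpy as np

def garch_filter(r, omega, alpha, beta):
    # GARCH(1,1) conditional variances: sig2[t] = Var(r[t] | info at t-1),
    # so the recursion for sig2[t+1] uses only r[t] and sig2[t]
    T = len(r)
    sig2 = np.empty(T + 1)
    sig2[0] = omega / (1.0 - alpha - beta)  # unconditional variance as seed
    for t in range(T):
        sig2[t + 1] = omega + alpha * r[t] ** 2 + beta * sig2[t]
    return sig2  # sig2[:T] aligns with r; sig2[T] is the forecast for r[T]

rng = np.random.default_rng(3)
r = rng.normal(0.0, 0.01, 250)
sig2 = garch_filter(r, omega=1e-6, alpha=0.05, beta=0.90)

# Standardize with matched indices: z[t] = r[t] / sigma_{t|t-1}
z = r / np.sqrt(sig2[:len(r)])
# Rescaling a next-day innovation covariance should use sigma_{t+1|t}:
sigma_next = np.sqrt(sig2[-1])
print(z.std(), sigma_next)
```

If Eq. (1) standardizes with $\hat\sigma_{t|t-1}$ but Eqs. (3)/(5) rescale with $\hat\sigma_{t+1|t}$ (or vice versa), the mismatch is a one-line fix but should be stated explicitly.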
-
The diagnosis of the factor model’s extreme ill-conditioning is plausible but remains largely qualitative and under-identified: it is unclear whether instability originates in (i) innovation-space factor estimation, (ii) near-zero/noisy idiosyncratic variances $\Psi_t$, (iii) the GARCH rescaling step amplifying dispersion in vol forecasts, or (iv) PD/regularization/solver handling (Sec. 3.1–3.2, Sec. 4). The reported mean condition numbers ($\approx88{,}480$ for the factor model) are unusually large for a low-rank-plus-diagonal covariance unless some $\psi_{i,t}$ are extremely small or numerical handling is problematic.
Recommendation: Add targeted diagnostics in Sec. 3.2 to isolate the mechanism: (a) report condition numbers for innovation covariances $\Sigma_{z,t}$ (before rescaling) for both methods; (b) decompose factor covariance conditioning by reporting $\kappa(B_t \Omega_t B_t^\top)$, $\kappa(B_t \Omega_t B_t^\top+\Psi_t)$ in innovation space, and $\kappa$ after rescaling; (c) report the empirical distribution over time of diagonal $\psi_{i,t}$ (min/percentiles) and of $\hat\sigma_{i,t}$ (min/percentiles), and show whether spikes in $\kappa$ line up with extreme $\psi$ or $\sigma$; (d) implement minimal regularizations—e.g., floor $\psi_{i,t}\geq\epsilon$, shrink $\Psi_t$ toward a constant-diagonal target, or smooth $\Psi_t$ over time—and show the impact on $\kappa$, realized risk, and turnover. These additions would convert the narrative in Sec. 4 from conjecture to evidence.
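Item (d) can be demonstrated on synthetic low-rank-plus-diagonal matrices: when a few $\psi_i$ are near zero, the condition number explodes, and a simple floor collapses it by orders of magnitude (illustrative values, not the paper's estimates):

```python
import numpy as np

rng = np.random.default_rng(4)
N, K = 10, 2
B = rng.normal(size=(N, K))                 # factor loadings
Omega = np.diag([4.0, 1.0])                 # factor covariance
psi = rng.uniform(0.5, 1.5, N)
psi[:3] = 1e-8                              # several near-zero idiosyncratic variances

def cond(M):
    # Condition number via singular values (robust to tiny negative round-off)
    s = np.linalg.svd(M, compute_uv=False)
    return s[0] / s[-1]

factor_part = B @ Omega @ B.T               # rank-K, singular on its own
Sigma = factor_part + np.diag(psi)
Sigma_floored = factor_part + np.diag(np.maximum(psi, 1e-3))

print(f"kappa(Sigma)         = {cond(Sigma):.3e}")
print(f"kappa(Sigma_floored) = {cond(Sigma_floored):.3e}")
```

Reporting the analogous before/after $\kappa$ values on the paper's actual $\Psi_t$ series, alongside the distribution of $\min_i \psi_{i,t}$, would directly test whether mechanism (ii) drives the reported $\kappa\approx88{,}480$.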
-
Performance evaluation is not statistically characterized and the realized “variance” metric is potentially misinterpreted (Sec. 2.3.2, Sec. 3.1). The reported daily realized variance uses $w^\top r r^\top w = (w^\top r)^2$, which is a squared realized portfolio return (a second moment), not a variance estimator unless carefully aggregated and mean effects are addressed. In addition, the comparison relies mainly on time-series averages without dispersion measures, confidence intervals, or paired tests, so it is unclear whether the difference (e.g., $0.000126$ vs $0.000153$) is statistically/economically meaningful.
Recommendation: In Sec. 2.3.2, rename the metric as “squared realized return” (or explicitly justify interpreting its time-average as an out-of-sample second moment under a zero-mean approximation). Complement it with a standard out-of-sample variance estimate of portfolio returns over the backtest (or rolling realized variance of portfolio returns). In Sec. 3.1, add dispersion (SD/IQR) for realized risk, condition numbers, and turnover; compute confidence intervals for mean differences (e.g., block bootstrap over days); and run simple paired tests on daily squared returns. Optionally report basic return metrics (mean return, volatility, Sharpe) to contextualize whether lower risk coincides with comparable returns.
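The suggested interval for the mean daily difference can be obtained with a circular block bootstrap, sketched here on synthetic data (block length and replicate count are arbitrary choices for illustration):

```python
import numpy as np

def block_bootstrap_ci(d, block=10, n_boot=2000, alpha=0.05, seed=0):
    # Circular block bootstrap CI for the mean of a serially dependent series d,
    # e.g., daily differences in squared portfolio returns between two estimators
    rng = np.random.default_rng(seed)
    T = len(d)
    n_blocks = int(np.ceil(T / block))
    means = np.empty(n_boot)
    for b in range(n_boot):
        starts = rng.integers(0, T, n_blocks)              # random block starts
        idx = (starts[:, None] + np.arange(block)) % T      # wrap around the end
        means[b] = d[idx.ravel()[:T]].mean()
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return lo, hi

rng = np.random.default_rng(5)
d = rng.normal(0.00002, 0.0001, 1000)  # synthetic daily differences
lo, hi = block_bootstrap_ci(d)
print(lo, hi, "excludes 0:", lo > 0 or hi < 0)
```

Applied to the paper's daily squared-return differences, a CI that excludes zero would substantiate the $0.000126$ vs $0.000153$ comparison; a CI that straddles zero would warrant softening the claim.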
-
Figures and key result presentation contain omissions and potential errors that materially affect interpretability (Sec. 3.1–3.2). Figure 1 is described as multi-panel (variance/condition number/turnover) but appears incomplete; axis labels/units/time scale are unclear; and condition numbers likely require log scaling to be readable. Figure 2 has the $R^2$ discrepancy noted above and the y-axis treatment may visually overstate changes. These presentation issues impede verification of the main claims.
Recommendation: Rebuild Figure 1 as a true 3-panel figure (or separate clearly labeled subfigures) with explicit units (daily vs annualized), date axis, and a legend placed outside the plotting area; plot condition numbers on a $\log_{10}$ scale. For Figure 2, after reconciling $R^2$, use the full $[0,1]$ scale (optionally add an inset), label the x-axis with dates, and include summary statistics in the caption. Ensure captions state clearly: rolling window length, whether GARCH filtering is applied, and whether quantities are in innovation space or rescaled return space.
-
Portfolio optimization/PD handling is under-specified despite being central given the paper’s emphasis on ill-conditioning (Sec. 2.3.1, Sec. 3.1). With extreme condition numbers, results can depend heavily on whether $\Sigma_t$ is enforced to be PSD/PD (eigenvalue clipping, $\epsilon I$ jitter), how the QP is solved, and solver tolerances. Without these details, it is hard to attribute differences to covariance estimators rather than numerical optimization choices.
Recommendation: In Sec. 2.3.1, specify the solver/library used for the long-only QP, tolerances, and how non-PD or nearly singular $\Sigma_t$ is treated (symmetrization, eigenvalue clipping, ridge adjustment $\epsilon I$, using singular values for $\kappa$). Report how often PD fixes were needed under each estimator and whether any days were dropped. Consider adding weight-stability diagnostics (max weight, effective number of holdings $1/\sum w_i^2$) to connect ill-conditioning to economically meaningful portfolio concentration beyond turnover.
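A minimal sketch of eigenvalue clipping together with the proposed concentration diagnostics, using a toy near-singular covariance and a crude clip-and-renormalize step in place of the paper's long-only QP:

```python
import numpy as np

def make_pd(S, eps=1e-6):
    # Symmetrize, then clip eigenvalues from below at eps; report whether a fix occurred
    S = (S + S.T) / 2.0
    w, V = np.linalg.eigh(S)
    fixed = bool((w < eps).any())
    return V @ np.diag(np.clip(w, eps, None)) @ V.T, fixed

def concentration(wts):
    # Weight-stability diagnostics: max weight and effective number of holdings
    return float(wts.max()), float(1.0 / np.sum(wts ** 2))

# Nearly singular covariance: two almost-identical assets
S = 1e-4 * np.array([[1.0, 0.999, 0.2],
                     [0.999, 1.0, 0.2],
                     [0.2, 0.2, 1.0]])
S_pd, fixed = make_pd(S)
w = np.linalg.solve(S_pd, np.ones(3))
w = np.clip(w, 0.0, None)
w /= w.sum()                       # crude long-only normalization, not the paper's QP
max_w, eff_n = concentration(w)
print(fixed, round(max_w, 3), round(eff_n, 3))
```

Tabulating how often `fixed` fires under each estimator, together with max weight and $1/\sum_i w_i^2$ over time, would tie the conditioning story to economically meaningful portfolio behavior.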