-
The scope and limitations of the work are not always clearly stated: the Introduction and some later sections could be read as implying full equation discovery, and the strong phrasing around pressure‑gradient dominance and spectral‑method superiority may overstate what is supported by the presented diagnostics (Introduction, Sec. 1; Sec. 2.6; Sec. 3.3–3.4; Sec. 4).
Recommendation: In the final paragraph of the Introduction (Sec. 1) and in the Conclusions (Sec. 4), explicitly state that the paper focuses on generating and conditioning features and targets for subsequent sparse‑regression‑based equation discovery, which is deferred to future work. At the same time, moderate language around pressure dominance and derivative‑scheme choice to indicate that findings "suggest" or are "consistent with" certain interpretations, rather than being "unequivocal" or definitive.
-
The description of data normalization and standardization could be misinterpreted, particularly regarding the order of operations (e.g., conversion from $\rho$ to $\rho'$, computation of global statistics, and standardization of primary fields vs. derived features) and the mixed use of the terms "normalization" and "standardization" (Sec. 2.2; Sec. 2.5; Sec. 3.1; Sec. 3.5).
Recommendation: In Sec. 2.2 and Sec. 2.5, explicitly list the preprocessing steps (e.g., 1) compute $\rho' = \rho - 1$; 2) compute global means and standard deviations of $u, v, w, \rho'$ over all $x, y, z, t$; 3) standardize these primary fields; 4) construct features from standardized fields; 5) standardize each feature column independently). Use "normalization" only for operations like defining $\rho'$, and reserve "standardization" for mean‑zero, unit‑variance scaling. In Sec. 3.1 and Sec. 3.5, refer back to this procedure to reinforce consistency.
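The ordering recommended above could be documented with a short sketch such as the following. This is only an illustration under stated assumptions: the fields are synthetic stand-ins, the helper `standardize` and the two example feature columns are hypothetical, and nothing here is taken from the authors' code.

```python
import numpy as np

def standardize(a, axis=None):
    """Return a mean-zero, unit-variance copy of `a` along `axis` (None = global)."""
    mean = a.mean(axis=axis, keepdims=True)
    std = a.std(axis=axis, keepdims=True)
    return (a - mean) / std

rng = np.random.default_rng(0)
shape = (4, 8, 8, 8)  # (t, x, y, z); synthetic stand-in for the real fields
u, v, w = (rng.normal(size=shape) for _ in range(3))
rho = 1.0 + 0.002 * rng.normal(size=shape)

# 1) Normalization: density fluctuation about the unit mean.
rho_p = rho - 1.0
# 2)-3) Global statistics over all x, y, z, t, then standardize the primary fields.
u_s, v_s, w_s, rho_s = (standardize(f) for f in (u, v, w, rho_p))
# 4) Build features from the standardized fields (two illustrative columns only).
features = np.column_stack([(u_s * v_s).ravel(), (rho_s * w_s).ravel()])
# 5) Standardize each feature column independently.
features_s = standardize(features, axis=0)
```

Stating the procedure at this level of detail would let readers see at a glance that "normalization" refers only to step 1 and "standardization" to steps 3 and 5.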
-
The evaluation of temporal polynomial fits in Sec. 3.2 focuses on RMSE at two selected points (one "smooth" and one "sharp" identified via $G^2$), with no global statistics to demonstrate performance across the entire domain and all variables.
Recommendation: Extend Sec. 3.2 to report summary statistics of fit quality or derivative accuracy over a larger random sample or the full domain (e.g., distributions of RMSE per component, median/percentile values, or norms of residuals). This will support the claim that temporal derivative estimation is robust globally rather than only at two illustrative points.
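A global summary of this kind need not be expensive. The sketch below fits one polynomial per sampled point and reports RMSE percentiles. It rests on assumptions: the time series are synthetic, and the polynomial degree, sample size, and the use of a single fit over all time slices are placeholders rather than the paper's settings.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(1)
nt, n_points, degree = 10, 5000, 3  # assumed values, not the paper's settings
t = np.linspace(0.0, 1.0, nt)

# Synthetic stand-in: one time series per randomly sampled grid point.
amplitude = rng.normal(size=n_points)
series = np.sin(2 * np.pi * t)[:, None] * amplitude
series = series + 0.01 * rng.normal(size=(nt, n_points))

# Fit all series at once; coeffs has shape (degree + 1, n_points).
coeffs = P.polyfit(t, series, degree)
# polyval returns shape (n_points, nt) for 2-D coefficients, so transpose.
fitted = P.polyval(t, coeffs).T

# One RMSE per sampled point, then distribution summaries.
rmse = np.sqrt(np.mean((series - fitted) ** 2, axis=0))
median, p95, p99 = np.percentile(rmse, [50, 95, 99])
```

Reporting `median`, `p95`, and `p99` per velocity component (and for $\rho'$) alongside the two illustrative points would address this concern.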
-
The definition and use of the gradient‑energy metric $G^{2} = \sum_{i,j} (\partial u_i/\partial x_j)^2$ to distinguish smooth and sharp regions are introduced without specifying how the derivatives are computed (spectral or WENO5) and whether standardized or raw fields are used (Sec. 3.2).
Recommendation: Clarify in Sec. 3.2 the exact procedure for computing $G^2$: state which derivative scheme is used, which components are included, and whether derivatives are based on standardized or physical‑unit fields. Briefly justify this choice so readers can interpret the reported $G^2$ values in relation to flow structures and derivative quality.
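To make the request concrete, one possible definition is sketched below. It assumes a periodic cubic domain of side $2\pi$, FFT-based derivatives, and all nine components of the velocity gradient. The function names are hypothetical, no dealiasing or special Nyquist-mode handling is applied, and the authors' actual choice may differ.

```python
import numpy as np

def spectral_derivative(f, axis, length=2 * np.pi):
    """First derivative of a periodic field along `axis` via FFT."""
    n = f.shape[axis]
    ik = 2j * np.pi * np.fft.fftfreq(n, d=length / n)  # i * angular wavenumber
    shape = [1] * f.ndim
    shape[axis] = n
    f_hat = np.fft.fft(f, axis=axis)
    return np.real(np.fft.ifft(ik.reshape(shape) * f_hat, axis=axis))

def gradient_energy(u, v, w):
    """G^2 = sum over i, j of (d u_i / d x_j)^2 for 3-D fields."""
    return sum(spectral_derivative(c, axis) ** 2
               for c in (u, v, w) for axis in range(3))

# Analytic check: u = sin(x), v = sin(y), w = 0 gives G^2 = cos^2(x) + cos^2(y).
n = 16
x = np.linspace(0.0, 2 * np.pi, n, endpoint=False)
X, Y, Z = np.meshgrid(x, x, x, indexing="ij")
u, v, w = np.sin(X), np.sin(Y), np.zeros_like(X)
G2 = gradient_energy(u, v, w)
```

Whether `u`, `v`, `w` are passed in physical units or after standardization changes the magnitude of $G^2$, which is why the manuscript should state it.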
-
The construction of the feature library mentions "additional density‑weighted velocity gradient terms derived from $\rho'\nabla \mathbf{u}$" but does not explicitly enumerate these components or relate them clearly to the stated total of 26 features, leaving ambiguity about the exact feature set (Sec. 2.5; Sec. 3.5).
Recommendation: Add a table or explicit bullet list in Sec. 2.5 (or an appendix) that enumerates all 26 feature terms with their mathematical expressions, including the precise set of components from $\rho'\nabla \mathbf{u}$ (e.g., all 9 tensor components or a subset). In Sec. 3.5, reference this list and confirm that the same feature set is used in all analyses.
-
The dataset description and physical assumptions (nearly incompressible, low‑Mach/Boussinesq regime) are only briefly sketched and are not clearly tied to specific non‑dimensional parameters or to the governing equations of the data source (Sec. 2.1; Sec. 3.1).
Recommendation: In Sec. 2.1 and Sec. 3.1, provide a concise description of the data provenance (simulation code or experiment, governing equations, forcing, Reynolds number and any relevant non‑dimensional numbers, and whether a Boussinesq approximation is used). Connect observed density fluctuations (e.g., $\sim 0.2\%$ of the mean) to standard low‑Mach/Boussinesq criteria, citing appropriate references, and explicitly state which physical assumptions are adopted in constructing the feature library (e.g., constant viscosity, dominant buoyancy term).
-
The choice and interpretation of a constant term in the feature library are only briefly justified as accounting for mean offsets or uniform forcing, which may be unclear in a decaying incompressible flow on a periodic domain (Sec. 2.5; Sec. 3.5).
Recommendation: In Sec. 2.5, more clearly explain under what circumstances a constant term might arise effectively (e.g., mean pressure gradient or uniform body force) and emphasize that including a constant feature allows the sparse regression to detect and discard it if unnecessary. Optionally, note any empirical evidence from residuals that motivates its inclusion.
-
The momentum‑residual analysis in Sec. 3.4 is mainly qualitative; quantitative comparisons between residual magnitudes and modeled terms (e.g., advection, temporal derivative) are limited, which weakens quantitative support for the interpretation of dominant unmodeled forces.
Recommendation: Enhance Sec. 3.4 with quantitative diagnostics, such as distributions or norms of $|R_u|$ compared to $|\partial u/\partial t|$ and $|(\mathbf{u}\cdot\nabla)u|$, spatial correlation plots with density or kinetic energy, or ratios of residual to modeled term magnitudes. Use these metrics to support more nuanced statements about the relative importance of unmodeled contributions.
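As an indication of the diagnostics meant here, the sketch below computes RMS ratios and a correlation coefficient. All arrays are random placeholders, and the residual definition $R_u = \partial u/\partial t + (\mathbf{u}\cdot\nabla)u$ is an assumption made for illustration; the manuscript's own definition should be used.

```python
import numpy as np

def rms(a):
    """Root-mean-square value of an array."""
    return float(np.sqrt(np.mean(np.square(a))))

rng = np.random.default_rng(2)
shape = (16, 16, 16)
dudt = rng.normal(size=shape)        # placeholder for du/dt
advection = rng.normal(size=shape)   # placeholder for the x-component of (u . grad) u
rho_p = rng.normal(size=shape)       # placeholder for the density fluctuation

# Assumed residual definition (hypothetical): R_u = du/dt + (u . grad) u.
residual = dudt + advection

# Relative magnitude of the residual against each modeled term.
ratios = {
    "R_u / du_dt": rms(residual) / rms(dudt),
    "R_u / advection": rms(residual) / rms(advection),
}
# Pearson correlation between the residual and the density fluctuation.
corr = float(np.corrcoef(residual.ravel(), rho_p.ravel())[0, 1])
```

A small table of such ratios and correlations for each momentum component would give the interpretation of unmodeled forces a quantitative footing.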
-
Some implementation and documentation details that affect interpretability and reproducibility are only implicit, such as: (i) how temporal‑regression windows are defined at the first and last time slices; (ii) how aliasing or spectral ringing is handled in FFT‑based derivatives; and (iii) how the total number of spatio‑temporal points relates to the grid and time dimensions (Sec. 2.3–2.4; Sec. 3.3; Sec. 3.5).
Recommendation: In Sec. 2.3, explicitly describe the temporal windowing strategy for polynomial fits, including boundary handling. In Sec. 2.4 and Sec. 3.3, state whether any filtering or dealiasing is applied to FFT‑based derivatives and briefly discuss potential Gibbs phenomena near sharp gradients. In Sec. 3.5, explicitly connect the reported $20{,}971{,}520$ spatio‑temporal points to the $128^3$ grid and $10$ time slices for clarity.
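For reference, the reported count is consistent with the stated dimensions. With $N_x = N_y = N_z = 128$ grid points per direction and $N_t = 10$ time slices (symbols introduced here for illustration only):

$$N = N_x N_y N_z N_t = 128^3 \times 10 = 2{,}097{,}152 \times 10 = 20{,}971{,}520.$$

A one-line statement of this kind in Sec. 3.5 would suffice.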
-
Ethical and broader‑impact considerations of data‑driven PDE discovery are not discussed (Sec. 1–4). Some venues expect at least a brief statement, even when the application is confined to fluid dynamics data.
Recommendation: Add a short paragraph near the end of Sec. 4 noting that the work uses simulation/physical data only, carries no direct human or animal subjects implications, and briefly commenting on potential downstream impacts (e.g., improved modeling in engineering or climate applications) and any associated responsibilities.
-
In Figures 1 and 2, panel titles and legends contain implementation-centric or code-like details, legends are duplicated, y-axis ranges differ across panels, and data are distinguished by color alone, which may be hard for some viewers to read. Figure 2 also uses ambiguous abbreviations, and its fit lines may obscure the data markers.
Recommendation: Replace technical titles with concise scientific descriptions; use shared or compact legends; adopt consistent y-limits or annotate scales; differentiate data and fits with both color and shape/line style; clarify abbreviations and improve marker visibility.
-
Figures 3, 4, and 5 lack explicit details on grid resolution, domain size, boundary conditions, and normalization. Panel labels, axis units, and scale information are often missing or inconsistent. Figure 4 mixes spatial and distributional scopes, and Figure 5's right-column panels misrepresent distributions and lack zero reference lines.
Recommendation: Annotate grid/domain details and normalization in captions or panels; add panel labels and consistent axis units; clarify scope of distributions; replace line traces with true distributions and overlay zero reference lines.
-
Notation overload: the symbol $u$ is used both for the velocity vector $\mathbf{u}=(u,v,w)$ and for the $x$-component $u$, which makes expressions like $\nabla^2 u$ and $(u\cdot\nabla)u$ potentially ambiguous without context.
Recommendation: Use boldface (e.g., $\mathbf{u}$) for vectors and plain symbols for components ($u$,$v$,$w$), or adopt index notation ($u_i$) consistently when defining operators and residuals.
-
The viscosity-regression description ‘regressing $\partial u/\partial t$ against $\nabla^2 u$’ omits necessary assumptions for the implied relation (e.g., neglecting advection and pressure) and does not show the algebraic model being fit; as written, the mapping from Navier–Stokes form to that regression is not derivable from the paper alone.
Recommendation: State the regression model explicitly (e.g., $\partial u/\partial t \approx \nu\nabla^2 u + \epsilon$) and list which terms are omitted and why; clarify whether the regression is componentwise ($u$,$v$,$w$) and how the slopes are combined.
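For instance, if the intended model is a no-intercept least-squares fit applied per component and then averaged, it could be stated as in the sketch below. This is an assumed reading of the manuscript, not the authors' confirmed procedure; the function `fit_viscosity`, the synthetic data, and the averaging of slopes are all hypothetical.

```python
import numpy as np

def fit_viscosity(dudt, lap_u):
    """Least-squares slope nu in dudt ~ nu * lap_u (no intercept)."""
    x = lap_u.ravel()
    y = dudt.ravel()
    return float(np.dot(x, y) / np.dot(x, x))

rng = np.random.default_rng(3)
nu_true = 0.01  # synthetic ground truth used only for this check

estimates = []
for _ in range(3):  # one synthetic fit per component (u, v, w)
    lap = rng.normal(size=(16, 16, 16))
    # The noise term stands in for everything the model omits
    # (advection, pressure gradient, buoyancy).
    dudt = nu_true * lap + 1e-4 * rng.normal(size=lap.shape)
    estimates.append(fit_viscosity(dudt, lap))

# Assumed combination rule: simple average of the three slopes.
nu_hat = float(np.mean(estimates))
```

In real data the omitted terms are not small random noise, so the manuscript should explain why the slope can still be read as $\nu$, or restrict the fit to regions where the omitted terms are demonstrably weak.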
-
Several key quantitative claims (e.g., RMSEs, divergence RMS/max, viscosity regression, and exact standardization) are not verifiable from text alone and require underlying data/fields.
Recommendation: Provide the underlying arrays/feature matrix (or sufficient intermediate summaries) to allow recomputation of RMSEs, divergence statistics, regression outputs, and standardization properties.