Summary
Harden MFC's OpenMP GPU macro layer against a silent-wrong-answer footgun with declare target static variables. We are currently safe by convention, not by construction — a single misuse of GPU_ENTER_DATA(copyin=...) or GPU_DATA(copy=/copyin=...) on a declare target variable would silently drop the transfer with no diagnostic, on the AMD path only.
This already bit us once (the viscosity term silently disabled on the AMD-flang GPU build). Root cause was confirmed to be conformant OpenMP behavior, not a compiler bug — see ROCm/llvm-project#2890 and llvm/llvm-project#203711 (both closed as working-as-intended).
Background: the OpenMP semantics
A variable in a declare target directive (especially with SAVE / static storage) has a corresponding variable in the device data environment that is persistently present for the whole program (effectively infinite reference count). Host and device hold two separate copies and are not kept coherent automatically.
Consequences:
map(to:) / target enter data map(to:) on such a variable is a no-op — the presence check sees it already resident and copies nothing. The device copy keeps its program-start value.
- The only ways to push a host-side value to the device are
target update to(...) or map(always, to: ...).
CCE (ftn) and nvfortran copy on map(to:) anyway (lenient, non-conformant convenience), so code that relies on that produces correct results on those compilers and a silent wrong answer on the conformant AMD path, with no diagnostic.
Verification
Reproduced on a Frontier gfx90a (MI210) login node with both amdflang builds from the original report — ROCm 7.2.0 (LLVM 22) and AFAR 23.2.0 (LLVM 23). Three identical declare target static arrays, host-set to 2, pushed three different ways, read back from a kernel in a different translation unit:
enter data map(to:) : 0 <- stale (no-op)
target update to : 2 <- correct
map(always,to:) : 2 <- correct
Identical on both compilers. The variable is unified across translation units; the divergence is purely the documented difference between map(to:) (no-op on an already-present var) and update/always.
Why MFC is currently safe (by convention)
The original bug variable, Re_size (integer, dimension(2), viscous index-count), is handled correctly today:
- Declared declare-target:
src/simulation/m_global_parameters.fpp:284 → $:GPU_DECLARE(create='[Re_size, Re_size_max, Re_idx]')
- Pushed via
$:GPU_UPDATE(device='[Re_size, ...]'), which expands (omp_macros.fpp OMP_UPDATE) to !$omp target update to(...) — the conformant mechanism.
- Re-issued in every consuming module right before the kernels:
m_global_parameters.fpp:1036, m_viscous.fpp:41, m_pressure_relaxation.fpp:37, m_igr.fpp:102, m_riemann_solvers.fpp:3617, m_variables_conversion.fpp:360.
That per-TU re-sync via target update is exactly what sidesteps the staleness. So master is correct.
The latent footgun
In src/common/include/omp_macros.fpp, only attach and create carry the always modifier. These do not:
GPU_ENTER_DATA(copyin=...) → target enter data map(to:...) (via OMP_COPYIN_STR → OMP_MAP_STR('to', ...))
GPU_DATA(copy=...) → target data map(tofrom:...)
GPU_DATA(copyin=...) → target data map(to:...)
If any of these is ever pointed at a declare target variable, the transfer silently no-ops — the same bug class that disabled the viscosity term. Nothing in the build, lint, or runtime flags it.
Proposed hardening (pick one or combine)
- Convention + lint (cheapest): document that declare-target statics must be synced with
GPU_UPDATE(device=...), never GPU_ENTER_DATA(copyin=)/GPU_DATA(copyin=/copy=). Add a lint_source.py check that flags any variable appearing in both a GPU_DECLARE and a copyin/copy clause.
- Make the macros robust (strongest): add the
always modifier to the copyin/copy expansions in omp_macros.fpp so these clauses can't silently drop a transfer on a present variable. Trade-off: always forces the copy unconditionally, which can cost bandwidth if such a clause is hit in a hot path. Acceptable for enter data (one-shot), riskier for GPU_DATA regions inside loops — scope carefully.
- Belt-and-suspenders: keep the convention of
GPU_UPDATE(device=...) for all declare-target syncs (the de-facto rule already) and add the lint from (1) to enforce it.
Recommendation: (1) now (cheap, catches the real failure mode), consider (2) for enter data only.
Why do it
- The failure mode is silent, compiler-specific, and produces physically wrong results (a disabled viscosity term passed unnoticed until caught by hand). AMD flang is a supported GPU backend; this will recur as more declare-target statics are added.
- The fix is mechanical and the correct pattern already exists in the codebase — this is about making it enforced rather than remembered.
References
Summary
Harden MFC's OpenMP GPU macro layer against a silent-wrong-answer footgun with
declare targetstatic variables. We are currently safe by convention, not by construction — a single misuse ofGPU_ENTER_DATA(copyin=...)orGPU_DATA(copy=/copyin=...)on adeclare targetvariable would silently drop the transfer with no diagnostic, on the AMD path only.This already bit us once (the viscosity term silently disabled on the AMD-flang GPU build). Root cause was confirmed to be conformant OpenMP behavior, not a compiler bug — see ROCm/llvm-project#2890 and llvm/llvm-project#203711 (both closed as working-as-intended).
Background: the OpenMP semantics
A variable in a
declare targetdirective (especially with SAVE / static storage) has a corresponding variable in the device data environment that is persistently present for the whole program (effectively infinite reference count). Host and device hold two separate copies and are not kept coherent automatically.Consequences:
map(to:)/target enter data map(to:)on such a variable is a no-op — the presence check sees it already resident and copies nothing. The device copy keeps its program-start value.target update to(...)ormap(always, to: ...).CCE (
ftn) andnvfortrancopy onmap(to:)anyway (lenient, non-conformant convenience), so code that relies on that produces correct results on those compilers and a silent wrong answer on the conformant AMD path, with no diagnostic.Verification
Reproduced on a Frontier gfx90a (MI210) login node with both
amdflangbuilds from the original report — ROCm 7.2.0 (LLVM 22) and AFAR 23.2.0 (LLVM 23). Three identicaldeclare targetstatic arrays, host-set to 2, pushed three different ways, read back from a kernel in a different translation unit:Identical on both compilers. The variable is unified across translation units; the divergence is purely the documented difference between
map(to:)(no-op on an already-present var) andupdate/always.Why MFC is currently safe (by convention)
The original bug variable,
Re_size(integer, dimension(2), viscous index-count), is handled correctly today:src/simulation/m_global_parameters.fpp:284→$:GPU_DECLARE(create='[Re_size, Re_size_max, Re_idx]')$:GPU_UPDATE(device='[Re_size, ...]'), which expands (omp_macros.fppOMP_UPDATE) to!$omp target update to(...)— the conformant mechanism.m_global_parameters.fpp:1036,m_viscous.fpp:41,m_pressure_relaxation.fpp:37,m_igr.fpp:102,m_riemann_solvers.fpp:3617,m_variables_conversion.fpp:360.That per-TU re-sync via
target updateis exactly what sidesteps the staleness. So master is correct.The latent footgun
In
src/common/include/omp_macros.fpp, onlyattachandcreatecarry thealwaysmodifier. These do not:GPU_ENTER_DATA(copyin=...)→target enter data map(to:...)(viaOMP_COPYIN_STR→OMP_MAP_STR('to', ...))GPU_DATA(copy=...)→target data map(tofrom:...)GPU_DATA(copyin=...)→target data map(to:...)If any of these is ever pointed at a
declare targetvariable, the transfer silently no-ops — the same bug class that disabled the viscosity term. Nothing in the build, lint, or runtime flags it.Proposed hardening (pick one or combine)
GPU_UPDATE(device=...), neverGPU_ENTER_DATA(copyin=)/GPU_DATA(copyin=/copy=). Add alint_source.pycheck that flags any variable appearing in both aGPU_DECLAREand acopyin/copyclause.alwaysmodifier to thecopyin/copyexpansions inomp_macros.fppso these clauses can't silently drop a transfer on a present variable. Trade-off:alwaysforces the copy unconditionally, which can cost bandwidth if such a clause is hit in a hot path. Acceptable forenter data(one-shot), riskier forGPU_DATAregions inside loops — scope carefully.GPU_UPDATE(device=...)for all declare-target syncs (the de-facto rule already) and add the lint from (1) to enforce it.Recommendation: (1) now (cheap, catches the real failure mode), consider (2) for
enter dataonly.Why do it
References
declare targetmodule variable not unified across translation units ROCm/llvm-project#2890 — closed, working-as-intended (maintainer @mjklemm confirmed the semantics)declare targetmodule variable not unified across translation units llvm/llvm-project#203711 — upstream twin, closed working-as-intendedsrc/common/include/omp_macros.fpp(OMP_ENTER_DATA,OMP_DATA,OMP_COPYIN_STR,OMP_UPDATE)Re_sizehandling acrossm_viscous.fpp,m_igr.fpp,m_riemann_solvers.fpp,m_pressure_relaxation.fpp,m_variables_conversion.fpp