Skip to content

Harden GPU macros against silent map(to:) no-op on declare target statics (AMD path) #1613

@sbryngelson

Description

@sbryngelson

Summary

Harden MFC's OpenMP GPU macro layer against a silent-wrong-answer footgun with declare target static variables. We are currently safe by convention, not by construction — a single misuse of GPU_ENTER_DATA(copyin=...) or GPU_DATA(copy=/copyin=...) on a declare target variable would silently drop the transfer with no diagnostic, on the AMD path only.

This already bit us once (the viscosity term silently disabled on the AMD-flang GPU build). Root cause was confirmed to be conformant OpenMP behavior, not a compiler bug — see ROCm/llvm-project#2890 and llvm/llvm-project#203711 (both closed as working-as-intended).

Background: the OpenMP semantics

A variable in a declare target directive (especially with SAVE / static storage) has a corresponding variable in the device data environment that is persistently present for the whole program (effectively infinite reference count). Host and device hold two separate copies and are not kept coherent automatically.

Consequences:

  • map(to:) / target enter data map(to:) on such a variable is a no-op — the presence check sees it already resident and copies nothing. The device copy keeps its program-start value.
  • The only ways to push a host-side value to the device are target update to(...) or map(always, to: ...).

CCE (ftn) and nvfortran copy on map(to:) anyway (lenient, non-conformant convenience), so code that relies on that produces correct results on those compilers and a silent wrong answer on the conformant AMD path, with no diagnostic.

Verification

Reproduced on a Frontier gfx90a (MI210) login node with both amdflang builds from the original report — ROCm 7.2.0 (LLVM 22) and AFAR 23.2.0 (LLVM 23). Three identical declare target static arrays, host-set to 2, pushed three different ways, read back from a kernel in a different translation unit:

enter data map(to:) : 0     <- stale (no-op)
target update to    : 2     <- correct
map(always,to:)     : 2     <- correct

Identical on both compilers. The variable is unified across translation units; the divergence is purely the documented difference between map(to:) (no-op on an already-present var) and update/always.

Why MFC is currently safe (by convention)

The original bug variable, Re_size (integer, dimension(2), viscous index-count), is handled correctly today:

  • Declared declare-target: src/simulation/m_global_parameters.fpp:284$:GPU_DECLARE(create='[Re_size, Re_size_max, Re_idx]')
  • Pushed via $:GPU_UPDATE(device='[Re_size, ...]'), which expands (omp_macros.fpp OMP_UPDATE) to !$omp target update to(...) — the conformant mechanism.
  • Re-issued in every consuming module right before the kernels: m_global_parameters.fpp:1036, m_viscous.fpp:41, m_pressure_relaxation.fpp:37, m_igr.fpp:102, m_riemann_solvers.fpp:3617, m_variables_conversion.fpp:360.

That per-TU re-sync via target update is exactly what sidesteps the staleness. So master is correct.

The latent footgun

In src/common/include/omp_macros.fpp, only attach and create carry the always modifier. These do not:

  • GPU_ENTER_DATA(copyin=...)target enter data map(to:...) (via OMP_COPYIN_STROMP_MAP_STR('to', ...))
  • GPU_DATA(copy=...)target data map(tofrom:...)
  • GPU_DATA(copyin=...)target data map(to:...)

If any of these is ever pointed at a declare target variable, the transfer silently no-ops — the same bug class that disabled the viscosity term. Nothing in the build, lint, or runtime flags it.

Proposed hardening (pick one or combine)

  1. Convention + lint (cheapest): document that declare-target statics must be synced with GPU_UPDATE(device=...), never GPU_ENTER_DATA(copyin=)/GPU_DATA(copyin=/copy=). Add a lint_source.py check that flags any variable appearing in both a GPU_DECLARE and a copyin/copy clause.
  2. Make the macros robust (strongest): add the always modifier to the copyin/copy expansions in omp_macros.fpp so these clauses can't silently drop a transfer on a present variable. Trade-off: always forces the copy unconditionally, which can cost bandwidth if such a clause is hit in a hot path. Acceptable for enter data (one-shot), riskier for GPU_DATA regions inside loops — scope carefully.
  3. Belt-and-suspenders: keep the convention of GPU_UPDATE(device=...) for all declare-target syncs (the de-facto rule already) and add the lint from (1) to enforce it.

Recommendation: (1) now (cheap, catches the real failure mode), consider (2) for enter data only.

Why do it

  • The failure mode is silent, compiler-specific, and produces physically wrong results (a disabled viscosity term passed unnoticed until caught by hand). AMD flang is a supported GPU backend; this will recur as more declare-target statics are added.
  • The fix is mechanical and the correct pattern already exists in the codebase — this is about making it enforced rather than remembered.

References

Metadata

Metadata

Assignees

Labels

enhancementNew feature or request

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions