Skip to content

Gate mpi4py import on MPI launcher environment (fixes #744)#747

Open
cailmdaley wants to merge 2 commits into
developfrom
fix/lazy-mpi-init
Open

Gate mpi4py import on MPI launcher environment (fixes #744)#747
cailmdaley wants to merge 2 commits into
developfrom
fix/lazy-mpi-init

Conversation

@cailmdaley

Copy link
Copy Markdown
Contributor

Fixes #744.

The mechanism

src/shapepipe/run.py does from mpi4py import MPI at module import, which initializes MPI immediately — even for shapepipe_run -h. Inside an srun-launched shell, Open MPI sees the SLURM step environment, decides it was direct-launched by srun, and looks for a PMI server that srun never started (the container's OMPI is not built with SLURM PMI support). MPI_Init aborts the whole process before the pipeline ever runs.

Empirically bisected on candide: unsetting SLURM_STEP_ID alone is enough to make the unpatched code work — that's the variable OMPI keys its srun-detection on. mpirun works because it brings its own PMIx server.

The fix

Only import (hence initialize) mpi4py when a launcher environment is actually present:

  • OMPI_COMM_WORLD_SIZE — set by mpirun/orterun
  • PMI_RANK — set by srun --mpi=pmi2
  • PMIX_RANK — set by srun --mpi=pmix

A bare shapepipe_run (login node, compute-node shell, container) never touches MPI and runs SMP exactly as before; MPI launches are unchanged.

Verification (on candide, inside the ngmix_v2.0 container under a real srun allocation)

scenario unpatched patched
bare shapepipe_run -h under srun shell OPAL abort, exit 1 exit 0
mpirun -n 1 shapepipe_run -h (the workaround from the issue) exit 0 exit 0
mpirun -n 2import_mpi, ranks True, ranks 0/1
bare on login node exit 0 (singleton init) exit 0 (no MPI touched)

Plus a subprocess-based regression test (test_run.py) that scrubs/sets the launcher env and asserts the gate in both directions.

Behavior note: a config with MODE = MPI launched without any MPI launcher now falls back to SMP instead of running single-rank MPI. Arguably the saner behavior, but flagging it.

— Claude on behalf of Cail.

🤖 Generated with Claude Code

Importing mpi4py initializes MPI at import time, which aborts the whole
process when Open MPI detects a SLURM step environment but no PMI server
— i.e. inside any srun-launched shell on a cluster whose container OMPI
lacks SLURM PMI support. Even shapepipe_run -h dies before printing
(#744; empirically bisected to SLURM_STEP_ID being set).

Gate the import on launcher-set env vars (OMPI_COMM_WORLD_SIZE for
mpirun, PMI_RANK for srun --mpi=pmi2, PMIX_RANK for srun --mpi=pmix).
A bare shapepipe_run never touches MPI and runs SMP as before; mpirun
launches are unchanged. Verified on candide: bare run under srun now
exits 0 (was OPAL abort), mpirun -n 1 and -n 2 paths intact.

Behavior note: a config with MODE = MPI launched without any MPI
launcher now falls back to SMP instead of running single-rank MPI.

Closes #744

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The else-branch of the mpi4py dependency line appended the whole list
to itself (harmless only because DependencyHandler dedups); with the
launcher gate this became the common path. And the regression test now
surfaces the subprocess stderr on failure instead of swallowing it in
a bare CalledProcessError.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@cailmdaley

Copy link
Copy Markdown
Contributor Author

An independent fresh-context review of the diff came back clean (no NameError path when import_mpi is False; no other unconditional mpi4py imports in the tree; all of this repo's launch paths — candide_mpi.sh, cc_mpi.sh, the documented mpiexec invocations — open the gate). Two of its nits are folded into dab2d5c1: the module_dep else-branch self-appended the whole list (pre-existing, but the gate made it the common path), and the regression test now surfaces subprocess stderr on failure.

One edge worth knowing, not a regression: srun -n4 shapepipe_run without a PMI-enabled MPI used to abort loudly; with the gate each rank now runs an independent SMP pipeline (N duplicate runs). A SLURM_NTASKS > 1-but-no-launcher warning would be a cheap follow-up if that ever bites.

— Claude on behalf of Cail.

cailmdaley added a commit that referenced this pull request Jun 11, 2026
Gate mpi4py import on MPI launcher environment so bare shapepipe_run
inside an srun shell no longer aborts in MPI_Init. Same commits as
PR #747; folded here so the fix ships with the ngmix v2.0 line.
@cailmdaley

Copy link
Copy Markdown
Contributor Author

Folded into ngmix_v2.0 (6b30fd4, same commits — true merge), so the #744 fix ships with the ngmix v2.0 line regardless of review order here. If #741 lands first, GitHub will mark this PR merged automatically; reviewing either place covers both.

— Claude on behalf of Cail.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] slurm on candide

1 participant