Gate mpi4py import on MPI launcher environment (fixes #744) #747
cailmdaley wants to merge 2 commits into
Importing mpi4py initializes MPI at import time, which aborts the whole process when Open MPI detects a SLURM step environment but no PMI server, i.e. inside any srun-launched shell on a cluster whose container OMPI lacks SLURM PMI support. Even shapepipe_run -h dies before printing (#744; empirically bisected to SLURM_STEP_ID being set).

Gate the import on launcher-set env vars (OMPI_COMM_WORLD_SIZE for mpirun, PMI_RANK for srun --mpi=pmi2, PMIX_RANK for srun --mpi=pmix). A bare shapepipe_run never touches MPI and runs SMP as before; mpirun launches are unchanged.

Verified on candide: bare run under srun now exits 0 (was OPAL abort), mpirun -n 1 and -n 2 paths intact.

Behavior note: a config with MODE = MPI launched without any MPI launcher now falls back to SMP instead of running single-rank MPI.

Closes #744

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The else-branch of the mpi4py dependency line appended the whole list to itself (harmless only because DependencyHandler dedups); with the launcher gate this became the common path.

The regression test also now surfaces the subprocess stderr on failure instead of swallowing it in a bare CalledProcessError.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
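As a rough illustration of the second point, a subprocess-based check can put the child's stderr into the assertion message so that a failure is readable in the test log. This is a minimal sketch, not the code in test_run.py: the helper name `run_scrubbed` and the `LAUNCHER_VARS` tuple are hypothetical, and the child here only inspects its environment rather than importing shapepipe.

```python
import os
import subprocess
import sys

# Launcher variables named in the commit message; scrubbed before each run.
LAUNCHER_VARS = ("OMPI_COMM_WORLD_SIZE", "PMI_RANK", "PMIX_RANK")


def run_scrubbed(code, extra_env=None):
    """Run `code` in a child Python with launcher variables removed.

    `extra_env` is applied after scrubbing, so a test can set exactly
    the launcher variable it wants to exercise. (Hypothetical helper.)
    """
    env = {k: v for k, v in os.environ.items() if k not in LAUNCHER_VARS}
    env.update(extra_env or {})
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
    )
    # Include the child's stderr in the failure message instead of
    # raising a bare CalledProcessError.
    assert result.returncode == 0, (
        f"child exited with {result.returncode}:\n{result.stderr}"
    )
    return result.stdout.strip()
```

Calling the helper once with no extra environment and once with, say, `{"PMI_RANK": "0"}` covers both directions of the gate.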
An independent fresh-context review of the diff came back clean (no NameError path when [...]). One edge worth knowing, not a regression: [...]

Claude on behalf of Cail.
Gate mpi4py import on MPI launcher environment so bare shapepipe_run inside an srun shell no longer aborts in MPI_Init. Same commits as PR #747; folded here so the fix ships with the ngmix v2.0 line.
Fixes #744.
The mechanism
`src/shapepipe/run.py` does `from mpi4py import MPI` at module import, which initializes MPI immediately, even for `shapepipe_run -h`. Inside an `srun`-launched shell, Open MPI sees the SLURM step environment, decides it was direct-launched by srun, and looks for a PMI server that srun never started (the container's OMPI is not built with SLURM PMI support). MPI_Init aborts the whole process before the pipeline ever runs.

Empirically bisected on candide: unsetting `SLURM_STEP_ID` alone is enough to make the unpatched code work; that is the variable OMPI keys its srun-detection on. `mpirun` works because it brings its own PMIx server.

The fix
Only import (hence initialize) mpi4py when a launcher environment is actually present:

- `OMPI_COMM_WORLD_SIZE`: set by `mpirun`/`orterun`
- `PMI_RANK`: set by `srun --mpi=pmi2`
- `PMIX_RANK`: set by `srun --mpi=pmix`

A bare `shapepipe_run` (login node, compute-node shell, container) never touches MPI and runs SMP exactly as before; MPI launches are unchanged.

Verification (on candide, inside the `ngmix_v2.0` container under a real `srun` allocation)

- `shapepipe_run -h` under srun shell
- `mpirun -n 1 shapepipe_run -h` (the workaround from the issue)
- `mpirun -n 2` → import_mpi, ranks True, ranks 0/1

Plus a subprocess-based regression test (`test_run.py`) that scrubs/sets the launcher env and asserts the gate in both directions.

Behavior note: a config with `MODE = MPI` launched without any MPI launcher now falls back to SMP instead of running single-rank MPI. Arguably the saner behavior, but flagging it.

Claude on behalf of Cail.
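For readers who want the shape of the gate without opening the diff, it can be sketched roughly as follows. This is an illustration under assumptions, not the code in `src/shapepipe/run.py`: the names `MPI_LAUNCHER_ENV_VARS`, `launched_under_mpi`, and `maybe_import_mpi` are hypothetical, and only the three environment variables come from the description above.

```python
import os

# Launcher-set variables listed in the PR description.
MPI_LAUNCHER_ENV_VARS = ("OMPI_COMM_WORLD_SIZE", "PMI_RANK", "PMIX_RANK")


def launched_under_mpi(environ=None):
    """Return True if any MPI launcher variable is present.

    `environ` defaults to os.environ; passing a dict makes the check
    easy to exercise without touching the real environment.
    (Hypothetical helper name.)
    """
    env = os.environ if environ is None else environ
    return any(var in env for var in MPI_LAUNCHER_ENV_VARS)


def maybe_import_mpi(environ=None):
    """Import mpi4py only under a launcher; return the MPI module or None.

    The import is deliberately deferred into this function, because
    importing mpi4py.MPI is what initializes MPI.
    """
    if not launched_under_mpi(environ):
        return None  # bare run: never touch MPI, caller falls back to SMP
    from mpi4py import MPI
    return MPI
```

Note that `SLURM_STEP_ID` is intentionally absent from the tuple: it signals an srun step, not that a PMI server is available, so a sketch like this treats an environment containing only that variable as a non-MPI launch.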
🤖 Generated with Claude Code