Ngmix v2.0 (CI mirror of #740)#741
Conversation
…to ngmix_update
- bin/ scripts were untracked, causing Docker build to fail - Fix license field to use SPDX string format (MIT) to resolve SetuptoolsDeprecationWarning Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Review-round summary. Every open thread above now has a reply; here's the shape of what landed on On this branch:
Companion PRs (the r50/T thread):
Awaiting your call (methodology, no code pushed): the — Claude on behalf of Cail |
merge_headers writes TILE_ID as the first key of the tile-level log_exp_headers<tile>.sqlite, and make_post_process derived n_hdu from the first key's value — len(tile_id_string) instead of the CCD count — so every epoch on the unscanned CCDs was silently dropped from N_EPOCH/EPOCH_* (and hence from ME vignets and shape measurement). Regression test builds the tile-mode sqlite via merge_headers and asserts an object on the last CCD keeps all its epochs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
ngmix 2.x run_fitter returns flags != 0 on failure instead of raising, and the failed result carries none of the measurement keys (g, g_cov, T, T_err, flux, flux_err, s2n); compile_results indexed them directly, so a single failed object crashed the whole tile with a KeyError at save time. Failed types are now recorded as NaN with their flags preserved.
With ignore_failed_psf=True a failed PSF epoch stays in obsdict carrying only flags/pars, so reading result['T'] KeyError-dropped the whole object even when the shear fit succeeded on the surviving epochs. The average now skips flags != 0 epochs (all-failed still hits the wsum == 0 guard), and n_epoch_model counts surviving epochs instead of submitted ones.
The rewrite hard-coded res['mcal_flags'] = 0, so the NGMIX_MCAL_FLAGS column written by make_cat was constant-zero and any mcal_flags == 0 quality cut passed every object, failed fits included. Restores the v1 contract: mcal_flags = bitwise OR of all per-type fit flags, so failed objects (now NaN-recorded rather than crashing) carry nonzero flags.
The cfis_simu configs still used the removed LOG_WCS/ME_LOG_WCS options, so the runners' positional reads (ngmix input_file_list[6], mccd_interp/ vignetmaker [1], sextractor [-1] with MAKE_POST_PROCESS) would IndexError or grab the wrong file. Migrated them to the merge_headers_runner input pattern used by example/cfis; sextractor exposure runs gain an explicit FILE_EXT so the 3-entry pattern override no longer mismatches the runner's 4-entry default. Also renamed stale run_sp_exp_Mh references in the cfis templates to run_sp_tile_Mh_exp, the name config_tile_Mh_exp.ini actually produces.
The column now stores ngmix 2.x's nfev (solver function-evaluation count, ~tens-hundreds, -1 on some failures), not the v1 1-5 retry count; the old name misrepresented the value. No downstream consumer reads it: make_cat's _save_ngmix_data never touches it, and the only ntry matches in sp_validation are base64 image blobs in notebook outputs.
copyfile was orphaned by the resume-path removal; Tile_cat's size/e/ theta attributes were read from the catalog but never consumed anywhere in src/ or scripts/. get_noise stays: scripts/jupyter/ test_centroid_shift.py imports and calls it.
Review — part 3 (fresh pass)Provenance: same convention as parts 1–2 — this is a fresh full-diff pass by Claude working on candide, against head b2dcd79. Every finding below was empirically demonstrated before being fixed, each fix carries a regression test confirmed red on the unfixed code, and the chain is now pushed to this branch ( Blockers (2)Both live in the v2.0 rewrite's tile flow, and both survive a green suite and an easy-object smoke run — which is exactly why they hid: one sits in a path no test exercised, the other only fires when a fit fails. The fixes restore what we read as the v1 contract, but they deserve your sanity-check of intent, Martin.
Should-fixes (pushed; please confirm intent)
Noted, not changed
Branch state, empiricallyWith the chain applied the container suite is 270 passed / 1 failed — the one failure is the known — Claude on behalf of Cail |
|
My response to the review: Blockers:
Should-fixes
Noted
|
uv.lock conflict resolved by regenerating from the merged pyproject (uv lock): branch's ngmix>=2.4 / esheldon v2.4.0 pin preserved, dependabot lockfile-group bumps from #742 absorbed. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Same latent FILE_PATTERN/FILE_EXT length mismatch 3eb6a66 fixed elsewhere: the 3-entry FILE_PATTERN override fell back on the decorator's new 4-entry FILE_EXT default, tripping the length check at startup. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
Replies point by point: Blockers — thanks for the empirical confirmation on the TILE_ID truncation; matches what we traced this afternoon (the cap is mcal_flags — intent confirmed, nothing further to do. Config files —
Incoming on this branch: a GalSim noise-injection validation of the background-RMS inverse-variance weights (per this afternoon's discussion) — known profile + known per-pixel noise, accuracy and pull-calibration checks, with a binary-weights control under spatially-varying noise so the test fails if the weight wiring were wrong. Numbers will be posted here when it lands. — Claude (Fable), on behalf of Cail |
…ducer run_sp_MiViSmVi was deleted with its config (ff612b7); the surviving background_rms vignet producers run as run_sp_tile_PiViVi (matching config_tile_Ng_batch_psfex_uc.ini). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The rms branch drew the metacal noise image at a scalar median(rms) while the weights claim per-pixel variance; under a strong RMS gradient (factor 8 across a 48px stamp) fixnoise then floods the low-noise half and the ivar weights measure WORSE than binary weights through metacal (noshear shape scatter 0.116 vs 0.078 over 50 paired realisations). Drawing the noise image from the rms map itself restores the advantage (0.053 vs 0.078); constant maps are bit-identical to the old behaviour. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Known Gaussian galaxies with noise of known per-pixel variance run through the module's own observation/weight building and fitter stack (do_ngmix_metacal's runner construction is factored into make_runners so the test fits with literally the module's configuration). Teeth: mean reduced chi2 = 1.00 only if the weights are the true inverse variance (probed breakages land at 3.0-1.5e5; ngmix's chi2-rescaled covariance makes pulls blind to this), and under a factor-8 RMS gradient the ivar weights must beat MegaPipe-style binary weights on paired realisations (scatter ratio 0.59 direct, 0.68 through metacal; inverted or transposed maps push it past 1.6). Accuracy and pull calibration asserted per case. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…pe, pull-tol derivation) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
The GalSim noise-injection validation has landed ( The finding. In What the test pins (N=50 seeded realizations per case, true g=(+0.03,−0.02), Gaussian galaxy × Gaussian PSF, observations built by the module's own
The teeth — injected wiring bugs, all caught: ×4 normalization → χ²/dof 4.0; 1/rms instead of 1/rms² → 13.2; inverted weights → 1.5×10⁵ (and scatter ratio > 1.6); transposed rms map → 13.9. One subtlety worth knowing: pull distributions alone can be laundered (ngmix rescales the parameter covariance by reduced χ²), so the absolute-normalization assertion rides on χ²/dof — the one statistic that can't hide a mis-scaled weight map. Honest boundaries: the test enters at 11 tests, ~35 s; full suite 274 passed, no regressions. This file is also the natural home for the r50/T semantics regression checks going forward. — Claude (Fable), on behalf of Cail |
|
One more finding from tonight's star-validation investigation (prep for tomorrow's session with Fabian), orthogonal to the weight work but real: the module's PSF-model fit is prior-dominated. PSF observations are built with unit-flux stamps and no weight map, so the likelihood is much weaker than the prior — a PSF drawn with g1 = 0.025 fits to g1 = 9×10⁻⁵ (HSM confirms the stamp itself carries g1 = 0.0241; removing the prior recovers g = (0.0241, −0.0144)). Consequences: the Possibly also relevant to Monday's m-bias discussion: the 2023 star-test failure (#656, still open) links to esheldon/ngmix#72 — metacal m ≈ −0.2 when the jacobian departs from a simple pixel scale. The WCS-passing question and the m-bias question may be the same thread; we're checking the WCS-delivery hypothesis against Fabian's run artifacts tomorrow. — Claude (Fable), on behalf of Cail |
Importing mpi4py initializes MPI at import time, which aborts the whole process when Open MPI detects a SLURM step environment but no PMI server — i.e. inside any srun-launched shell on a cluster whose container OMPI lacks SLURM PMI support. Even shapepipe_run -h dies before printing (#744; empirically bisected to SLURM_STEP_ID being set). Gate the import on launcher-set env vars (OMPI_COMM_WORLD_SIZE for mpirun, PMI_RANK for srun --mpi=pmi2, PMIX_RANK for srun --mpi=pmix). A bare shapepipe_run never touches MPI and runs SMP as before; mpirun launches are unchanged. Verified on candide: bare run under srun now exits 0 (was OPAL abort), mpirun -n 1 and -n 2 paths intact. Behavior note: a config with MODE = MPI launched without any MPI launcher now falls back to SMP instead of running single-rank MPI. Closes #744 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The else-branch of the mpi4py dependency line appended the whole list to itself (harmless only because DependencyHandler dedups); with the launcher gate this became the common path. And the regression test now surfaces the subprocess stderr on failure instead of swallowing it in a bare CalledProcessError. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Gate mpi4py import on MPI launcher environment so bare shapepipe_run inside an srun shell no longer aborts in MPI_Init. Same commits as PR #747; folded here so the fix ships with the ngmix v2.0 line.
The exclusive input-ID flag and the NUMBER_LIST config option converged in FileHandler._format_process_list and did the same thing for a single ID; NUMBER_LIST is now the one mechanism. Pipeline: - remove -e/--exclusive from args.py and its plumbing through run.py, FileHandler, and JobHandler (where it was stored but never used) - NUMBER_LIST entries are now validated against the input file numbers found on disk, preserving -e's early failure on a wrong ID: the run aborts at start-up instead of when a module first opens files - unit tests for the validation (subset passes, typo raises, no-list scan path unchanged) Canfar chain (script-level -e options are unchanged; one ID per headless job remains the interface): - job_sp_canfar.bash, job_sp_canfar_v2.0.bash, and init_run_exclusive_canfar.sh write NUMBER_LIST into a per-job config copy (set_config_number_list: insert-or-replace under [FILE], ID in numbering-scheme form: leading dash, dots->dashes) instead of passing -e to shapepipe_run. Side benefit: the processed ID is recorded in the config copied to the run's log directory. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Review hardening: set_config_number_list now verifies the key landed in the config copy and aborts otherwise — a config with no NUMBER_LIST line and no bare [FILE] header would previously install an unmodified copy and silently process every image. Also fix init_run_exclusive_canfar.sh's missing-ID guard, which tested an unset variable and never fired. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…exclusive Replace -e/--exclusive flag with NUMBER_LIST (#746)
What this is
A same-repo mirror of #740 (@martinkilbinger's "Ngmix v2.0"), pushed to a branch on
CosmoStat/shapepipeso that CI actually runs. All 57 commits are authored by Martin (and carry Lucy Baumont's and Axel Guinot's work) — pushing the branch preserves that authorship unchanged; the only thing this PR adds is a same-repo head so GitHub Actions fires thepull_requestworkflow without the fork-PR approval gate. #740 received no CI runs at all for this reason.Substance is identical to #740 — see that PR for the full description. In short: upgrade ngmix to 2.4.0 and adopt Lucy's new ngmix classes/interface; overhaul the shape-measurement module; centroid-bias fix + validation; v2.0 production-run plumbing.
Going forward, opening PRs directly on
CosmoStat/shapepipe(rather than from a fork) avoids this — fork PRs don't trigger our Docker-image CI without a maintainer approval that wasn't happening.Closes/supersedes #740 once CI is green (leaving that call to Martin).
Review
A detailed review is on its way (read against Martin's checklist plus a science-quality pass). Headline from exercising the new fitter against real ngmix 2.4.0 on candide: the metacal path runs end-to-end and shear recovery is unbiased at the few×10⁻⁴ level in m. Full notes to follow.
— Claude on behalf of Cail