Skip to content

Ngmix v2.0 (CI mirror of #740)#741

Open
cailmdaley wants to merge 98 commits into
developfrom
ngmix_v2.0
Open

Ngmix v2.0 (CI mirror of #740)#741
cailmdaley wants to merge 98 commits into
developfrom
ngmix_v2.0

Conversation

@cailmdaley

Copy link
Copy Markdown
Contributor

What this is

A same-repo mirror of #740 (@martinkilbinger's "Ngmix v2.0"), pushed to a branch on CosmoStat/shapepipe so that CI actually runs. All 57 commits are authored by Martin (and carry Lucy Baumont's and Axel Guinot's work) — pushing the branch preserves that authorship unchanged; the only thing this PR adds is a same-repo head so GitHub Actions fires the pull_request workflow without the fork-PR approval gate. #740 received no CI runs at all for this reason.

Substance is identical to #740 — see that PR for the full description. In short: upgrade ngmix to 2.4.0 and adopt Lucy's new ngmix classes/interface; overhaul the shape-measurement module; centroid-bias fix + validation; v2.0 production-run plumbing.

Going forward, opening PRs directly on CosmoStat/shapepipe (rather than from a fork) avoids this — fork PRs don't trigger our Docker-image CI without a maintainer approval that wasn't happening.

Closes/supersedes #740 once CI is green (leaving that call to Martin).

Review

A detailed review is on its way (read against Martin's checklist plus a science-quality pass). Headline from exercising the new fitter against real ngmix 2.4.0 on candide: the metacal path runs end-to-end and shear recovery is unbiased at the few×10⁻⁴ level in m. Full notes to follow.

— Claude on behalf of Cail

Lucie Baumont and others added 30 commits January 9, 2023 14:07
- bin/ scripts were untracked, causing Docker build to fail
- Fix license field to use SPDX string format (MIT) to resolve
  SetuptoolsDeprecationWarning

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@cailmdaley

Copy link
Copy Markdown
Contributor Author

Review-round summary. Every open thread above now has a reply; here's the shape of what landed on ngmix_v2.0 since the review, and what still awaits your judgment.

On this branch:

  • Reproducibility — seeded rng threaded through the noise draws (22f1f0a), guarded by test_metacal_is_reproducible_with_fixed_seed.
  • Dead-code cleanup you greenlit (bd60dc8): CHECK_EXISTING_DIR resume path + get_last_id, the unreachable print, unused sextractor_e1e2, scripts/python/fitting.py, and the 51*51 literal (now v_flag_tmp.size). Dangling config entry pruned in 16789eb.
  • Inverse-variance weights per [NEW FEATURE] Weight Handling #604 (f466c98 + 6ddea5b, 4aa3b2a, 158986a, f79de46, 0bc6016): optional BKG_RMS_VIGNET_PATH gives per-pixel 1/(Fscale·RMS)² with gating masks; scalar σ_MAD fallback hardened; the chain pinned end-to-end by an integration test. The double-weighting flagged in review is gone.
  • Runner contract (d4ada3d, b2dcd79): sextractor_runner's decorator now declares the positional headers input; stale LOG_WCS docs fixed in both packages. No behavior change in any shipped config (all 12 override FILE_PATTERN).

Companion PRs (the r50/T thread):

Awaiting your call (methodology, no code pushed): the *_psfo columns carrying the metacal-reconvolved PSF vs. the documented psfex/mccd input PSF; the galaxy prior/FLUX_AUTO guess reused for the PSF fit; and whether you want an explicit STAMP_SIZE option despite vignetmaker owning the geometry upstream. The zero-pixel any→all change stays as you intended; a fractional-bad-pixel threshold is noted as a possible follow-up.

— Claude on behalf of Cail

cailmdaley and others added 7 commits June 11, 2026 03:06
merge_headers writes TILE_ID as the first key of the tile-level
log_exp_headers<tile>.sqlite, and make_post_process derived n_hdu from
the first key's value — len(tile_id_string) instead of the CCD count —
so every epoch on the unscanned CCDs was silently dropped from
N_EPOCH/EPOCH_* (and hence from ME vignets and shape measurement).
Regression test builds the tile-mode sqlite via merge_headers and
asserts an object on the last CCD keeps all its epochs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
ngmix 2.x run_fitter returns flags != 0 on failure instead of raising,
and the failed result carries none of the measurement keys (g, g_cov, T,
T_err, flux, flux_err, s2n); compile_results indexed them directly, so a
single failed object crashed the whole tile with a KeyError at save time.
Failed types are now recorded as NaN with their flags preserved.
With ignore_failed_psf=True a failed PSF epoch stays in obsdict carrying
only flags/pars, so reading result['T'] KeyError-dropped the whole object
even when the shear fit succeeded on the surviving epochs. The average
now skips flags != 0 epochs (all-failed still hits the wsum == 0 guard),
and n_epoch_model counts surviving epochs instead of submitted ones.
The rewrite hard-coded res['mcal_flags'] = 0, so the NGMIX_MCAL_FLAGS
column written by make_cat was constant-zero and any mcal_flags == 0
quality cut passed every object, failed fits included. Restores the v1
contract: mcal_flags = bitwise OR of all per-type fit flags, so failed
objects (now NaN-recorded rather than crashing) carry nonzero flags.
The cfis_simu configs still used the removed LOG_WCS/ME_LOG_WCS options,
so the runners' positional reads (ngmix input_file_list[6], mccd_interp/
vignetmaker [1], sextractor [-1] with MAKE_POST_PROCESS) would IndexError
or grab the wrong file. Migrated them to the merge_headers_runner input
pattern used by example/cfis; sextractor exposure runs gain an explicit
FILE_EXT so the 3-entry pattern override no longer mismatches the
runner's 4-entry default. Also renamed stale run_sp_exp_Mh references in
the cfis templates to run_sp_tile_Mh_exp, the name config_tile_Mh_exp.ini
actually produces.
The column now stores ngmix 2.x's nfev (solver function-evaluation
count, ~tens-hundreds, -1 on some failures), not the v1 1-5 retry
count; the old name misrepresented the value. No downstream consumer
reads it: make_cat's _save_ngmix_data never touches it, and the only
ntry matches in sp_validation are base64 image blobs in notebook
outputs.
copyfile was orphaned by the resume-path removal; Tile_cat's size/e/
theta attributes were read from the catalog but never consumed anywhere
in src/ or scripts/. get_noise stays: scripts/jupyter/
test_centroid_shift.py imports and calls it.
@cailmdaley

Copy link
Copy Markdown
Contributor Author

Review — part 3 (fresh pass)

Provenance: same convention as parts 1–2 — this is a fresh full-diff pass by Claude working on candide, against head b2dcd79. Every finding below was empirically demonstrated before being fixed, each fix carries a regression test confirmed red on the unfixed code, and the chain is now pushed to this branch (b2dcd793..05f1584e). I haven't hand-edited it line by line, so please read with that lens.

Blockers (2)

Both live in the v2.0 rewrite's tile flow, and both survive a green suite and an easy-object smoke run — which is exactly why they hid: one sits in a path no test exercised, the other only fires when a fit fails. The fixes restore what we read as the v1 contract, but they deserve your sanity-check of intent, Martin.

  1. TILE_ID silently truncates the post-process CCD scan (sextractor_script.py, make_post_process — fixed in fcd117f). The tile-level merge_headers (new on this branch) writes TILE_ID as the first key of log_exp_headers<tile>.sqlite; make_post_process derived n_hdu = len(f_wcs[key_list[0]]) — i.e. the length of the tile ID string instead of the CCD count. Every epoch on CCDs ≥ len(tile_string) silently vanishes from N_EPOCH/EPOCH_*, and from there out of ME vignets and shape measurement. Repro: tile "51" (len 2), 3 CCDs → an object on CCD 2 comes back N_EPOCH = 0 instead of 2. The fix filters the metadata key and sizes the scan from a real exposure entry; we audited every other log_exp_headers consumer (psfex_interp, vignetmaker, ngmix, mccd interp) — all do keyed [exp][ccd] access, so make_post_process was the only affected site, and it covers both its callers.

  2. compile_results KeyErrors on any failed fit type, discarding the whole tile at save (ngmix.py — fixed in 450f3c1). Re-verified against real ngmix 2.4.0: a failed run_fitter returns normally with flags=512 and keys [errmsg, flags, ier, model, nfev, pars, pars_cov, pars_err] — none of g/g_cov/T/T_err/flux/flux_err/s2n. The old results[idx][name]["flux"] access then KeyErrors, and because it happens at batch/final save, one hard object throws away every measured object in the tile. v1 never saw this because ngmix 1.x raised on failure, which the per-object try/except caught. Now failed types are recorded with their flags and NaN-filled measurement columns (a flags == 0 result missing s2n still KeyErrors, deliberately).

Should-fixes (pushed; please confirm intent)

  • mcal_flags was hardcoded to 0 (7a1b3b1). process() set res['mcal_flags'] = 0 unconditionally, so the NGMIX_MCAL_FLAGS column make_cat writes was constant zero and any mcal_flags == 0 quality cut passed everything. Restored to the v1 contract — bitwise OR of the per-type fit flags (new get_mcal_flags helper). Flagging explicitly: confirm this is the intended semantics for that column.
  • Failed-PSF epochs crashed (or would have contaminated) the multi-epoch PSF average (00e9f89). With ignore_failed_psf=True a failed-PSF epoch stays in the obs dict carrying only flags/pars; average_multiepoch_psf read its ['T'] → KeyError, dropping the object. It now skips flags != 0 epochs, and n_epoch_model is again the count of epochs that survived the PSF fit (the v1 meaning) rather than the number submitted.
  • Config drift, tutorial-facing (3eb6a66). The four example/cfis_simu configs driven by job_sp_simu.bash still used the removed LOG_WCS/ME_LOG_WCS options; they're migrated to the positional WCS-headers input contract, mirroring the migrated example/cfis equivalents. Same commit renames run_sp_exp_Mhrun_sp_tile_Mh_exp in five example/cfis configs (the run name config_tile_Mh_exp.ini actually produces) and adds explicit FILE_EXT where a 3-entry FILE_PATTERN override would trip the new 4-entry decorator default's length check at startup. Related sweep worth doing in this PR: example/cfis/config_exp_psfex.ini has the same latent FILE_PATTERN/FILE_EXT length mismatch.

Noted, not changed

  • moments_fail semantic drift (document-only): in v1 the column counted moments-initial-guess (get_guess) failures; on this branch it counts metacal types with nonzero fit flags. Same name, different meaning — flagged so downstream consumers know.
  • ntry_fitnfev_fit (436bcc8): the value written is now ngmix-2.x's nfev (function evaluations), not the v1 retry count (1–5), so the old name was misleading. We verified the column is unconsumed downstream (make_cat never reads it; no reference anywhere in sp_validation), hence an honest rename rather than a compatibility shim. A small dead-code sweep rode along (05f1584): the orphaned copyfile import and the never-read Tile_cat.size/.e/.theta; get_noise was kept — it has a live caller in scripts/jupyter/test_centroid_shift.py.

Branch state, empirically

With the chain applied the container suite is 270 passed / 1 failed — the one failure is the known test_metacal_is_reproducible_with_fixed_seed, which needs ngmix 2.x the test sandbox doesn't have. Under real ngmix 2.4.0 we ran do_ngmix_metacal directly: seed-reproducibility holds, and an injected shear is recovered at g1 = 0.01996 vs 0.02 after response correction.

— Claude on behalf of Cail

@martinkilbinger

Copy link
Copy Markdown
Contributor

My response to the review:

Blockers:

  1. TILE_ID truncation of exposure list. I confirmed, only 7 HDUs, one for each contributing [exposure-ccd], was added to the SExtractor output in post-processing. Good catch and fix.
  2. Good fix.

Should-fixes

  • mcal_flags: yes, bitwise-or is what is suppoed to be done
  • failed PSF epoch: Fix looks good.
  • config files: the obsolete example/cfis_simu and example/cfis/config_tile_MiViSmVi.ini are now removed
  • example/cfis/config_exp_psfex.ini: where is the problem exactly?

Noted

  • moments_fail: where is this being flagged?
  • ntry_fit → nfev_fit: sounds good.

cailmdaley and others added 2 commits June 11, 2026 14:55
uv.lock conflict resolved by regenerating from the merged pyproject
(uv lock): branch's ngmix>=2.4 / esheldon v2.4.0 pin preserved,
dependabot lockfile-group bumps from #742 absorbed.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Same latent FILE_PATTERN/FILE_EXT length mismatch 3eb6a66 fixed
elsewhere: the 3-entry FILE_PATTERN override fell back on the
decorator's new 4-entry FILE_EXT default, tripping the length check
at startup.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@cailmdaley

Copy link
Copy Markdown
Contributor Author

Replies point by point:

Blockers — thanks for the empirical confirmation on the TILE_ID truncation; matches what we traced this afternoon (the cap is len("<tile_id>") = 7 applied to the per-exposure CCD scan in make_post_process; the regression test pins an object on the last CCD keeping all epochs).

mcal_flags — intent confirmed, nothing further to do.

Config filesexample/cfis_simu actually arrived deleted via the develop merge I pushed (061ca3c — that merge also resolves the PR's conflicts: uv.lock regenerated from the merged pyproject, ngmix pin at upstream v2.4.0 preserved, the #742 dependabot bumps absorbed). Your config_tile_MiViSmVi.ini deletion rode in cleanly under it.

config_exp_psfex.ini — where exactly: [SEXTRACTOR_RUNNER] overrides FILE_PATTERN = image, weight, pipeline_flag (3 entries) with no FILE_EXT, so it falls back on the decorator's new 4-entry default ([".fits", ".fits", ".fits", ".sqlite"] since WCS headers became a positional input) — the pattern/ext length check then fails at startup. Fixed in 6f4901b with an explicit FILE_EXT = .fits, .fits, .fits, same treatment 3eb6a66 gave the other configs.

moments_fail — where it's flagged: only in the review text, no code marker — sorry for the ambiguous wording. The state: it's computed at ngmix.py:669 as the count of metacal types with nonzero fit flags, and consumed at make_cat.py:358 into the final catalogue. In v1 the same column counted moments-initial-guess failures — so the name no longer matches the meaning, and unlike ntry_fit it does reach the catalogue. Two options: (a) keep the name and document the new meaning, or (b) rename through ngmix + make_cat (e.g. mcal_types_fail). Your call — happy to do either.

Incoming on this branch: a GalSim noise-injection validation of the background-RMS inverse-variance weights (per this afternoon's discussion) — known profile + known per-pixel noise, accuracy and pull-calibration checks, with a binary-weights control under spatially-varying noise so the test fails if the weight wiring were wrong. Numbers will be posted here when it lands.

— Claude (Fable), on behalf of Cail

cailmdaley and others added 4 commits June 11, 2026 15:08
…ducer

run_sp_MiViSmVi was deleted with its config (ff612b7); the surviving
background_rms vignet producers run as run_sp_tile_PiViVi (matching
config_tile_Ng_batch_psfex_uc.ini).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The rms branch drew the metacal noise image at a scalar median(rms)
while the weights claim per-pixel variance; under a strong RMS gradient
(factor 8 across a 48px stamp) fixnoise then floods the low-noise half
and the ivar weights measure WORSE than binary weights through metacal
(noshear shape scatter 0.116 vs 0.078 over 50 paired realisations).
Drawing the noise image from the rms map itself restores the advantage
(0.053 vs 0.078); constant maps are bit-identical to the old behaviour.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Known Gaussian galaxies with noise of known per-pixel variance run
through the module's own observation/weight building and fitter stack
(do_ngmix_metacal's runner construction is factored into make_runners so
the test fits with literally the module's configuration). Teeth: mean
reduced chi2 = 1.00 only if the weights are the true inverse variance
(probed breakages land at 3.0-1.5e5; ngmix's chi2-rescaled covariance
makes pulls blind to this), and under a factor-8 RMS gradient the ivar
weights must beat MegaPipe-style binary weights on paired realisations
(scatter ratio 0.59 direct, 0.68 through metacal; inverted or transposed
maps push it past 1.6). Accuracy and pull calibration asserted per case.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…pe, pull-tol derivation)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@cailmdaley

Copy link
Copy Markdown
Contributor Author

The GalSim noise-injection validation has landed (c5a9cc4d, tightened in 6bae2a41) — and it earned its keep before arriving: it caught a real defect in the rms-weight path as previously merged (fixed in bb93c288, please review that one specifically).

The finding. In do_ngmix_metacal, the fixnoise noise image was drawn at a scalar median(rms). Under a strong noise gradient (factor-8 across the stamp) that floods the low-noise half with excess noise, and the ivar weights were measurably counterproductive end-to-end: shape scatter 0.116 (ivar) vs 0.078 (binary), ratio 1.48. Drawing the noise image per-pixel from the rms map restores the expected advantage: 0.053 vs 0.078 (ratio 0.684). Constant rms maps produce bit-identical draws to the old path, and bkg_rms=None is untouched.

What the test pins (N=50 seeded realizations per case, true g=(+0.03,−0.02), Gaussian galaxy × Gaussian PSF, observations built by the module's own make_ngmix_observation path):

  • Direct fits, ivar weights: mean g = (+0.0329, −0.0194) flat / (+0.0326, −0.0221) ramp — consistent with truth; pull std 1.08–1.12; χ²/dof = 1.0027 under both flat and factor-8 ramp noise.
  • Binary weights under the ramp (old MegaPipe-style): χ²/dof = 4.83, paired shape-scatter ratio ivar/binary = 0.586 (asserted < 0.8).
  • Metacal end-to-end (noshear): ivar/binary scatter ratio 0.684 (asserted < 0.9) after bb93c288.

The teeth — injected wiring bugs, all caught: ×4 normalization → χ²/dof 4.0; 1/rms instead of 1/rms² → 13.2; inverted weights → 1.5×10⁵ (and scatter ratio > 1.6); transposed rms map → 13.9. One subtlety worth knowing: pull distributions alone can be laundered (ngmix rescales the parameter covariance by reduced χ²), so the absolute-normalization assertion rides on χ²/dof — the one statistic that can't hide a mis-scaled weight map.

Honest boundaries: the test enters at make_ngmix_observation, so vignetmaker's upstream extraction/orientation of the BACKGROUND_RMS vignettes is not exercised (the transposed-map probe shows χ² would catch a misalignment injected at the module boundary). Metacal pull std sits at ~1.4–1.5 even with perfect weights and flat noise — fixnoise/deconvolution correlated-noise bookkeeping inside ngmix, independent of the weights — so pull-calibration assertions live on the direct-fit layer.

11 tests, ~35 s; full suite 274 passed, no regressions. This file is also the natural home for the r50/T semantics regression checks going forward.

— Claude (Fable), on behalf of Cail

@cailmdaley

Copy link
Copy Markdown
Contributor Author

One more finding from tonight's star-validation investigation (prep for tomorrow's session with Fabian), orthogonal to the weight work but real: the module's PSF-model fit is prior-dominated. PSF observations are built with unit-flux stamps and no weight map, so the likelihood is much weaker than the prior — a PSF drawn with g1 = 0.025 fits to g1 = 9×10⁻⁵ (HSM confirms the stamp itself carries g1 = 0.0241; removing the prior recovers g = (0.0241, −0.0144)). Consequences: the g_psfo/g_psf catalogue columns are biased round, and the deconvolution kernel is rounder than the true PSF — a PSF-leakage risk. Present on both the current tip and the pre-June branch state, so it is not from the recent fixes. Happy to open this as its own issue with the reproduction script (/n17data/cdaley/unions/scratch-wf/star-experiment/ on candide).

Possibly also relevant to Monday's m-bias discussion: the 2023 star-test failure (#656, still open) links to esheldon/ngmix#72 — metacal m ≈ −0.2 when the jacobian departs from a simple pixel scale. The WCS-passing question and the m-bias question may be the same thread; we're checking the WCS-delivery hypothesis against Fabian's run artifacts tomorrow.

— Claude (Fable), on behalf of Cail

cailmdaley and others added 3 commits June 11, 2026 17:14
Importing mpi4py initializes MPI at import time, which aborts the whole
process when Open MPI detects a SLURM step environment but no PMI server
— i.e. inside any srun-launched shell on a cluster whose container OMPI
lacks SLURM PMI support. Even shapepipe_run -h dies before printing
(#744; empirically bisected to SLURM_STEP_ID being set).

Gate the import on launcher-set env vars (OMPI_COMM_WORLD_SIZE for
mpirun, PMI_RANK for srun --mpi=pmi2, PMIX_RANK for srun --mpi=pmix).
A bare shapepipe_run never touches MPI and runs SMP as before; mpirun
launches are unchanged. Verified on candide: bare run under srun now
exits 0 (was OPAL abort), mpirun -n 1 and -n 2 paths intact.

Behavior note: a config with MODE = MPI launched without any MPI
launcher now falls back to SMP instead of running single-rank MPI.

Closes #744

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The else-branch of the mpi4py dependency line appended the whole list
to itself (harmless only because DependencyHandler dedups); with the
launcher gate this became the common path. And the regression test now
surfaces the subprocess stderr on failure instead of swallowing it in
a bare CalledProcessError.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Gate mpi4py import on MPI launcher environment so bare shapepipe_run
inside an srun shell no longer aborts in MPI_Init. Same commits as
PR #747; folded here so the fix ships with the ngmix v2.0 line.
The exclusive input-ID flag and the NUMBER_LIST config option converged
in FileHandler._format_process_list and did the same thing for a single
ID; NUMBER_LIST is now the one mechanism.

Pipeline:
- remove -e/--exclusive from args.py and its plumbing through run.py,
  FileHandler, and JobHandler (where it was stored but never used)
- NUMBER_LIST entries are now validated against the input file numbers
  found on disk, preserving -e's early failure on a wrong ID: the run
  aborts at start-up instead of when a module first opens files
- unit tests for the validation (subset passes, typo raises, no-list
  scan path unchanged)

Canfar chain (script-level -e options are unchanged; one ID per
headless job remains the interface):
- job_sp_canfar.bash, job_sp_canfar_v2.0.bash, and
  init_run_exclusive_canfar.sh write NUMBER_LIST into a per-job config
  copy (set_config_number_list: insert-or-replace under [FILE], ID in
  numbering-scheme form: leading dash, dots->dashes) instead of passing
  -e to shapepipe_run. Side benefit: the processed ID is recorded in
  the config copied to the run's log directory.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
cailmdaley and others added 2 commits June 11, 2026 23:09
Review hardening: set_config_number_list now verifies the key landed in
the config copy and aborts otherwise — a config with no NUMBER_LIST line
and no bare [FILE] header would previously install an unmodified copy
and silently process every image. Also fix init_run_exclusive_canfar.sh's
missing-ID guard, which tested an unset variable and never fired.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…exclusive

Replace -e/--exclusive flag with NUMBER_LIST (#746)
@cailmdaley cailmdaley mentioned this pull request Jun 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants