[BUG] - Fix three stale-cache bugs after swarm particle addition#216
[BUG] - Fix three stale-cache bugs after swarm particle addition#216bknight1 wants to merge 2 commits into
Conversation
Bug 1 — Stale kd-tree in add_particles_with_coordinates: add_particles_with_coordinates() uses self.dm.migrate() directly, bypassing Swarm.migrate() which calls _invalidate_canonical_data(). The manual invalidation at lines 3574-3577 nils _canonical_data but missed self._kdtree, causing RBF interpolation to use old particle coordinates/neighbors. Bug 2 — Stale cached projector in _project_to_work_variable: Cached Projection solvers on the mesh object were reused across evaluate() calls without _force_setup=True. After a Stokes solve modifies the DM, the cached projector's PETSc solver state is stale, causing deadlock on PETSc 3.22.2 and silent wrong results on 3.24.2. Bug 3 — Stale proxy mesh variable after swarm write (documented only): _update() marks proxy as stale, but the lazy re-interpolation (_update_proxy_if_stale()) only fires on material.sym access. Code reading the proxy MeshVariable DM directly (e.g. a Projection solver evaluating its uw_function) gets stale data. See GitHub issue #215. Underworld development team with AI support from Claude Code
There was a problem hiding this comment.
Pull request overview
Fixes two of three stale-cache bugs that occur after Swarm.add_particles_with_coordinates() is called: a stale kd-tree in the swarm, and a stale cached Projection solver used by uw.function.evaluate() for derivative/L2 expressions. A third bug (stale proxy MeshVariable DM after swarm writes) is documented via a TODO comment but not fixed; a workaround is recommended in the PR description. A new regression/stress test file is added.
Changes:
- Add
self._kdtree = Noneinvalidation inSwarm.add_particles_with_coordinates()so subsequent_get_kdtree()rebuilds against the new coordinates. - Force
_force_setup=Trueon the scalar and tensor cached projectors in_project_to_work_variable(), rebuilding SNES/DM state every call. - Add
tests/test_0112_swarm_add_particles.pycovering add-after-populate, proxy updates, MPI re-add cycles, kd-tree + save patterns, and a Stokes/projection/evaluate/write loop.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/underworld3/swarm.py | Adds kd-tree cache invalidation after the direct dm.migrate() in add_particles_with_coordinates, plus a TODO comment documenting the unfixed stale-proxy case. |
| src/underworld3/function/_function.pyx | Passes _force_setup=True to both cached Projection/MultiComponent_Projection solves to avoid stale solver state. |
| tests/test_0112_swarm_add_particles.py | New regression and stress tests for swarm particle addition, proxy updates, MPI patterns, and a Stokes+projection+evaluate cycle. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # _force_setup=True: rebuild solver state to avoid stale cached | ||
| # projector after Stokes/DM modifications (issue #215, Bug 2). | ||
| projector.solve(zero_init_guess=False, _force_setup=True) |
| # Invalidate cached data — particle count changed after addNPoints + migrate | ||
| self._particle_coordinates._canonical_data = None | ||
| self._kdtree = None # issue #215, Bug 1: stale kd-tree after add_particles_with_coordinates | ||
| for var in self._vars.values(): | ||
| if hasattr(var, "_canonical_data"): | ||
| var._canonical_data = None |
| @@ -3572,6 +3578,7 @@ def add_particles_with_coordinates(self, coordinatesArray) -> int: | |||
|
|
|||
| # Invalidate cached data — particle count changed after addNPoints + migrate | |||
| self._particle_coordinates._canonical_data = None | |||
| self._kdtree = None # issue #215, Bug 1: stale kd-tree after add_particles_with_coordinates | |||
| for var in self._vars.values(): | |||
| if hasattr(var, "_canonical_data"): | |||
| var._canonical_data = None | |||
| with swarm.access(material): | ||
| if n_added > 0: | ||
| material.data[old_size:old_size + n_added, 0] = material_init[re_add_mask, 0] |
| _log = [] | ||
| dbg = uw.mpi.rank == 0 and (lambda s: (_log.append(s), print(s, flush=True))[-1]) or (lambda s: None) |
Replace three manual canonical-data invalidation blocks with calls to _invalidate_canonical_data(), which now also marks swarm variable proxies as stale via _update(). This ensures the proxy mesh variable is re-interpolated after migration, particle addition, and snapshot restore. Affected callers: - Swarm.migrate() — already used _invalidate_canonical_data - add_particles_with_coordinates — replaced manual block - apply_snapshot_payload — replaced manual block - disk_snapshot.py — replaced manual block Underworld development team with AI support from Claude Code
Review — good diagnosis, but the projector fix is too blunt (memory regression)The three stale-cache bugs in #215 are real and the swarm-side work here is good. One change needs rework before merge. ✅ Good
|
Summary
Fixes three stale-cache bugs that occur when a swarm is modified via
add_particles_with_coordinates()and then used for interpolation or projection. These affect the common pattern of re-adding particles to empty cells each timestep.Fixes #215
Bug 1 — Stale kd-tree in
add_particles_with_coordinatesadd_particles_with_coordinates()callsself.dm.migrate()directly, bypassingSwarm.migrate()which calls_invalidate_canonical_data(). The manual invalidation inadd_particles_with_coordinatesset_canonical_data = Nonefor coordinates and variables but missedself._kdtree. After particle addition,swarm._get_kdtree()returned a kd-tree built from old coordinates.Fix: Added
self._kdtree = Nonein the invalidation block inadd_particles_with_coordinates.Bug 2 — Stale cached projector in
_project_to_work_variable_project_to_work_variable()cachesProjectionsolver instances on the mesh object (_eval_projector_scalarand_eval_{shape}_projector). When reused acrossevaluate()calls after a Stokes solve modifies the DM, the cached projectors PETSc solver state is stale. On PETSc 3.22.2 (Setonix HPC) this causes an MPI deadlock; on PETSc 3.24.2 (macOS) it silently returns stale results.Fix: Added
_force_setup=Trueto both the scalar (line 640) and tensor (line 613)projector.solve()calls.Bug 3 — Stale proxy mesh variable after swarm write (documented, no fix)
The lazy proxy update pattern marks the proxy as stale on data write (
_proxy_stale = True), but the actual re-interpolation (_rbf_to_meshVar) only fires whenmaterial.symis accessed or_update_proxy_if_stale()is called. Code that reads the proxy MeshVariable DM directly (e.g. aProjectionsolver evaluating itsuw_functionat quadrature points) reads stale data.Workaround: Call
material._update_proxy_if_stale()(or accessmaterial.sym) after modifying a swarm variable and before using its proxy mesh variable.Test
Added
tests/test_0112_swarm_add_particles.pywithtest_proxy_updates_after_add_particlesthat verifies the proxy mesh variable correctly reflects swarm changes after adding particles.Platform-dependent behavior
See issue #215 comment for details. PETSc 3.22.2 deadlocks on these bugs; 3.24.2 silently returns wrong values.
Underworld development team with AI support from Claude Code