Fix memory leak and stale-model results when hot-swapping PredictionEnginePool by kotlarmilos · Pull Request #7626 · dotnet/machinelearning

kotlarmilos · 2026-06-14T15:33:38Z

Description

This PR makes zero-downtime model hot-swapping in Microsoft.Extensions.ML safe under live traffic. Before this change, reloading a PredictionEnginePool had three problems. It leaked native memory, because the pool was built with new DefaultObjectPool, which never disposes the prediction engines it holds, so every reload abandoned a generation of native engines. It could also serve predictions from the old model, because a returned engine always went back to the current pool, so an engine rented against the previous model could be handed out by the new pool after a swap. And it risked disposing the shared model out from under other engines, because each engine was created with ownsTransformer set to true and so believed it solely owned the single shared transformer.

This PR builds each pool generation with DefaultObjectPoolProvider, which returns the framework's DisposableObjectPool for disposable types and disposes engines for us. It routes each rented engine back to the exact generation it came from using a ConditionalWeakTable and disposes the old generation atomically on swap. It creates engines with ownsTransformer set to false, so PoolLoader owns the shared model's lifetime per generation and disposes it only when no surviving generation still uses it. PredictionEnginePool now also implements IDisposable, guards against use after disposal, and disposes its loaders so the file and uri watchers are torn down. New tests cover serving across an in-flight hot-swap, loader disposal, and a regression test that forces overflow disposal and confirms the shared model is never disposed.

Copilot

Pull request overview

This PR improves the reliability of PredictionEnginePool hot-swapping under live traffic by ensuring pooled PredictionEngine instances are returned to (and disposed with) the correct pool generation, avoiding stale-model reuse and unbounded native-memory growth. It also adds test coverage for hot-swap behavior and pool disposal, and introduces an Azure DevOps “report green” pipeline for Build Analysis.

Changes:

Add a generation-owning DisposablePredictionEnginePool and update PoolLoader to route rentals/returns to the originating generation.
Make PredictionEnginePool disposable and add tests covering in-flight usage across a hot-swap and disposal behavior.
Add build/report-green.yml to emit a build.completed event for docs/workflow-only PRs.

Show a summary per file

File	Description
test/Microsoft.Extensions.ML.Tests/PredictionEnginePoolTests.cs	Adds tests for hot-swap behavior and pool disposal.
src/Microsoft.Extensions.ML/PredictionEnginePool.cs	Implements `IDisposable` and routes rent/return through `PoolLoader` generation-aware APIs.
src/Microsoft.Extensions.ML/PoolLoader.cs	Introduces generation-aware `Get`/`Return` and swaps generations on reload.
src/Microsoft.Extensions.ML/DisposablePredictionEnginePool.cs	New pool implementation that disposes retained engines and drains on generation disposal.
build/report-green.yml	Adds an AzDO no-op pipeline for Build Analysis on build-excluded PRs.

Copilot's findings

Files reviewed: 5/5 changed files
Comments generated: 6

codecov · 2026-06-14T17:19:50Z

Codecov Report

❌ Patch coverage is 71.25749% with 48 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.60%. Comparing base (548d4d0) to head (57cf192).
⚠️ Report is 3 commits behind head on main.

Files with missing lines	Patch %	Lines
src/Microsoft.Extensions.ML/PoolLoader.cs	56.60%	15 Missing and 8 partials ⚠️
...rc/Microsoft.Extensions.ML/PredictionEnginePool.cs	33.33%	17 Missing and 5 partials ⚠️
...t.Extensions.ML.Tests/PredictionEnginePoolTests.cs	96.15%	2 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #7626      +/-   ##
==========================================
+ Coverage   69.59%   69.60%   +0.01%     
==========================================
  Files        1484     1484              
  Lines      273606   273764     +158     
  Branches    27949    27969      +20     
==========================================
+ Hits       190410   190567     +157     
+ Misses      75832    75816      -16     
- Partials     7364     7381      +17

Flag	Coverage Δ
Debug	`69.60% <71.25%> (+0.01%)`	⬆️
production	`63.85% <49.43%> (+0.01%)`	⬆️
test	`89.64% <96.15%> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
...rosoft.Extensions.ML/PredictionEnginePoolPolicy.cs	`100.00% <100.00%> (+28.57%)`	⬆️
...t.Extensions.ML.Tests/PredictionEnginePoolTests.cs	`96.73% <96.15%> (-3.27%)`	⬇️
...rc/Microsoft.Extensions.ML/PredictionEnginePool.cs	`56.17% <33.33%> (+18.47%)`	⬆️
src/Microsoft.Extensions.ML/PoolLoader.cs	`62.85% <56.60%> (-7.15%)`	⬇️

... and 9 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

rosebyte · 2026-06-15T15:21:58Z

@kotlarmilos, I guess you already tried so asking mainly to get better context - there is a DisposableObjectPool already in the dotnet/aspnetcore and it should be actually used when we ask for a pool of a disposable type (see here) so is the problem really in the disposing and do we make the pool the right way or it's a bug in the pool we should report (not with fixing it temporarily with our own patched pool for sure)?

Edit: I actually see the place where we create DefaultObjectPool directly.

rosebyte

Left two comments to discuss.

…nginePool ## Summary Makes zero-downtime model hot-swapping in `Microsoft.Extensions.ML` safe under live traffic: `PredictionEnginePool` can now be reloaded continuously without leaking native memory, serving predictions from a stale model, or disposing a shared transformer out from under sibling engines. ## Problem `PredictionEngine<TData,TPrediction>` wraps native resources and is `IDisposable`. Three defects made hot-swap unsafe for a long-lived, high-throughput service: 1. **Native memory leak on reload.** `PoolLoader.LoadPool()` created the pool with `new DefaultObjectPool<>(...)` and, on reload, swapped in a new pool via `Interlocked.Exchange` without disposing the old one. `DefaultObjectPool<T>` never disposes the objects it retains, so every reload abandoned a generation of native engines to the GC finalizer (effectively unbounded growth at a 15-20 minute swap cadence). 2. **Cross-generation contamination (wrong-model results).** `ReturnPredictionEngine` always returned an engine to the *current* pool. An engine rented against the old model and returned after a swap landed in the *new* model's pool and was later handed to a caller, producing stale-model predictions. 3. **Shared-transformer double dispose.** `PredictionEnginePoolPolicy.Create()` built engines with the default `ownsTransformer: true`, so every engine in a generation believed it owned the single shared `ITransformer`. Once engines are actually disposed (e.g. on overflow), the first disposal tears down the shared model for all sibling engines. ## Change - `PoolLoader` now builds each pool generation with `DefaultObjectPoolProvider`, which returns the framework's `DisposableObjectPool` for `IDisposable` element types. That pool already disposes overflow engines, disposes engines returned after it has been disposed, and disposes everything it retains on `Dispose()` - so no custom pool type is needed. - `PoolLoader` is generation-aware: a `ConditionalWeakTable` records the generation each rented engine came from and routes it back to that exact generation on return, so an old-model engine is never mixed into the new pool. On reload the old generation is disposed atomically after the swap, and the generation's `ITransformer` is disposed with it (guarded by reference equality so a model still in use by the surviving generation is never disposed). - `PredictionEnginePoolPolicy.Create()` now passes `OwnsTransformer = false`; the shared model's lifetime is owned by `PoolLoader` (per generation) instead of by each engine. - `PredictionEnginePool` implements `IDisposable`, guards `GetPredictionEngine` against use after disposal, disposes (rather than throws on) engines returned after disposal, and disposes its loaders so the file/uri watchers and change-token registrations are torn down. ## Proof New tests: `pool_serves_predictions_across_a_hot_swap` (serve across an in-flight swap; stale engine not reused), `disposing_pool_releases_loader_resources`, and `pooled_engines_do_not_dispose_the_shared_model` (rents past the retention limit to force overflow disposal and asserts the shared transformer is never disposed; this test fails if `ownsTransformer` is left at its default of true). ``` Passed! - Failed: 0, Passed: 10, Skipped: 0, Total: 10 - Microsoft.Extensions.ML.Tests.dll (net8.0) ``` ## Risk Behavioral change limited to disposal/lifetime semantics. Public API addition only (`PredictionEnginePool` now `IDisposable`). No change to prediction results on the happy path; stale-model engines are simply no longer reused after a swap. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…Pool Make the _disposed flag an int updated via Interlocked.Exchange so Dispose runs exactly once, and read it with Volatile.Read on all hot paths to avoid races between rent/return, reload callbacks, and disposal. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

anicka-net

Reviewed the concurrency model — ConditionalWeakTable for generation tracking, idempotent dispose, OwnsTransformer=false to prevent shared model disposal. Tests cover hot-swap, double-dispose, and pool overflow. LGTM.

Copilot AI review requested due to automatic review settings June 14, 2026 15:33

Copilot started reviewing on behalf of kotlarmilos June 14, 2026 15:34 View session

dotnet-policy-service Bot assigned kotlarmilos Jun 14, 2026

kotlarmilos force-pushed the feature/extml-hot-swap-hardening branch from 25e5a39 to dbaddcf Compare June 14, 2026 15:35

Copilot AI reviewed Jun 14, 2026

View reviewed changes

kotlarmilos requested review from anicka-net and rosebyte June 14, 2026 15:47

kotlarmilos force-pushed the feature/extml-hot-swap-hardening branch from dbaddcf to e66c00f Compare June 14, 2026 15:53

build-analysis Bot mentioned this pull request Jun 14, 2026

The Operation will be canceled. The next steps may not contain expected logs. dotnet/dnceng#3008

Open

3 tasks

github-actions Bot mentioned this pull request Jun 15, 2026

🏥 Repo Health Dashboard — dotnet/machinelearning #7628

Open

rosebyte reviewed Jun 15, 2026

View reviewed changes

Comment thread src/Microsoft.Extensions.ML/PoolLoader.cs

rosebyte reviewed Jun 15, 2026

View reviewed changes

Comment thread src/Microsoft.Extensions.ML/PoolLoader.cs Outdated

rosebyte reviewed Jun 15, 2026

View reviewed changes

kotlarmilos force-pushed the feature/extml-hot-swap-hardening branch from e66c00f to d421aab Compare June 16, 2026 07:48

rosebyte reviewed Jun 16, 2026

View reviewed changes

Comment thread src/Microsoft.Extensions.ML/PoolLoader.cs Outdated

Comment thread src/Microsoft.Extensions.ML/PoolLoader.cs

Comment thread src/Microsoft.Extensions.ML/PoolLoader.cs

rosebyte approved these changes Jun 18, 2026

View reviewed changes

github-actions Bot mentioned this pull request Jun 18, 2026

🏥 Repo Health Dashboard — dotnet/machinelearning #7634

Open

anicka-net approved these changes Jun 18, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix memory leak and stale-model results when hot-swapping PredictionEnginePool#7626

Fix memory leak and stale-model results when hot-swapping PredictionEnginePool#7626
kotlarmilos wants to merge 2 commits into
dotnet:mainfrom
kotlarmilos:feature/extml-hot-swap-hardening

kotlarmilos commented Jun 14, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

codecov Bot commented Jun 14, 2026 •

edited

Loading

Uh oh!

rosebyte commented Jun 15, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

rosebyte left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

anicka-net left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

kotlarmilos commented Jun 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Copilot's findings

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

codecov Bot commented Jun 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

rosebyte commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

rosebyte left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

anicka-net left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

kotlarmilos commented Jun 14, 2026 •

edited

Loading

codecov Bot commented Jun 14, 2026 •

edited

Loading

rosebyte commented Jun 15, 2026 •

edited

Loading