fix(retention): batch raw-metrics prune so a backlog can drain #64
Merged
The raw `metrics` prune ran as a single `DELETE ... WHERE time < cutoff`. Once a backlog accumulates, that one statement exceeds the connection's `statement_timeout` and rolls back on every hourly sweep, so raw metrics are never pruned and grow unbounded. In prod this let the table reach 35 GB / 79 M rows (95% older than the 6h window) while the rest of the sweep (spans/logs/rollups) kept succeeding.

Delete in bounded `ctid` batches (50k) in a loop instead: a large backlog drains across iterations, and in steady state the first batch is already partial so the loop exits after one pass. Extracted as `prune_raw_metrics(pool, raw_hours, batch)`.

Test: `retention_raw_metrics_drains_in_batches` forces `batch = 1` over a 3-row backlog, which fails if the prune stops after a single statement. Full suite green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Problem (found live)

The prod DB hit 59 GB, of which the raw `metrics` table was 35 GB / 79.5 M rows: 95% older than the 6h raw window, zero dead tuples, never autovacuumed. Server logs showed why:

Raw metrics are pruned last in each sweep via a single `DELETE FROM metrics WHERE time < cutoff`. With a backlog, that one statement can't finish under the `watcher` role's `statement_timeout`, so it rolls back every hour, and raw metrics grow unbounded while the earlier steps keep succeeding.

Fix

Delete in bounded `ctid` batches (50k) in a loop, so a backlog drains across iterations and steady state stays one short batch. Extracted as `prune_raw_metrics(pool, raw_hours, batch)`.

Test

`retention_raw_metrics_drains_in_batches` forces `batch = 1` over a 3-row backlog; it fails if the prune stops after one statement. Full suite (36) green locally against PG 17; clippy/fmt clean on 1.96.

Note

This stops the bleeding going forward. The existing 35 GB backlog still needs a one-time reclaim (batched delete of stale rows + `pg_repack metrics`) to return disk to the OS. That is done operationally, separate from this PR.

🤖 Generated with Claude Code
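
For readers who want to see the loop's termination rule in isolation, here is a minimal sketch of the batching described under Fix. It is not the PR's code. The database is replaced by a hypothetical `BatchDeleter` trait with an in-memory `FakeMetrics` implementation, `prune_in_batches` is an illustrative name rather than the real `prune_raw_metrics(pool, raw_hours, batch)`, and the SQL string is only an assumed shape for a `ctid`-bounded delete.

```rust
/// Assumed shape of the per-batch statement (not copied from the PR).
/// $1 = cutoff timestamp, $2 = batch size.
const BATCH_DELETE_SQL: &str = "DELETE FROM metrics \
     WHERE ctid IN (SELECT ctid FROM metrics WHERE time < $1 LIMIT $2)";

/// Hypothetical stand-in for the database: deletes up to `batch` rows older
/// than `cutoff` in one short statement and reports how many it removed.
trait BatchDeleter {
    fn delete_batch(&mut self, cutoff: i64, batch: usize) -> usize;
}

/// In-memory "table" of row timestamps, used only to exercise the loop.
struct FakeMetrics {
    times: Vec<i64>,
    statements: usize,
}

impl BatchDeleter for FakeMetrics {
    fn delete_batch(&mut self, cutoff: i64, batch: usize) -> usize {
        self.statements += 1;
        let mut deleted = 0;
        // Drop stale rows until the batch limit is reached; keep everything else.
        self.times.retain(|&t| {
            if t < cutoff && deleted < batch {
                deleted += 1;
                false
            } else {
                true
            }
        });
        deleted
    }
}

/// Delete stale rows in bounded batches until a batch comes back partial.
/// Returns the total number of rows removed.
fn prune_in_batches<D: BatchDeleter>(db: &mut D, cutoff: i64, batch: usize) -> usize {
    assert!(batch > 0, "batch must be positive or the loop cannot make progress");
    let mut total = 0;
    loop {
        let deleted = db.delete_batch(cutoff, batch);
        total += deleted;
        // A partial (or empty) batch means no stale rows remain.
        if deleted < batch {
            return total;
        }
    }
}

fn main() {
    println!("per-batch statement (assumed): {BATCH_DELETE_SQL}");

    // Backlog case, mirroring the test's setup: 3 stale rows, batch = 1.
    let mut backlog = FakeMetrics { times: vec![1, 2, 3, 100], statements: 0 };
    let removed = prune_in_batches(&mut backlog, 10, 1);
    println!("backlog: removed {removed} rows in {} statements", backlog.statements);

    // Steady state: the first batch is already partial, so one statement suffices.
    let mut steady = FakeMetrics { times: vec![5, 6, 100], statements: 0 };
    let removed = prune_in_batches(&mut steady, 10, 50_000);
    println!("steady: removed {removed} rows in {} statements", steady.statements);
}
```

With `batch = 1`, the backlog case needs four statements: three full batches and a final empty one that ends the loop. A prune that stops after a single statement would leave two stale rows behind, which is the failure the `batch = 1` test is designed to catch.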