fix(retention): batch raw-metrics prune so a backlog can drain by thejefflarson · Pull Request #64 · thejefflarson/watcher

thejefflarson · 2026-06-26T08:40:09Z

Problem (found live)

The prod DB hit 59 GB, of which the raw metrics table was 35 GB / 79.5 M rows — 95% older than the 6h raw window, zero dead tuples, never autovacuumed. Server logs showed why:

retention: pruned … spans / logs / metric_series_rollups   ✓
retention sweep failed: canceling statement due to statement timeout   ✗

Raw metrics are pruned last in each sweep via a single DELETE FROM metrics WHERE time < cutoff. With a backlog that one statement can't finish under the watcher role's statement_timeout, so it rolls back — every hour — and raw metrics grow unbounded while the earlier steps keep succeeding.

Fix

Delete in bounded ctid batches (50k rows each) in a loop, so a backlog drains across iterations and a steady-state sweep finishes after one short batch. Extracted as prune_raw_metrics(pool, raw_hours, batch).
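The loop shape described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `PRUNE_BATCH_SQL`, `drain_in_batches`, and `prune_in_memory` are hypothetical names, the SQL text is an assumed form of a ctid-bounded delete, and an in-memory `Vec` of timestamps stands in for the `metrics` table so the sketch runs without a database.

```rust
use std::convert::Infallible;

/// Assumed shape of one bounded batch (hypothetical, not copied from the PR):
/// $1 is the cutoff timestamp, $2 is the batch size.
pub const PRUNE_BATCH_SQL: &str = "\
DELETE FROM metrics
WHERE ctid IN (
    SELECT ctid FROM metrics WHERE time < $1 LIMIT $2
)";

/// Calls `delete_batch` repeatedly until it removes fewer than `batch` rows.
/// Each call stands in for one short DELETE statement, so no single statement
/// has to cover the whole backlog. Returns the total number of rows removed.
pub fn drain_in_batches<E>(
    batch: u64,
    mut delete_batch: impl FnMut(u64) -> Result<u64, E>,
) -> Result<u64, E> {
    // A zero batch would never make progress and never look "partial".
    assert!(batch > 0, "batch size must be positive");
    let mut total = 0;
    loop {
        let deleted = delete_batch(batch)?;
        total += deleted;
        if deleted < batch {
            // Partial (or empty) batch: nothing stale is left.
            return Ok(total);
        }
    }
}

/// In-memory stand-in for the table: `rows` holds row timestamps.
/// Returns (rows deleted, statements issued).
pub fn prune_in_memory(rows: &mut Vec<i64>, cutoff: i64, batch: u64) -> (u64, u32) {
    let mut statements = 0u32;
    let result = drain_in_batches(batch, |limit| -> Result<u64, Infallible> {
        statements += 1;
        let mut deleted = 0u64;
        // Drop at most `limit` rows older than the cutoff, like LIMIT $2.
        rows.retain(|&t| {
            if t < cutoff && deleted < limit {
                deleted += 1;
                false
            } else {
                true
            }
        });
        Ok(deleted)
    });
    match result {
        Ok(total) => (total, statements),
        Err(never) => match never {},
    }
}
```

With `batch = 1` over three stale rows, this sketch issues four statements: three full batches and one empty batch that ends the loop. With a 50k batch and a small steady-state backlog, the first batch is already partial, so it issues one.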

Test

retention_raw_metrics_drains_in_batches forces batch = 1 over a 3-row backlog — fails if the prune stops after one statement. Full suite (36) green locally against PG 17; clippy/fmt clean on 1.96.

Note

This stops the bleeding going forward. The existing 35 GB backlog still needs a one-time reclaim (batched delete of stale rows + pg_repack metrics) to return disk to the OS — done operationally, separate from this PR.

🤖 Generated with Claude Code

The raw `metrics` prune ran as a single `DELETE ... WHERE time < cutoff`. Once a backlog accumulates, that one statement exceeds the connection's `statement_timeout` and rolls back on every hourly sweep, so raw metrics are never pruned and grow unbounded. In prod this let the table reach 35 GB / 79 M rows (95% older than the 6h window) while the rest of the sweep (spans/logs/rollups) kept succeeding.

Delete in bounded `ctid` batches (50k) in a loop instead: a large backlog drains across iterations, and in steady state the first batch is already partial so the loop exits after one pass. Extracted as `prune_raw_metrics(pool, raw_hours, batch)`.

Test: `retention_raw_metrics_drains_in_batches` forces `batch = 1` over a 3-row backlog, which fails if the prune stops after a single statement. Full suite green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

thejefflarson merged commit ac950cc into main Jun 26, 2026
4 checks passed

thejefflarson deleted the fix-retention-batch-prune branch June 26, 2026 09:39

thejefflarson merged 1 commit into main from fix-retention-batch-prune