Skip to content

Fix cloud dev-stack degradation under sustained MCP load#1280

Merged
RhysSullivan merged 6 commits into
mainfrom
fix/devserver-sse-otel-growth
Jul 3, 2026
Merged

Fix cloud dev-stack degradation under sustained MCP load#1280
RhysSullivan merged 6 commits into
mainfrom
fix/devserver-sse-otel-growth

Conversation

@RhysSullivan

@RhysSullivan RhysSullivan commented Jul 3, 2026

Copy link
Copy Markdown
Owner

Summary

The cloud e2e dev stack degrades after a few minutes of sustained suite load and eventually hangs. This is the CI cascade flake: MCP-heavy shards (toolkits-mcp and friends) run, then unrelated specs die with Playwright TimeoutErrors because requests hang rather than erroring. Nondeterministic, worse on 2-core runners.

Root cause

The degradation is postgres connection churn against the dev PGlite server, not SSE frame buffering or OTel span growth. The prior work (SSE forwarder cap in agents@0.17.3, per-isolate mem counters) hardened real but separate paths; the remaining growth was in the DB layer. Two compounding problems:

  1. apps/cloud/scripts/dev-db.ts used the default maxConnections: 1. PGLiteSocketServer defaults to one connection and answers every extra concurrent connection with "Too many connections" + an immediate socket close. (pglite-socket 0.1.4's published index.d.ts documents "default: 100", but the shipped runtime JS is maxConnections ?? 1 — verified in the shipped chunk.) The cloud worker opens a fresh postgres pool per request — the MCP auth seam (apps/cloud/src/mcp/auth.ts makeMcpOrganizationAuthServices) rebuilds one on every /mcp request to respect workerd's per-request I/O rule. Under concurrent load (exactly what the e2e suite generates against one shared stack), the second-and-later connects were rejected, and postgres.js reconnected in a tight loop against the closed socket. That reconnect storm piled up thousands of half-closed sockets, starved real queries, drove per-request latency into the tens of seconds, and hung the stack. PGlite runs queries serially anyway (its QueryQueueManager uses runExclusive), so allowing many connections is safe — they queue instead of storming.

  2. The postgres pool finalizer in apps/cloud/src/db/db.ts was fire-and-forget. Effect.runFork(sql.end({ timeout: 0 })) returned before the socket was torn down, so closing sockets accumulated faster than the server reaped them. Awaiting a graceful end() inside the acquireRelease scope bounds the live-plus-closing socket count to what is actually in flight. (Runs in the request's own scope, so still respects workerd's per-request I/O rule.) Make the test suite parallel-safe and speed up CI #1273 on main independently made this finalizer awaited but kept timeout: 0; this branch extends it with a graceful drain ceiling and extracts it as a tested closePostgres().

Why the SSE/OTel hypothesis was off: the cloud e2e stack's OTel BatchSpanProcessor is bounded (maxQueueSize: 2048) and flushed per request; the SSE forwarder is already capped by PR #1257's patch. Neither grows unbounded. The unbounded resource was file descriptors / sockets against PGlite, driven by the per-request DB pool pattern hitting a single-connection dev server.

Evidence

Reproduced on a booted cloud stack driven by sustained MCP session churn (connect → initialize → list-tools → close). Representative apples-to-apples run at a realistic e2e-like rate (3 concurrent clients, 1s think-time between sessions, ~135s):

Metric Baseline (maxConnections=1) Fixed
Sessions completed ~18 (stalls) 347
Avg latency / session ~15,000 ms ~125 ms
p95 latency 20,000–50,000 ms ~150–350 ms
"Too many connections" thousands 0
workerd → PGlite sockets climbing into 5 figures bounded

Baseline stalls (session count flat, latency climbing, oscillating hangs) — the CI symptom. Fixed is flat and sustained.

Changes

  • apps/cloud/scripts/dev-db.ts — set maxConnections on the dev-db PGLiteSocketServer (default 1000, overridable via DEV_DB_MAX_CONNECTIONS).
  • apps/cloud/src/db/db.ts — extract closePostgres() that awaits a graceful sql.end() (short drain ceiling, not timeout: 0); use it as the pool finalizer.
  • apps/cloud/src/db/db.close.test.ts — new unit tests characterizing the finalizer contract the old fire-and-forget close violated: non-zero drain window, awaits teardown across microtask AND real wall-clock delays, and teardown errors don't fail the request scope. Fast, no live DB.

Verification

  • bun run typecheck (cloud): clean.
  • oxlint on changed files: clean.
  • apps/cloud/src/db/* tests: 22 passed (incl. the new 4).

Notes / unconfirmed

  • Prod impact: closePostgres is the finalizer for DbService.Live, which runs per-request on the prod API path too (via RequestScopedServicesLiverequestScopedMiddleware's Effect.scoped). The prod change is limited to awaiting the socket teardown that was previously abandoned — no functional change, and only a small bounded latency on scope close: end({ timeout }) is a Promise.race ceiling, not a fixed wait, so an idle connection's end() terminates immediately and resolves on socket close (sub-ms); the 5s ceiling only bounds a connection still mid-query. The maxConnections change is dev/CI-stack only (prod uses Hyperdrive).
  • A small residual error rate (~4%) remained in the fixed repro at 3-concurrent synthetic load (a few PGlite contention StorageErrors under back-to-back sessions); this is well below the real e2e serial rate and no longer produces the storm/hang. Worth a follow-up if it shows up, but out of scope here.
  • Draft: I have not run the full e2e cloud shards end-to-end in CI on this branch yet.

@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented Jul 3, 2026

Copy link
Copy Markdown

Deploying with  Cloudflare Workers  Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

Status Name Latest Commit Preview URL Updated (UTC)
✅ Deployment successful!
View logs
executor-marketing 4d07bf3 Commit Preview URL

Branch Preview URL
Jul 03 2026, 10:07 AM

@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented Jul 3, 2026

Copy link
Copy Markdown

Deploying with  Cloudflare Workers  Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

Status Name Latest Commit Updated (UTC)
✅ Deployment successful!
View logs
executor-cloud 4d07bf3 Jul 03 2026, 10:08 AM

The cloud e2e stack degrades after a few minutes of sustained suite load
and eventually hangs, which is the CI cascade flake: MCP-heavy shards
(toolkits-mcp et al) run, then unrelated specs die with Playwright
TimeoutErrors because requests hang rather than erroring.

Root cause is postgres connection churn against the dev PGlite server, not
SSE frame buffering or OTel span growth. Two compounding problems:

- The dev-db PGlite socket server (apps/cloud/scripts/dev-db.ts) used the
  library default maxConnections: 1. The cloud worker opens a fresh postgres
  pool per request (the MCP auth seam rebuilds one on every /mcp request to
  respect workerd's per-request I/O rule), so under concurrent load the
  second-and-later connects were answered with "Too many connections" and an
  immediate socket close. postgres.js then reconnected in a tight loop, piling
  up thousands of half-closed sockets, starving real queries and driving
  request latency into the tens of seconds until the stack hung. PGlite runs
  queries serially anyway (its QueryQueueManager uses runExclusive), so
  allowing many connections is safe: they queue instead of storming. Raise the
  cap (overridable via DEV_DB_MAX_CONNECTIONS).

- The postgres pool finalizer (apps/cloud/src/db/db.ts) was fire-and-forget:
  Effect.runFork(sql.end({ timeout: 0 })) returned before the socket was torn
  down, so closing sockets accumulated faster than the server reaped them.
  Await a graceful end() inside the acquireRelease scope so the number of
  live-plus-closing sockets stays bounded to what is actually in flight.
  Extracted as closePostgres() and covered by a unit test.

Measured on a booted cloud stack (3 concurrent MCP clients, 1s think-time):

  baseline (maxConnections=1): ~18 sessions in 135s, avg 15s/session,
    p95 up to 50s, latency climbing, stack stalling, thousands of
    "Too many connections".
  fixed: 347 sessions in 135s, avg ~125ms/session, p95 ~150ms, flat,
    zero "Too many connections".
@RhysSullivan RhysSullivan force-pushed the fix/devserver-sse-otel-growth branch from 4d88370 to df86ec5 Compare July 3, 2026 07:10
@github-actions

github-actions Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

Cloudflare preview

Torn down — the PR is closed.

@pkg-pr-new

pkg-pr-new Bot commented Jul 3, 2026

Copy link
Copy Markdown

Open in StackBlitz

@executor-js/cli

npm i https://pkg.pr.new/@executor-js/cli@1280

@executor-js/config

npm i https://pkg.pr.new/@executor-js/config@1280

@executor-js/execution

npm i https://pkg.pr.new/@executor-js/execution@1280

@executor-js/sdk

npm i https://pkg.pr.new/@executor-js/sdk@1280

@executor-js/codemode-core

npm i https://pkg.pr.new/@executor-js/codemode-core@1280

@executor-js/runtime-quickjs

npm i https://pkg.pr.new/@executor-js/runtime-quickjs@1280

@executor-js/plugin-file-secrets

npm i https://pkg.pr.new/@executor-js/plugin-file-secrets@1280

@executor-js/plugin-graphql

npm i https://pkg.pr.new/@executor-js/plugin-graphql@1280

@executor-js/plugin-keychain

npm i https://pkg.pr.new/@executor-js/plugin-keychain@1280

@executor-js/plugin-mcp

npm i https://pkg.pr.new/@executor-js/plugin-mcp@1280

@executor-js/plugin-onepassword

npm i https://pkg.pr.new/@executor-js/plugin-onepassword@1280

@executor-js/plugin-openapi

npm i https://pkg.pr.new/@executor-js/plugin-openapi@1280

executor

npm i https://pkg.pr.new/executor@1280

commit: 4d07bf3

With the dev-db now accepting concurrent connections, three e2e specs kept
failing individually (one per shard, through all retries): policies-ui,
mcp-catalog-sync-ui, and oauth-callback-org-scope. Their failure artifacts all
showed the same server-side signature: random 500s on /api/tools,
/api/connections, /api/oauth/clients ("Failed to load tools" in the UI), backed
by PostgresError: bind message supplies N parameters, but prepared statement ""
requires M.

Root cause: pglite-socket 0.1.4's QueryQueueManager enqueues each postgres wire
FRAME (Parse, Bind, Execute, Sync) as its own queue entry against the single
shared PGlite session. Two concurrent connections' extended-protocol pipelines
interleave: A's Parse (5 params), B's Parse (1 param), then A's Bind hits B's
unnamed statement. Whichever request loses the race 500s. A standalone repro
(6 clients alternating 1-param and 5-param unprepared queries, the exact
drizzle/postgres-js shape) corrupted 34.7% of queries.

Patch (patches/@electric-sql%2Fpglite-socket@0.1.4.patch):
- batch all frames parsed from one socket data event into ONE queue entry, so
  a client's Parse..Sync executes atomically under runExclusive;
- add pipeline affinity: while a handler's batch ended mid-pipeline (last frame
  not Sync/Query/Terminate), only that handler's queue entries are scheduled,
  mirroring the existing transaction affinity; released on detach with a
  synthesized Sync so an abandoned pipeline can't wedge the queue.

Same repro after the patch: 0% corrupted (106k queries clean). Regression test:
apps/cloud/src/db/dev-db-socket-concurrency.node.test.ts (fails on stock 0.1.4,
verified).

Also fixes the remaining race in cloud/oauth-callback-org-scope.test.ts, which
had a second, spec-local bug: it deep-linked to ?addAccount=1 while the session
cookie still pointed at org B, so the URL bounced (org A -> org B -> org A via
OrgSlugGate canonicalization) while the auto-opened modal fetched
/api/oauth/clients under the transient org-B scope, got an empty app list, and
left "Connect with OAuth" disabled forever. The spec now lands on the
integration page without the deep link, waits for org A's shell to settle,
opens the modal explicitly, and waits for the button to be enabled before
clicking. The pre-armed waitForRequest is also marked handled so a failed click
can no longer leak it as an unhandled rejection.

All three specs pass together locally against the patched stack.
A client that stalls mid-pipeline (Parse sent, no Sync) while keeping its
socket open would hold the query queue's handler affinity forever and starve
every other connection: affinity only releases on detach, and detach needs
close/error/idle-timeout. The dev-db previously disabled the idle timeout
(pglite-socket defaults to 0). Unreachable for postgres.js in practice (a
crashed client closes its socket), but a 30s idle timeout (env-overridable via
DEV_DB_IDLE_TIMEOUT_MS) makes the starvation impossible instead of merely
unlikely. The timer resets on every data event, so only a dead client trips it.
The e2e harness claimed a port block by connect-probing each candidate and
only binding the real services seconds later, a time-of-check/time-of-use gap.
Worse, the 42000-45999 block range sits entirely inside Linux's default
ephemeral range (32768-60999), so any outbound TCP the booting stack makes (DB
dials, emulator warmups, telemetry) can be assigned a claimed port as its local
port between probe and bind, EADDRINUSE-ing global setup before any test runs
with no other suite involved. A connect-probe can't even see such a squatter
since nothing is listening on it.

Two layers:

- claimPorts now BIND-probes each candidate (holding a listener on every port
  until the whole block binds, then closing them the instant before services
  take over) instead of connect-probing. Binds both 0.0.0.0 and :: so a
  same-port squatter on either family is caught (Node's default listen binds
  only ::, and an IPv4 listener coexists with it). This shrinks the race to
  microseconds and detects ephemeral-socket squatters.

- A bind-probe can't fully close the window, so add claimAndBoot: a bounded
  (3-attempt) retry wrapping claim + boot that, on EADDRINUSE, releases the
  block and re-claims the next one before retrying. Non-EADDRINUSE boot
  failures rethrow immediately. All four server-booting global setups (cloud,
  selfhost, cloudflare, selfhost-docker) route through it.

The 42000-45999 range is kept: it's fine for local macOS dev and moving it
churns every consumer. The bind-probe plus retry are what make it safe on CI.
The bind-probe added for the port-claim race bound each candidate port on
0.0.0.0 and then on a plain ::. On Linux a plain :: listen is dual-stack
(ipv6Only defaults to false), so it also claims the IPv4 side of the port and
EADDRINUSEs against our own just-bound 0.0.0.0 probe. Every block therefore
looked squatted, claimPorts walked all 400 blocks, and every e2e globalsetup
died with "range is exhausted". macOS's BSD semantics let that pair coexist,
which is why it passed local smoke testing.

Probe shape now, verified empirically on both macOS and Linux (node/bun in a
container):

- Connect-probe 127.0.0.1 and ::1 first: on macOS a specific-address listener
  (a leaked vite dev server) coexists with a wildcard bind, so only a connect
  can see it. Runs before our own wildcard binds would answer the loopback.
- Bind 0.0.0.0, then :: with ipv6Only: true. The v4 side and the v6-only side
  never overlap, so the pair coexists on both platforms while still catching
  squatters on either family, including ephemeral outbound sockets.

Smoke-tested in a Linux container: a clean environment claims its preferred
block with no walk (the shipped failure mode), service-shaped binds
(dual-stack :: and 127.0.0.1) succeed after the probes release, and squatters
on 0.0.0.0, dual-stack ::, and 127.0.0.1 are each detected.
The scenario landed on "/" with waitUntil networkidle and immediately read the
org slug from the pathname. networkidle proves the network went quiet, not
that the client-side canonicalization redirect ran: on a slow CI runner the
pathname is still "/" at that point, so split("/").filter(Boolean)[0] is
undefined and the test navigates to the literal /undefined/billing/plans. The
billing fetches fire under that bogus slug and cache their failure (autumn-js
staleTime 60s, no refetch on focus); the shell then canonicalizes the URL
client-side but nothing refetches, so the plans grid stays empty and the
"Start free trial" waitFor times out through every retry. The run artifacts
show exactly this: timeline.json records the /undefined/billing/plans
navigation and failure.png shows the plans page with an empty grid.

Wait for the URL to canonicalize onto the slug pattern before reading it, the
same barrier member-invite-seat-limit already uses.
@RhysSullivan RhysSullivan marked this pull request as ready for review July 3, 2026 10:17
@RhysSullivan RhysSullivan merged commit d74dc55 into main Jul 3, 2026
25 checks passed
@greptile-apps

greptile-apps Bot commented Jul 3, 2026

Copy link
Copy Markdown

Greptile Summary

This PR fixes the CI cascade flake where the cloud dev stack degraded under sustained MCP load by addressing two compounding root causes: the dev PGlite server's maxConnections defaulting to 1 (causing a reconnect storm under concurrent load), and the postgres pool finalizer being fire-and-forget (abandoned sockets piling up faster than the server reaped them).

  • DB layer (db.ts, dev-db.ts): Extracts closePostgres that awaits sql.end({ timeout: 5 }) instead of forking and forgetting, bounding live+closing sockets to what is actually in flight; the dev server now allows up to 1000 concurrent connections with an idle-timeout backstop.
  • pglite-socket patch: Fixes extended-query-protocol interleaving under multiple connections by batching all frames from one TCP data event into a single queue entry and adding per-handler pipeline affinity.
  • e2e port claiming: Replaces the old connect-probe with a bind-probe and adds claimAndBoot which retries on EADDRINUSE; also fixes two test-race bugs in the billing and OAuth e2e specs.

Confidence Score: 4/5

The two core fixes (awaited pool finalizer and raised maxConnections) are targeted, well-tested, and safe to merge.

The DB fixes and e2e race fixes are correct and well-characterized by tests. Three non-blocking issues were found: the concurrency regression test hardcodes a port inside the e2e claiming range, Number() coercion of DEV_DB_MAX_CONNECTIONS silently produces NaN on a bad value, and throw lastError after the claimAndBoot loop is unreachable dead code. None affect production paths.

apps/cloud/src/db/dev-db-socket-concurrency.node.test.ts (hardcoded port 45998 within the e2e claiming range) and apps/cloud/scripts/dev-db.ts (NaN guard on env-var coercion).

Important Files Changed

Filename Overview
apps/cloud/src/db/db.ts Extracts closePostgres with an awaited sql.end({ timeout: 5 }) replacing the fire-and-forget finalizer; exports POSTGRES_END_TIMEOUT_SECONDS for test visibility. Clean and well-documented.
apps/cloud/src/db/db.close.test.ts New unit tests for closePostgres contract: verifies non-zero drain window, microtask-level await ordering, real-clock await, and error swallowing against a fake sql object.
apps/cloud/scripts/dev-db.ts Sets maxConnections: 1000 and idleTimeout: 30_000 on the dev PGLiteSocketServer; Number() coercion of env vars can silently produce NaN, disabling the connection cap.
apps/cloud/src/db/dev-db-socket-concurrency.node.test.ts Regression test for the PGlite protocol-interleaving bug; hardcodes PORT 45998 which falls inside the e2e port claiming range (block 399, offset 8).
e2e/src/ports.ts Adds bindProbe and claimAndBoot; throw lastError after the loop is unreachable dead code since the last iteration always throws from inside the catch.
patches/@electric-sql%2Fpglite-socket@0.1.4.patch Patches pglite-socket's minified bundle to add frame batching and pipeline affinity, preventing extended-query-protocol interleaving under concurrent connections.
e2e/setup/cloud.globalsetup.ts Refactored to use claimAndBoot; release() is now owned by the claimAndBoot teardown.
e2e/cloud/billing-trial-checkout-stale.test.ts Adds waitForURL after networkidle to ensure the client-side redirect onto the org slug completes before reading the URL.
e2e/cloud/oauth-callback-org-scope.test.ts Fixes org-scope flake by waiting for org A's shell to settle before opening the modal, and pre-arms waitForRequest before clicking.

Reviews (1): Last reviewed commit: "Fix slug-read race in the billing trial-..." | Re-trigger Greptile

import { PGLiteSocketServer } from "@electric-sql/pglite-socket";
import postgres from "postgres";

const PORT = 45998;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Hardcoded port within e2e claiming range

PORT = 45998 maps to offset 8 of block 399 in the 42000–45999 e2e claiming range. If this test runs while the e2e globalsetup happens to walk to block 399 and claim that port, the server.listen() call on 127.0.0.1:45998 races with the e2e service binding. The new bindProbe would detect the unit-test server via the connect-probe and skip block 399, but the inverse — e2e claims the block first, releases the bind probes, and then the test tries to bind — would surface as a spurious test failure. Picking a port outside the claiming range (e.g. ≥46000 or <42000) removes the ambiguity entirely.

Comment thread e2e/src/ports.ts
);
}
}
throw lastError;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Unreachable throw lastError

The for loop's catch block always throws on the final attempt: !isAddrInUse(error) || attempt === maxAttempts is true on attempt maxAttempts regardless of error type. So the loop exits via throw inside the catch rather than by the loop condition, and the post-loop throw lastError is dead code. lastError could also theoretically be undefined if no iteration populates it (though that path is also unreachable). Removing this line makes the control flow explicit.

Comment on lines +113 to +119
maxConnections: Number(process.env.DEV_DB_MAX_CONNECTIONS ?? 1000),
// Backstop for pipeline affinity: a client that stalls mid-pipeline (Parse
// sent, no Sync) with its socket still OPEN would hold the queue's handler
// affinity forever and starve every other connection, since affinity only
// releases on detach and detach needs close/error/idle-timeout. In ms; the
// timer resets on every data event, so only a genuinely dead client trips it.
idleTimeout: Number(process.env.DEV_DB_IDLE_TIMEOUT_MS ?? 30_000),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Number() returns NaN for a non-numeric string, which causes this.handlers.size >= NaN to always be false, silently removing the connection cap. Using Number.isFinite with a fallback gives a clearly bounded result.

Suggested change
maxConnections: Number(process.env.DEV_DB_MAX_CONNECTIONS ?? 1000),
// Backstop for pipeline affinity: a client that stalls mid-pipeline (Parse
// sent, no Sync) with its socket still OPEN would hold the queue's handler
// affinity forever and starve every other connection, since affinity only
// releases on detach and detach needs close/error/idle-timeout. In ms; the
// timer resets on every data event, so only a genuinely dead client trips it.
idleTimeout: Number(process.env.DEV_DB_IDLE_TIMEOUT_MS ?? 30_000),
maxConnections: Number.isFinite(Number(process.env.DEV_DB_MAX_CONNECTIONS))
? Number(process.env.DEV_DB_MAX_CONNECTIONS)
: 1000,
// Backstop for pipeline affinity: a client that stalls mid-pipeline (Parse
// sent, no Sync) with its socket still OPEN would hold the queue's handler
// affinity forever and starve every other connection, since affinity only
// releases on detach and detach needs close/error/idle-timeout. In ms; the
// timer resets on every data event, so only a genuinely dead client trips it.
idleTimeout: Number.isFinite(Number(process.env.DEV_DB_IDLE_TIMEOUT_MS))
? Number(process.env.DEV_DB_IDLE_TIMEOUT_MS)
: 30_000,

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant