Fix cloud dev-stack degradation under sustained MCP load#1280
Conversation
Deploying with
|
| Status | Name | Latest Commit | Preview URL | Updated (UTC) |
|---|---|---|---|---|
| ✅ Deployment successful! View logs |
executor-marketing | 4d07bf3 | Commit Preview URL Branch Preview URL |
Jul 03 2026, 10:07 AM |
Deploying with
|
| Status | Name | Latest Commit | Updated (UTC) |
|---|---|---|---|
| ✅ Deployment successful! View logs |
executor-cloud | 4d07bf3 | Jul 03 2026, 10:08 AM |
The cloud e2e stack degrades after a few minutes of sustained suite load
and eventually hangs, which is the CI cascade flake: MCP-heavy shards
(toolkits-mcp et al) run, then unrelated specs die with Playwright
TimeoutErrors because requests hang rather than erroring.
Root cause is postgres connection churn against the dev PGlite server, not
SSE frame buffering or OTel span growth. Two compounding problems:
- The dev-db PGlite socket server (apps/cloud/scripts/dev-db.ts) used the
library default maxConnections: 1. The cloud worker opens a fresh postgres
pool per request (the MCP auth seam rebuilds one on every /mcp request to
respect workerd's per-request I/O rule), so under concurrent load the
second-and-later connects were answered with "Too many connections" and an
immediate socket close. postgres.js then reconnected in a tight loop, piling
up thousands of half-closed sockets, starving real queries and driving
request latency into the tens of seconds until the stack hung. PGlite runs
queries serially anyway (its QueryQueueManager uses runExclusive), so
allowing many connections is safe: they queue instead of storming. Raise the
cap (overridable via DEV_DB_MAX_CONNECTIONS).
- The postgres pool finalizer (apps/cloud/src/db/db.ts) was fire-and-forget:
Effect.runFork(sql.end({ timeout: 0 })) returned before the socket was torn
down, so closing sockets accumulated faster than the server reaped them.
Await a graceful end() inside the acquireRelease scope so the number of
live-plus-closing sockets stays bounded to what is actually in flight.
Extracted as closePostgres() and covered by a unit test.
Measured on a booted cloud stack (3 concurrent MCP clients, 1s think-time):
baseline (maxConnections=1): ~18 sessions in 135s, avg 15s/session,
p95 up to 50s, latency climbing, stack stalling, thousands of
"Too many connections".
fixed: 347 sessions in 135s, avg ~125ms/session, p95 ~150ms, flat,
zero "Too many connections".
4d88370 to
df86ec5
Compare
Cloudflare previewTorn down — the PR is closed. |
@executor-js/cli
@executor-js/config
@executor-js/execution
@executor-js/sdk
@executor-js/codemode-core
@executor-js/runtime-quickjs
@executor-js/plugin-file-secrets
@executor-js/plugin-graphql
@executor-js/plugin-keychain
@executor-js/plugin-mcp
@executor-js/plugin-onepassword
@executor-js/plugin-openapi
executor
commit: |
With the dev-db now accepting concurrent connections, three e2e specs kept
failing individually (one per shard, through all retries): policies-ui,
mcp-catalog-sync-ui, and oauth-callback-org-scope. Their failure artifacts all
showed the same server-side signature: random 500s on /api/tools,
/api/connections, /api/oauth/clients ("Failed to load tools" in the UI), backed
by PostgresError: bind message supplies N parameters, but prepared statement ""
requires M.
Root cause: pglite-socket 0.1.4's QueryQueueManager enqueues each postgres wire
FRAME (Parse, Bind, Execute, Sync) as its own queue entry against the single
shared PGlite session. Two concurrent connections' extended-protocol pipelines
interleave: A's Parse (5 params), B's Parse (1 param), then A's Bind hits B's
unnamed statement. Whichever request loses the race 500s. A standalone repro
(6 clients alternating 1-param and 5-param unprepared queries, the exact
drizzle/postgres-js shape) corrupted 34.7% of queries.
Patch (patches/@electric-sql%2Fpglite-socket@0.1.4.patch):
- batch all frames parsed from one socket data event into ONE queue entry, so
a client's Parse..Sync executes atomically under runExclusive;
- add pipeline affinity: while a handler's batch ended mid-pipeline (last frame
not Sync/Query/Terminate), only that handler's queue entries are scheduled,
mirroring the existing transaction affinity; released on detach with a
synthesized Sync so an abandoned pipeline can't wedge the queue.
Same repro after the patch: 0% corrupted (106k queries clean). Regression test:
apps/cloud/src/db/dev-db-socket-concurrency.node.test.ts (fails on stock 0.1.4,
verified).
Also fixes the remaining race in cloud/oauth-callback-org-scope.test.ts, which
had a second, spec-local bug: it deep-linked to ?addAccount=1 while the session
cookie still pointed at org B, so the URL bounced (org A -> org B -> org A via
OrgSlugGate canonicalization) while the auto-opened modal fetched
/api/oauth/clients under the transient org-B scope, got an empty app list, and
left "Connect with OAuth" disabled forever. The spec now lands on the
integration page without the deep link, waits for org A's shell to settle,
opens the modal explicitly, and waits for the button to be enabled before
clicking. The pre-armed waitForRequest is also marked handled so a failed click
can no longer leak it as an unhandled rejection.
All three specs pass together locally against the patched stack.
A client that stalls mid-pipeline (Parse sent, no Sync) while keeping its socket open would hold the query queue's handler affinity forever and starve every other connection: affinity only releases on detach, and detach needs close/error/idle-timeout. The dev-db previously disabled the idle timeout (pglite-socket defaults to 0). Unreachable for postgres.js in practice (a crashed client closes its socket), but a 30s idle timeout (env-overridable via DEV_DB_IDLE_TIMEOUT_MS) makes the starvation impossible instead of merely unlikely. The timer resets on every data event, so only a dead client trips it.
The e2e harness claimed a port block by connect-probing each candidate and only binding the real services seconds later, a time-of-check/time-of-use gap. Worse, the 42000-45999 block range sits entirely inside Linux's default ephemeral range (32768-60999), so any outbound TCP the booting stack makes (DB dials, emulator warmups, telemetry) can be assigned a claimed port as its local port between probe and bind, EADDRINUSE-ing global setup before any test runs with no other suite involved. A connect-probe can't even see such a squatter since nothing is listening on it. Two layers: - claimPorts now BIND-probes each candidate (holding a listener on every port until the whole block binds, then closing them the instant before services take over) instead of connect-probing. Binds both 0.0.0.0 and :: so a same-port squatter on either family is caught (Node's default listen binds only ::, and an IPv4 listener coexists with it). This shrinks the race to microseconds and detects ephemeral-socket squatters. - A bind-probe can't fully close the window, so add claimAndBoot: a bounded (3-attempt) retry wrapping claim + boot that, on EADDRINUSE, releases the block and re-claims the next one before retrying. Non-EADDRINUSE boot failures rethrow immediately. All four server-booting global setups (cloud, selfhost, cloudflare, selfhost-docker) route through it. The 42000-45999 range is kept: it's fine for local macOS dev and moving it churns every consumer. The bind-probe plus retry are what make it safe on CI.
The bind-probe added for the port-claim race bound each candidate port on 0.0.0.0 and then on a plain ::. On Linux a plain :: listen is dual-stack (ipv6Only defaults to false), so it also claims the IPv4 side of the port and EADDRINUSEs against our own just-bound 0.0.0.0 probe. Every block therefore looked squatted, claimPorts walked all 400 blocks, and every e2e globalsetup died with "range is exhausted". macOS's BSD semantics let that pair coexist, which is why it passed local smoke testing. Probe shape now, verified empirically on both macOS and Linux (node/bun in a container): - Connect-probe 127.0.0.1 and ::1 first: on macOS a specific-address listener (a leaked vite dev server) coexists with a wildcard bind, so only a connect can see it. Runs before our own wildcard binds would answer the loopback. - Bind 0.0.0.0, then :: with ipv6Only: true. The v4 side and the v6-only side never overlap, so the pair coexists on both platforms while still catching squatters on either family, including ephemeral outbound sockets. Smoke-tested in a Linux container: a clean environment claims its preferred block with no walk (the shipped failure mode), service-shaped binds (dual-stack :: and 127.0.0.1) succeed after the probes release, and squatters on 0.0.0.0, dual-stack ::, and 127.0.0.1 are each detected.
The scenario landed on "/" with waitUntil networkidle and immediately read the
org slug from the pathname. networkidle proves the network went quiet, not
that the client-side canonicalization redirect ran: on a slow CI runner the
pathname is still "/" at that point, so split("/").filter(Boolean)[0] is
undefined and the test navigates to the literal /undefined/billing/plans. The
billing fetches fire under that bogus slug and cache their failure (autumn-js
staleTime 60s, no refetch on focus); the shell then canonicalizes the URL
client-side but nothing refetches, so the plans grid stays empty and the
"Start free trial" waitFor times out through every retry. The run artifacts
show exactly this: timeline.json records the /undefined/billing/plans
navigation and failure.png shows the plans page with an empty grid.
Wait for the URL to canonicalize onto the slug pattern before reading it, the
same barrier member-invite-seat-limit already uses.
Greptile SummaryThis PR fixes the CI cascade flake where the cloud dev stack degraded under sustained MCP load by addressing two compounding root causes: the dev PGlite server's
Confidence Score: 4/5The two core fixes (awaited pool finalizer and raised maxConnections) are targeted, well-tested, and safe to merge. The DB fixes and e2e race fixes are correct and well-characterized by tests. Three non-blocking issues were found: the concurrency regression test hardcodes a port inside the e2e claiming range, Number() coercion of DEV_DB_MAX_CONNECTIONS silently produces NaN on a bad value, and throw lastError after the claimAndBoot loop is unreachable dead code. None affect production paths. apps/cloud/src/db/dev-db-socket-concurrency.node.test.ts (hardcoded port 45998 within the e2e claiming range) and apps/cloud/scripts/dev-db.ts (NaN guard on env-var coercion). Important Files Changed
Reviews (1): Last reviewed commit: "Fix slug-read race in the billing trial-..." | Re-trigger Greptile |
| import { PGLiteSocketServer } from "@electric-sql/pglite-socket"; | ||
| import postgres from "postgres"; | ||
|
|
||
| const PORT = 45998; |
There was a problem hiding this comment.
Hardcoded port within e2e claiming range
PORT = 45998 maps to offset 8 of block 399 in the 42000–45999 e2e claiming range. If this test runs while the e2e globalsetup happens to walk to block 399 and claim that port, the server.listen() call on 127.0.0.1:45998 races with the e2e service binding. The new bindProbe would detect the unit-test server via the connect-probe and skip block 399, but the inverse — e2e claims the block first, releases the bind probes, and then the test tries to bind — would surface as a spurious test failure. Picking a port outside the claiming range (e.g. ≥46000 or <42000) removes the ambiguity entirely.
| ); | ||
| } | ||
| } | ||
| throw lastError; |
There was a problem hiding this comment.
The for loop's catch block always throws on the final attempt: !isAddrInUse(error) || attempt === maxAttempts is true on attempt maxAttempts regardless of error type. So the loop exits via throw inside the catch rather than by the loop condition, and the post-loop throw lastError is dead code. lastError could also theoretically be undefined if no iteration populates it (though that path is also unreachable). Removing this line makes the control flow explicit.
| maxConnections: Number(process.env.DEV_DB_MAX_CONNECTIONS ?? 1000), | ||
| // Backstop for pipeline affinity: a client that stalls mid-pipeline (Parse | ||
| // sent, no Sync) with its socket still OPEN would hold the queue's handler | ||
| // affinity forever and starve every other connection, since affinity only | ||
| // releases on detach and detach needs close/error/idle-timeout. In ms; the | ||
| // timer resets on every data event, so only a genuinely dead client trips it. | ||
| idleTimeout: Number(process.env.DEV_DB_IDLE_TIMEOUT_MS ?? 30_000), |
There was a problem hiding this comment.
Number() returns NaN for a non-numeric string, which causes this.handlers.size >= NaN to always be false, silently removing the connection cap. Using Number.isFinite with a fallback gives a clearly bounded result.
| maxConnections: Number(process.env.DEV_DB_MAX_CONNECTIONS ?? 1000), | |
| // Backstop for pipeline affinity: a client that stalls mid-pipeline (Parse | |
| // sent, no Sync) with its socket still OPEN would hold the queue's handler | |
| // affinity forever and starve every other connection, since affinity only | |
| // releases on detach and detach needs close/error/idle-timeout. In ms; the | |
| // timer resets on every data event, so only a genuinely dead client trips it. | |
| idleTimeout: Number(process.env.DEV_DB_IDLE_TIMEOUT_MS ?? 30_000), | |
| maxConnections: Number.isFinite(Number(process.env.DEV_DB_MAX_CONNECTIONS)) | |
| ? Number(process.env.DEV_DB_MAX_CONNECTIONS) | |
| : 1000, | |
| // Backstop for pipeline affinity: a client that stalls mid-pipeline (Parse | |
| // sent, no Sync) with its socket still OPEN would hold the queue's handler | |
| // affinity forever and starve every other connection, since affinity only | |
| // releases on detach and detach needs close/error/idle-timeout. In ms; the | |
| // timer resets on every data event, so only a genuinely dead client trips it. | |
| idleTimeout: Number.isFinite(Number(process.env.DEV_DB_IDLE_TIMEOUT_MS)) | |
| ? Number(process.env.DEV_DB_IDLE_TIMEOUT_MS) | |
| : 30_000, |
Summary
The cloud e2e dev stack degrades after a few minutes of sustained suite load and eventually hangs. This is the CI cascade flake: MCP-heavy shards (
toolkits-mcpand friends) run, then unrelated specs die with PlaywrightTimeoutErrors because requests hang rather than erroring. Nondeterministic, worse on 2-core runners.Root cause
The degradation is postgres connection churn against the dev PGlite server, not SSE frame buffering or OTel span growth. The prior work (SSE forwarder cap in agents@0.17.3, per-isolate mem counters) hardened real but separate paths; the remaining growth was in the DB layer. Two compounding problems:
apps/cloud/scripts/dev-db.tsused the defaultmaxConnections: 1.PGLiteSocketServerdefaults to one connection and answers every extra concurrent connection with"Too many connections"+ an immediate socket close. (pglite-socket 0.1.4's publishedindex.d.tsdocuments "default: 100", but the shipped runtime JS ismaxConnections ?? 1— verified in the shipped chunk.) The cloud worker opens a fresh postgres pool per request — the MCP auth seam (apps/cloud/src/mcp/auth.tsmakeMcpOrganizationAuthServices) rebuilds one on every/mcprequest to respect workerd's per-request I/O rule. Under concurrent load (exactly what the e2e suite generates against one shared stack), the second-and-later connects were rejected, andpostgres.jsreconnected in a tight loop against the closed socket. That reconnect storm piled up thousands of half-closed sockets, starved real queries, drove per-request latency into the tens of seconds, and hung the stack. PGlite runs queries serially anyway (itsQueryQueueManagerusesrunExclusive), so allowing many connections is safe — they queue instead of storming.The postgres pool finalizer in
apps/cloud/src/db/db.tswas fire-and-forget.Effect.runFork(sql.end({ timeout: 0 }))returned before the socket was torn down, so closing sockets accumulated faster than the server reaped them. Awaiting a gracefulend()inside theacquireReleasescope bounds the live-plus-closing socket count to what is actually in flight. (Runs in the request's own scope, so still respects workerd's per-request I/O rule.) Make the test suite parallel-safe and speed up CI #1273 on main independently made this finalizer awaited but kepttimeout: 0; this branch extends it with a graceful drain ceiling and extracts it as a testedclosePostgres().Why the SSE/OTel hypothesis was off: the cloud e2e stack's OTel
BatchSpanProcessoris bounded (maxQueueSize: 2048) and flushed per request; the SSE forwarder is already capped by PR #1257's patch. Neither grows unbounded. The unbounded resource was file descriptors / sockets against PGlite, driven by the per-request DB pool pattern hitting a single-connection dev server.Evidence
Reproduced on a booted cloud stack driven by sustained MCP session churn (connect → initialize → list-tools → close). Representative apples-to-apples run at a realistic e2e-like rate (3 concurrent clients, 1s think-time between sessions, ~135s):
maxConnections=1)"Too many connections"Baseline stalls (session count flat, latency climbing, oscillating hangs) — the CI symptom. Fixed is flat and sustained.
Changes
apps/cloud/scripts/dev-db.ts— setmaxConnectionson the dev-dbPGLiteSocketServer(default 1000, overridable viaDEV_DB_MAX_CONNECTIONS).apps/cloud/src/db/db.ts— extractclosePostgres()that awaits a gracefulsql.end()(short drain ceiling, nottimeout: 0); use it as the pool finalizer.apps/cloud/src/db/db.close.test.ts— new unit tests characterizing the finalizer contract the old fire-and-forget close violated: non-zero drain window, awaits teardown across microtask AND real wall-clock delays, and teardown errors don't fail the request scope. Fast, no live DB.Verification
bun run typecheck(cloud): clean.oxlinton changed files: clean.apps/cloud/src/db/*tests: 22 passed (incl. the new 4).Notes / unconfirmed
closePostgresis the finalizer forDbService.Live, which runs per-request on the prod API path too (viaRequestScopedServicesLive→requestScopedMiddleware'sEffect.scoped). The prod change is limited to awaiting the socket teardown that was previously abandoned — no functional change, and only a small bounded latency on scope close:end({ timeout })is aPromise.raceceiling, not a fixed wait, so an idle connection'send()terminates immediately and resolves on socket close (sub-ms); the 5s ceiling only bounds a connection still mid-query. ThemaxConnectionschange is dev/CI-stack only (prod uses Hyperdrive).StorageErrors under back-to-back sessions); this is well below the real e2e serial rate and no longer produces the storm/hang. Worth a follow-up if it shows up, but out of scope here.