feat(realtime): preflight schema-compatibility check on startup #4940
The socket service authorizes every connection with a full-row query against the workflow table. When a deploy ships a realtime image whose compiled schema is ahead of or behind the live DB (e.g. a column dropped by a migration the image predates), that query fails on every request and silently breaks persistence. The process nevertheless stays up and the shallow /health probe keeps returning 200, so the deploy looks healthy while serving nothing.

Run one representative workflow query before listen(): a schema mismatch throws, propagates to the entrypoint, and the task exits non-zero and never goes healthy, so CodeDeploy rolls back automatically instead of shifting traffic onto broken tasks. Schema-class errors (undefined column/table/function) fail fast; connection-class errors retry with backoff so a cold DB at boot does not flap.

The check runs once at startup, never on the per-probe LB health check, so that a DB blip cannot mass-terminate the fleet (cascading failure).
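For readers skimming the thread, here is a minimal sketch of the fail-fast versus retry split described above. It is not the PR's code: `runPreflight`, `PreflightDeps`, `sqlState`, and the plain exponential delay are hypothetical names and choices made for illustration, and the query and sleep functions are injected so the sketch runs without a database or real timers. It also only inspects the top-level error's `code`.

```typescript
// Hypothetical sketch; names and delay formula are assumptions, not the PR's source.

// PostgreSQL SQLSTATE codes treated as schema-class errors:
// 42703 undefined_column, 42P01 undefined_table, 42883 undefined_function.
const SCHEMA_SQLSTATES = new Set(["42703", "42P01", "42883"]);

interface PreflightDeps {
  // Runs one representative query; rejects if the DB or schema is unusable.
  query: () => Promise<unknown>;
  // Waits for the given number of milliseconds (injected for testability).
  sleep: (ms: number) => Promise<void>;
  maxAttempts: number;
  baseDelayMs: number;
}

// Reads a string `code` property from an unknown error value, if present.
function sqlState(err: unknown): string | undefined {
  if (typeof err !== "object" || err === null) return undefined;
  const code = (err as { code?: unknown }).code;
  return typeof code === "string" ? code : undefined;
}

async function runPreflight(deps: PreflightDeps): Promise<void> {
  for (let attempt = 1; attempt <= deps.maxAttempts; attempt++) {
    try {
      await deps.query();
      return; // Schema looks compatible; the caller may now call listen().
    } catch (err) {
      const code = sqlState(err);
      if (code !== undefined && SCHEMA_SQLSTATES.has(code)) {
        // Retrying cannot repair a schema mismatch, so fail immediately.
        throw new Error("realtime image is incompatible with live database schema");
      }
      // Anything else is treated as connection-class: back off and retry,
      // but do not sleep after the final attempt.
      if (attempt < deps.maxAttempts) {
        await deps.sleep(deps.baseDelayMs * 2 ** (attempt - 1));
      }
    }
  }
  throw new Error("database unreachable after " + deps.maxAttempts + " attempts");
}
```

In an entrypoint, the idea would be to await this before calling listen() and let a rejection exit the process non-zero, so the task never reports healthy.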
PR Summary
Medium Risk
Overview
Vitest coverage exercises success, fail-fast mismatch (column/table and wrapped cause), retry-then-success, and exhausted retries.
Reviewed by Cursor Bugbot for commit 8967649.
Greptile Summary
This PR adds a boot-time schema-compatibility preflight check to the realtime socket server. It runs one representative Drizzle query against the
Confidence Score: 5/5
Safe to merge: the preflight logic is correct, the retry loop no longer sleeps after the final attempt, and tests assert sleep call counts on both retry paths.
The three files changed are entirely additive (a new module, its tests, and a single call-site). The core logic (schema-class vs. connection-class error classification, cause-chain walking, retry loop with break-before-sleep on the last attempt) is all sound and well-covered by tests. No existing behaviour is modified; the only risk is the preflight itself throwing unexpectedly, which is handled by the top-level
No files require special attention.
Important Files Changed
Sequence Diagram
sequenceDiagram
participant Proc as Process
participant IO as Socket.IO + RoomMgr
participant PF as assertSchemaCompatibility
participant DB as PostgreSQL
Proc->>IO: createSocketIOServer + createRoomManager
Proc->>IO: io.on('connection', handler)
Proc->>PF: await assertSchemaCompatibility()
loop up to MAX_CONNECT_ATTEMPTS (5)
PF->>DB: SELECT col1,col2,...,colN FROM workflow LIMIT 1
alt Schema mismatch (42703 / 42P01 / 42883)
DB-->>PF: Error (SQLSTATE code)
PF-->>Proc: throw "incompatible with live database"
Proc-->>Proc: process.exit(1) → CodeDeploy rollback
else Connection error
DB-->>PF: Error (ECONNREFUSED / timeout)
PF->>PF: sleep(backoffWithJitter)
else Success
DB-->>PF: []
PF-->>Proc: return
end
end
alt All attempts exhausted
PF-->>Proc: throw "database unreachable"
Proc-->>Proc: process.exit(1) → CodeDeploy rollback
else Preflight passed
Proc->>Proc: httpServer.listen(PORT)
Proc->>Proc: register SIGINT/SIGTERM handlers
end
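The `sleep(backoffWithJitter)` step in the diagram could look roughly like the sketch below. The PR's actual delay formula is not shown in this thread, so everything here is an assumption: the "equal jitter" variant, the `baseMs`/`capMs` parameters, and the helper `delaysForExhaustedRun`, which exists only to show that a run of N failed attempts sleeps N - 1 times.

```typescript
// Hypothetical helper; the real formula and constants in the PR may differ.
// "Equal jitter": half of the capped exponential delay is fixed, half is random.
function backoffWithJitter(
  attempt: number, // 1-based index of the attempt that just failed
  baseMs: number,
  capMs: number,
  random: () => number = Math.random, // injectable for deterministic tests
): number {
  const exponential = Math.min(capMs, baseMs * 2 ** (attempt - 1));
  return exponential / 2 + random() * (exponential / 2);
}

// Delays slept during a run in which every attempt fails. Sleeping only
// happens between attempts, so the result has maxAttempts - 1 entries.
function delaysForExhaustedRun(
  maxAttempts: number,
  baseMs: number,
  capMs: number,
  random: () => number = Math.random,
): number[] {
  const delays: number[] = [];
  for (let attempt = 1; attempt < maxAttempts; attempt++) {
    delays.push(backoffWithJitter(attempt, baseMs, capMs, random));
  }
  return delays;
}
```

Jitter matters mostly when many tasks boot at once against a cold database: without it they would all retry on the same schedule.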
Reviews (2): Last reviewed commit: "fix(realtime): unwrap cause for schema c..."
…attempt
- isSchemaMismatch now walks the error.cause chain: drizzle wraps the driver error, so the SQLSTATE often lives on the inner cause, not the outer throw. Without this, a wrapped 42703/42P01 was retried 5x and misreported as "database unreachable" instead of failing fast.
- No longer sleeps after the final failed attempt (~6-10s of dead wait that undermined the fail-fast contract); sleep now only happens between attempts.
- Tests: assert sleep is called exactly 4 times on exhaustion, and add a wrapped-cause fail-fast case.
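To illustrate the cause-chain fix in the first bullet, here is one way such a check could be written. The function name matches the commit message, but the body is a sketch under assumptions: the depth limit and cycle guard are additions for safety in the example, and the code set is taken from the sequence diagram in this thread rather than from the PR's source.

```typescript
// Sketch only; traversal details (maxDepth, cycle guard) are assumptions.
const SCHEMA_MISMATCH_CODES = new Set(["42703", "42P01", "42883"]);

function isSchemaMismatch(err: unknown, maxDepth = 10): boolean {
  const seen = new Set<unknown>();
  let current: unknown = err;
  for (let depth = 0; depth < maxDepth; depth++) {
    // Stop on non-objects, null, or a cause chain that loops back on itself.
    if (typeof current !== "object" || current === null || seen.has(current)) {
      return false;
    }
    seen.add(current);
    const { code, cause } = current as { code?: unknown; cause?: unknown };
    if (typeof code === "string" && SCHEMA_MISMATCH_CODES.has(code)) {
      return true; // SQLSTATE found on this error or one of its wrapped causes.
    }
    current = cause; // Unwrap one level, e.g. from an ORM error to the driver error.
  }
  return false;
}
```

With a check like this, an outer error that carries no `code` of its own but wraps a driver error with code 42703 is classified as a schema mismatch instead of being retried as a connection failure.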
@greptile
@cursor review
✅ Bugbot reviewed your changes and found no new issues!
Summary
- Adds a preflight check (assertSchemaCompatibility) that runs one representative workflow query before the socket server accepts traffic
- … /health still returns 200

Context
This is the class of bug that just broke prod sockets: the production realtime task was on a pre-schema-change image still selecting workflow.color after the column was dropped, so every permission check failed and blocks stopped persisting, but the service looked healthy. Staging was fine only because it happened to run a current image. This check turns that silent, latent failure into a loud startup failure that blocks the bad deploy.

Type of Change
Testing
Unit tests added (5, passing): success, schema-mismatch fail-fast (undefined column + table), retry-then-succeed, retry-exhausted. Typecheck + lint clean.
Checklist