feat(realtime): preflight schema-compatibility check on startup#4940

Merged
waleedlatif1 merged 2 commits into staging from fix/realtime-schema-startup-check
Jun 10, 2026

@waleedlatif1

Collaborator

Summary

  • Add a boot-time preflight check (assertSchemaCompatibility) that runs one representative workflow query before the socket server accepts traffic
  • A realtime image whose compiled schema is incompatible with the live DB (e.g. a column dropped by a migration the image predates) now fails fast at startup, so CodeDeploy auto-rolls back, instead of silently breaking workflow persistence while /health still returns 200
  • Schema-class errors (undefined column/table/function) fail fast; connection-class errors retry with backoff so a cold DB at boot doesn't flap
  • Runs once at startup, never on the per-probe LB health check — a deep dependency check on every probe would let a transient DB blip mass-terminate the fleet (cascading failure)
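
As an illustration of the "retry with backoff" behaviour in the summary, here is a minimal sketch of a jittered delay calculation. The name backoffWithJitter appears in the review's sequence diagram, but the body, the 500 ms base, the 5000 ms cap, and the "equal jitter" strategy are all assumptions made for this example, not the values used in preflight.ts.

```typescript
// Hypothetical defaults; the real module may use different numbers.
const BASE_DELAY_MS = 500;
const MAX_DELAY_MS = 5000;

/**
 * Delay before the next connection attempt.
 * `attempt` is 1-based; `random` is injectable so tests can be deterministic.
 */
function backoffWithJitter(
  attempt: number,
  random: () => number = Math.random,
  baseMs: number = BASE_DELAY_MS,
  capMs: number = MAX_DELAY_MS
): number {
  // Exponential growth, clamped so a long outage does not produce huge waits.
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt - 1));
  // "Equal jitter": half the ceiling is fixed, the other half is randomised,
  // so tasks booting together do not all retry at the same instant.
  return Math.floor(ceiling / 2 + random() * (ceiling / 2));
}
```

Spreading the retries matters most when many tasks start at once against a cold database; the fixed half keeps a minimum pause between attempts.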

Context

This is the class of bug that just broke prod sockets: the production realtime task was on a pre-schema-change image still selecting workflow.color after the column was dropped, so every permission check failed and blocks stopped persisting — but the service looked healthy. Staging was fine only because it happened to run a current image. This check turns that silent, latent failure into a loud startup failure that blocks the bad deploy.

Type of Change

  • New feature (deploy-safety guard)

Testing

Unit tests added (5, passing): success, schema-mismatch fail-fast (undefined column + table), retry-then-succeed, retry-exhausted. Typecheck + lint clean.

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

The socket service authorizes every connection with a full-row query against
the workflow table. When a deploy ships a realtime image whose compiled schema
is ahead of/behind the live DB (e.g. a column dropped by a migration the image
predates), that query fails on every request and silently breaks persistence —
yet the process stays up and the shallow /health probe keeps returning 200, so
the deploy looks healthy while serving nothing.

Run one representative workflow query before listen(): a schema mismatch throws,
propagates to the entrypoint, and the task exits non-zero and never goes healthy,
so CodeDeploy auto-rolls-back instead of shifting traffic onto broken tasks.
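
The ordering described above (preflight first, listen() only on success) can be sketched as follows. This is an assumption-laden illustration rather than the code in index.ts: startServer and the Listener interface are hypothetical names, and the preflight is passed in as a function so the ordering can be exercised without a database.

```typescript
// Minimal shape of anything with a listen() method (e.g. a Node http.Server).
interface Listener {
  listen(port: number): void;
}

/**
 * Runs the preflight and only then starts accepting traffic.
 * If `preflight` rejects, the rejection propagates and listen() is never called.
 */
async function startServer(
  preflight: () => Promise<void>,
  server: Listener,
  port: number
): Promise<void> {
  await preflight();
  server.listen(port);
}

// At the entrypoint, a rejection would be turned into a non-zero exit, e.g.:
//   startServer(assertSchemaCompatibility, httpServer, PORT).catch((error) => {
//     console.error(error);
//     process.exit(1);
//   });
```

Because the task never starts listening, it never passes its health check, which is what lets the deployment tooling roll back instead of shifting traffic.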

Schema-class errors (undefined column/table/function) fail fast; connection-class
errors retry with backoff so a cold DB at boot does not flap. Runs once at
startup, never on the per-probe LB health check, to avoid a DB blip mass-
terminating the fleet (cascading failure).
@vercel

vercel Bot commented Jun 10, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
docs Skipped Skipped Jun 10, 2026 4:56am

@cursor

cursor Bot commented Jun 10, 2026


PR Summary

Medium Risk
Changes realtime startup sequencing: a failed preflight or prolonged DB outage at boot blocks the process from listening, which is intended for deploy safety but can delay or fail cold starts.

Overview
Adds a boot-time schema preflight for the realtime Socket.IO service so a deploy whose Drizzle schema does not match the live Postgres database fails before the server listens, instead of staying “healthy” on /health while workflow auth/persistence queries fail on every connection.

assertSchemaCompatibility runs a single representative workflow select … limit 1 once at startup (wired in index.ts immediately before httpServer.listen). Schema mismatch Postgres SQLSTATEs (42703, 42P01, 42883), including codes nested on error.cause, throw immediately with an explicit incompatible-schema message. Transient reachability errors retry up to five times with jittered backoff, then exit with a database-unreachable error. The check is intentionally not hooked to LB health probes to avoid fleet-wide restarts on short DB blips.
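
To make the classification step concrete, the following is a minimal sketch of how a cause-chain walk over the three SQLSTATEs named above might look. The function name isSchemaMismatch comes from the review comments, but this body, the constant names, and the depth cap are assumptions for illustration and may differ from preflight.ts.

```typescript
// SQLSTATEs listed in this PR: undefined column, undefined table, undefined function.
const SCHEMA_MISMATCH_CODES = new Set(['42703', '42P01', '42883']);

// Hypothetical guard so a cyclic cause chain cannot loop forever.
const MAX_CAUSE_DEPTH = 10;

/**
 * Returns true if the error, or any error reachable through `.cause`,
 * carries one of the schema-mismatch SQLSTATE codes.
 */
function isSchemaMismatch(error: unknown): boolean {
  let current: unknown = error;
  for (let depth = 0; depth < MAX_CAUSE_DEPTH && current != null; depth++) {
    if (typeof current !== 'object') return false;
    const { code, cause } = current as { code?: unknown; cause?: unknown };
    if (typeof code === 'string' && SCHEMA_MISMATCH_CODES.has(code)) return true;
    // An ORM may wrap the driver error, so keep looking one level down.
    current = cause;
  }
  return false;
}
```

A connection error such as ECONNREFUSED has a code too, but not one in the set, so it falls through to the retry path rather than failing fast.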

Vitest coverage exercises success, fail-fast mismatch (column/table and wrapped cause), retry-then-success, and exhausted retries.

Reviewed by Cursor Bugbot for commit 8967649.

Comment thread apps/realtime/src/database/preflight.ts
@greptile-apps

greptile-apps Bot commented Jun 10, 2026

Contributor

Greptile Summary

This PR adds a boot-time schema-compatibility preflight check to the realtime socket server. It runs one representative Drizzle query against the workflow table before the HTTP server accepts traffic; a schema-class error (SQLSTATE 42703/42P01/42883) causes an immediate startup failure and CodeDeploy rollback, while connection-class errors are retried with exponential backoff up to five attempts.

  • preflight.ts — new assertSchemaCompatibility function; isSchemaMismatch walks the full cause chain to detect Drizzle-wrapped driver errors, and the retry loop correctly breaks before sleeping on the final attempt.
  • preflight.test.ts — five unit tests covering success, two schema-mismatch variants, cause-wrapped mismatch, retry-then-succeed, and exhausted retries (including sleep-call-count assertions).
  • index.ts — single call to assertSchemaCompatibility() inserted after all in-process setup but before httpServer.listen(), so no traffic can be accepted on a bad image.

Confidence Score: 5/5

Safe to merge — the preflight logic is correct, the retry loop no longer sleeps after the final attempt, and tests assert sleep call counts on both retry paths.

The three files changed are entirely additive (a new module, its tests, and a single call-site). The core logic — schema-class vs. connection-class error classification, cause-chain walking, retry loop with break-before-sleep on the last attempt — is all sound and well-covered by tests. No existing behaviour is modified; the only risk is the preflight itself throwing unexpectedly, which is handled by the top-level .catch that exits with code 1.

No files require special attention.

Important Files Changed

Filename Overview
apps/realtime/src/database/preflight.ts New preflight module; cause-chain walk, schema vs. connection error classification, and retry loop with backoff are all correctly implemented. Sleep-before-last-attempt issue from a prior review is resolved.
apps/realtime/src/database/preflight.test.ts Five tests cover the full decision tree; sleep call counts are asserted on both retry paths. Minor: the undefined-table test doesn't assert sleep not called, but other mismatch tests already cover that invariant.
apps/realtime/src/index.ts Preflight call is placed correctly — after full in-process setup, before httpServer.listen(). SIGTERM/SIGINT handlers are still registered after listen, leaving a gap during the preflight window (flagged in a prior review, not addressed here).

Sequence Diagram

sequenceDiagram
    participant Proc as Process
    participant IO as Socket.IO + RoomMgr
    participant PF as assertSchemaCompatibility
    participant DB as PostgreSQL

    Proc->>IO: createSocketIOServer + createRoomManager
    Proc->>IO: io.on('connection', handler)
    Proc->>PF: await assertSchemaCompatibility()

    loop up to MAX_CONNECT_ATTEMPTS (5)
        PF->>DB: SELECT col1,col2,...,colN FROM workflow LIMIT 1
        alt Schema mismatch (42703 / 42P01 / 42883)
            DB-->>PF: Error (SQLSTATE code)
            PF-->>Proc: throw "incompatible with live database"
            Proc-->>Proc: process.exit(1) → CodeDeploy rollback
        else Connection error
            DB-->>PF: Error (ECONNREFUSED / timeout)
            PF->>PF: sleep(backoffWithJitter)
        else Success
            DB-->>PF: []
            PF-->>Proc: return
        end
    end

    alt All attempts exhausted
        PF-->>Proc: throw "database unreachable"
        Proc-->>Proc: process.exit(1) → CodeDeploy rollback
    else Preflight passed
        Proc->>Proc: httpServer.listen(PORT)
        Proc->>Proc: register SIGINT/SIGTERM handlers
    end
Reviews (2): Last reviewed commit: "fix(realtime): unwrap cause for schema c..."

Comment thread apps/realtime/src/database/preflight.ts
Comment thread apps/realtime/src/database/preflight.test.ts
…attempt

- isSchemaMismatch now walks the error.cause chain — drizzle wraps the driver
  error, so the SQLSTATE often lives on the inner cause, not the outer throw.
  Without this a wrapped 42703/42P01 was retried 5x and mis-reported as
  "database unreachable" instead of failing fast.
- No longer sleeps after the final failed attempt (~6-10s of dead wait that
  undermined the fail-fast contract); sleep now only happens between attempts.
- Tests: assert sleep is called exactly 4 times on exhaustion, and add a
  wrapped-cause fail-fast case.
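
The "sleep only between attempts" fix can be sketched as below. This is a simplified illustration under stated assumptions, not the merged code: retryBetweenAttempts, isFatal, and the linear placeholder delay are hypothetical, and both the attempt and the sleep are injected so the sleep count can be asserted in a test.

```typescript
type Sleep = (ms: number) => Promise<void>;

/**
 * Calls `attempt` up to `maxAttempts` times.
 * Fatal errors (e.g. a schema mismatch) are rethrown immediately;
 * other errors are retried, sleeping only BETWEEN attempts.
 */
async function retryBetweenAttempts(
  attempt: () => Promise<void>,
  sleep: Sleep,
  isFatal: (error: unknown) => boolean,
  maxAttempts: number = 5
): Promise<void> {
  let lastError: unknown;
  for (let n = 1; n <= maxAttempts; n++) {
    try {
      await attempt();
      return;
    } catch (error) {
      if (isFatal(error)) throw error; // fail fast, no retry
      lastError = error;
    }
    // Break before sleeping on the final attempt: waiting here would only
    // delay the failure without giving the database another chance.
    if (n === maxAttempts) break;
    await sleep(1000 * n); // placeholder delay, not the real backoff
  }
  throw new Error(`database unreachable after ${maxAttempts} attempts: ${String(lastError)}`);
}
```

With five attempts this sleeps four times on exhaustion, which matches the count the updated tests assert.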
@waleedlatif1

Collaborator Author

@greptile

@waleedlatif1

Collaborator Author

@cursor review

@cursor cursor Bot left a comment

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit 8967649.

@waleedlatif1 waleedlatif1 merged commit 272bad9 into staging Jun 10, 2026
14 checks passed
@waleedlatif1 waleedlatif1 deleted the fix/realtime-schema-startup-check branch June 10, 2026 05:03