feat(realtime): preflight schema-compatibility check on startup by waleedlatif1 · Pull Request #4940 · simstudioai/sim

waleedlatif1 · 2026-06-10T04:49:12Z

Summary

Add a boot-time preflight check (assertSchemaCompatibility) that runs one representative workflow query before the socket server accepts traffic
A realtime image whose compiled schema is incompatible with the live DB (e.g. a column dropped by a migration the image predates) now fails fast at startup → CodeDeploy auto-rolls-back, instead of silently breaking workflow persistence while /health still returns 200
Schema-class errors (undefined column/table/function) fail fast; connection-class errors retry with backoff so a cold DB at boot doesn't flap
Runs once at startup, never on the per-probe LB health check — a deep dependency check on every probe would let a transient DB blip mass-terminate the fleet (cascading failure)
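
The ordering described in the points above can be sketched as follows. This is a minimal illustration rather than the code from this PR: `BootSteps`, `startRealtime`, and `healthHandler` are hypothetical names, and both startup steps are injected so the ordering can be exercised without a real database or socket server.

```typescript
// Hypothetical shape of the two startup steps.
type BootSteps = {
  preflight: () => Promise<void>; // rejects when the compiled schema and the DB disagree
  listen: () => Promise<void>; // starts accepting traffic
};

// Run the deep dependency check exactly once, before accepting traffic.
// If preflight rejects, listen is never reached and the error propagates
// to the caller, which is expected to exit non-zero.
async function startRealtime({ preflight, listen }: BootSteps): Promise<void> {
  await preflight();
  await listen();
}

// Shallow liveness probe: deliberately does not touch the database, so a
// transient DB blip cannot fail every task's health check at once.
// The response body is an assumption.
function healthHandler(): { status: number; body: string } {
  return { status: 200, body: "ok" };
}
```

Keeping the database query out of `healthHandler` is the design choice the last point argues for: the deep check gates the deploy once, while the per-probe check stays cheap and independent of the DB.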

Context

This is the class of bug that just broke prod sockets: the production realtime task was on a pre-schema-change image still selecting workflow.color after the column was dropped, so every permission check failed and blocks stopped persisting — but the service looked healthy. Staging was fine only because it happened to run a current image. This check turns that silent, latent failure into a loud startup failure that blocks the bad deploy.

Type of Change

New feature (deploy-safety guard)

Testing

Unit tests added (5, passing): success, schema-mismatch fail-fast (undefined column + table), retry-then-succeed, retry-exhausted. Typecheck + lint clean.

Checklist

Code follows project style guidelines
Self-reviewed my changes
Tests added/updated and passing
No new warnings introduced
I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

The socket service authorizes every connection with a full-row query against the workflow table. When a deploy ships a realtime image whose compiled schema is ahead of or behind the live DB (e.g. a column dropped by a migration the image predates), that query fails on every request and silently breaks persistence, yet the process stays up and the shallow /health probe keeps returning 200, so the deploy looks healthy while serving nothing.

Run one representative workflow query before listen(): a schema mismatch throws, propagates to the entrypoint, and the task exits non-zero and never goes healthy, so CodeDeploy auto-rolls-back instead of shifting traffic onto broken tasks. Schema-class errors (undefined column/table/function) fail fast; connection-class errors retry with backoff so a cold DB at boot does not flap. Runs once at startup, never on the per-probe LB health check, to avoid a DB blip mass-terminating the fleet (cascading failure).

cursor · 2026-06-10T04:49:21Z

PR Summary

Medium Risk
Changes realtime startup sequencing: a failed preflight or prolonged DB outage at boot blocks the process from listening, which is intended for deploy safety but can delay or fail cold starts.

Overview
Adds a boot-time schema preflight for the realtime Socket.IO service so a deploy whose Drizzle schema does not match the live Postgres database fails before the server listens, instead of staying “healthy” on /health while workflow auth/persistence queries fail on every connection.

assertSchemaCompatibility runs a single representative workflow select … limit 1 once at startup (wired in index.ts immediately before httpServer.listen). Schema mismatch Postgres SQLSTATEs (42703, 42P01, 42883), including codes nested on error.cause, throw immediately with an explicit incompatible-schema message. Transient reachability errors retry up to five times with jittered backoff, then exit with a database-unreachable error. The check is intentionally not hooked to LB health probes to avoid fleet-wide restarts on short DB blips.
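
The classification described above can be sketched as a small predicate. The three SQLSTATE codes come from this summary; the function body, the constant names, and the depth bound are assumptions for illustration, not the contents of preflight.ts.

```typescript
// SQLSTATE codes treated as schema-class errors (listed in the summary):
// 42703 undefined_column, 42P01 undefined_table, 42883 undefined_function.
const SCHEMA_MISMATCH_CODES = new Set(["42703", "42P01", "42883"]);

// Hypothetical bound so a cyclic cause chain cannot loop forever.
const MAX_CAUSE_DEPTH = 10;

// Returns true when the error, or any error on its `cause` chain, carries
// one of the schema-class SQLSTATE codes in a string `code` property.
function isSchemaMismatch(error: unknown): boolean {
  let current: unknown = error;
  for (let depth = 0; depth < MAX_CAUSE_DEPTH && current != null; depth++) {
    if (typeof current !== "object") return false;
    const { code, cause } = current as { code?: unknown; cause?: unknown };
    if (typeof code === "string" && SCHEMA_MISMATCH_CODES.has(code)) return true;
    current = cause; // an ORM may wrap the driver error, so keep unwrapping
  }
  return false;
}
```

Walking the chain matters because a query builder can rethrow the driver error inside its own error object; checking only the outer error would classify a dropped column as a connection problem and retry it.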

Vitest coverage exercises success, fail-fast mismatch (column/table and wrapped cause), retry-then-success, and exhausted retries.

Reviewed by Cursor Bugbot for commit 8967649.

greptile-apps · 2026-06-10T04:52:26Z

Greptile Summary

This PR adds a boot-time schema-compatibility preflight check to the realtime socket server. It runs one representative Drizzle query against the workflow table before the HTTP server accepts traffic; a schema-class error (SQLSTATE 42703/42P01/42883) causes an immediate startup failure and CodeDeploy rollback, while connection-class errors are retried with exponential backoff up to five attempts.

preflight.ts — new assertSchemaCompatibility function; isSchemaMismatch walks the full cause chain to detect Drizzle-wrapped driver errors, and the retry loop correctly breaks before sleeping on the final attempt.
preflight.test.ts — five unit tests covering success, two schema-mismatch variants, cause-wrapped mismatch, retry-then-succeed, and exhausted retries (including sleep-call-count assertions).
index.ts — single call to assertSchemaCompatibility() inserted after all in-process setup but before httpServer.listen(), so no traffic can be accepted on a bad image.
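
A rough sketch of the retry loop summarized in the points above, with the probe query and the sleep function injected so the control flow can be checked in isolation. `MAX_CONNECT_ATTEMPTS` is the name used in the sequence diagram; `assertSchemaCompatibilitySketch`, `hasSchemaCode`, the default delay function, and the exact error wording are assumptions, not the PR's implementation.

```typescript
type Probe = () => Promise<unknown>; // stands in for the representative workflow query
type Sleep = (ms: number) => Promise<void>;

const MAX_CONNECT_ATTEMPTS = 5;
const SCHEMA_CODES = new Set(["42703", "42P01", "42883"]);

// Compact classifier: checks `code` on the error and on its cause chain.
function hasSchemaCode(error: unknown, depth = 0): boolean {
  if (typeof error !== "object" || error === null || depth > 10) return false;
  const { code, cause } = error as { code?: unknown; cause?: unknown };
  if (typeof code === "string" && SCHEMA_CODES.has(code)) return true;
  return hasSchemaCode(cause, depth + 1);
}

async function assertSchemaCompatibilitySketch(
  probe: Probe,
  sleep: Sleep,
  delayMs: (attempt: number) => number = (attempt) => 500 * 2 ** (attempt - 1),
): Promise<void> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= MAX_CONNECT_ATTEMPTS; attempt++) {
    try {
      await probe();
      return; // schema and database agree
    } catch (error) {
      // Schema-class errors never heal on retry: fail fast.
      if (hasSchemaCode(error)) {
        throw new Error("realtime schema is incompatible with live database");
      }
      lastError = error;
      // Break before sleeping on the final attempt so exhaustion is reported promptly.
      if (attempt === MAX_CONNECT_ATTEMPTS) break;
      await sleep(delayMs(attempt));
    }
  }
  throw new Error(`database unreachable after ${MAX_CONNECT_ATTEMPTS} attempts: ${String(lastError)}`);
}
```

Injecting `sleep` is what makes the sleep-call-count assertions mentioned for the tests cheap to write: a fake that only counts calls is enough.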

Confidence Score: 5/5

Safe to merge — the preflight logic is correct, the retry loop no longer sleeps after the final attempt, and tests assert sleep call counts on both retry paths.

The three files changed are entirely additive (a new module, its tests, and a single call-site). The core logic — schema-class vs. connection-class error classification, cause-chain walking, retry loop with break-before-sleep on the last attempt — is all sound and well-covered by tests. No existing behaviour is modified; the only risk is the preflight itself throwing unexpectedly, which is handled by the top-level .catch that exits with code 1.
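
The top-level catch mentioned above might look roughly like this. `runEntrypoint` and its injected `exit` and `log` parameters are hypothetical, chosen so the behavior can be exercised without terminating the process; in a real Node entrypoint `exit` would typically be `process.exit`.

```typescript
type Exit = (code: number) => void;
type Log = (message: string) => void;

// Await the startup routine and convert any rejection (including a failed
// preflight) into exit code 1, so the task never reports healthy.
async function runEntrypoint(
  main: () => Promise<void>,
  exit: Exit,
  log: Log = (message) => console.error(message),
): Promise<void> {
  try {
    await main();
  } catch (error) {
    const reason = error instanceof Error ? error.message : String(error);
    log(`Failed to start realtime server: ${reason}`);
    exit(1);
  }
}
```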

No files require special attention.

Important Files Changed

Filename	Overview
apps/realtime/src/database/preflight.ts	New preflight module; cause-chain walk, schema vs. connection error classification, and retry loop with backoff are all correctly implemented. Sleep-before-last-attempt issue from a prior review is resolved.
apps/realtime/src/database/preflight.test.ts	Five tests cover the full decision tree; sleep call counts are asserted on both retry paths. Minor: the undefined-table test doesn't assert `sleep` not called, but other mismatch tests already cover that invariant.
apps/realtime/src/index.ts	Preflight call is placed correctly — after full in-process setup, before `httpServer.listen()`. SIGTERM/SIGINT handlers are still registered after `listen`, leaving a gap during the preflight window (flagged in a prior review, not addressed here).

Sequence Diagram

sequenceDiagram
    participant Proc as Process
    participant IO as Socket.IO + RoomMgr
    participant PF as assertSchemaCompatibility
    participant DB as PostgreSQL

    Proc->>IO: createSocketIOServer + createRoomManager
    Proc->>IO: io.on('connection', handler)
    Proc->>PF: await assertSchemaCompatibility()

    loop up to MAX_CONNECT_ATTEMPTS (5)
        PF->>DB: SELECT col1,col2,...,colN FROM workflow LIMIT 1
        alt Schema mismatch (42703 / 42P01 / 42883)
            DB-->>PF: Error (SQLSTATE code)
            PF-->>Proc: throw "incompatible with live database"
            Proc-->>Proc: process.exit(1) → CodeDeploy rollback
        else Connection error
            DB-->>PF: Error (ECONNREFUSED / timeout)
            PF->>PF: sleep(backoffWithJitter)
        else Success
            DB-->>PF: []
            PF-->>Proc: return
        end
    end

    alt All attempts exhausted
        PF-->>Proc: throw "database unreachable"
        Proc-->>Proc: process.exit(1) → CodeDeploy rollback
    else Preflight passed
        Proc->>Proc: httpServer.listen(PORT)
        Proc->>Proc: register SIGINT/SIGTERM handlers
    end

Reviews (2): Last reviewed commit: "fix(realtime): unwrap cause for schema c..."

…attempt
- isSchemaMismatch now walks the error.cause chain: Drizzle wraps the driver error, so the SQLSTATE often lives on the inner cause, not the outer throw. Without this, a wrapped 42703/42P01 was retried 5x and misreported as "database unreachable" instead of failing fast.
- No longer sleeps after the final failed attempt (~6-10s of dead wait that undermined the fail-fast contract); sleep now only happens between attempts.
- Tests: assert sleep is called exactly 4 times on exhaustion, and add a wrapped-cause fail-fast case.
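
The "exactly 4 times" figure follows from sleeping only between attempts: five attempts leave four gaps. A small sketch of such a schedule, assuming exponential growth with additive jitter; `backoffSchedule` and its base, cap, and jitter constants are hypothetical and are not taken from preflight.ts.

```typescript
// Returns the delays (in ms) slept between attempts. There is one delay per
// gap, so maxAttempts attempts produce maxAttempts - 1 delays and nothing is
// slept after the final failure.
function backoffSchedule(
  maxAttempts: number,
  baseMs = 500, // assumed base delay
  capMs = 8000, // assumed upper bound before jitter
  random: () => number = Math.random,
): number[] {
  const delays: number[] = [];
  for (let attempt = 1; attempt < maxAttempts; attempt++) {
    const exponential = Math.min(capMs, baseMs * 2 ** (attempt - 1));
    const jitter = random() * exponential * 0.5; // up to 50% extra
    delays.push(Math.round(exponential + jitter));
  }
  return delays;
}

// With jitter disabled the schedule is deterministic:
// backoffSchedule(5, 500, 8000, () => 0) returns [500, 1000, 2000, 4000]
```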

waleedlatif1 · 2026-06-10T04:56:57Z

@greptile

waleedlatif1 · 2026-06-10T04:57:08Z

@cursor review

cursor

✅ Bugbot reviewed your changes and found no new issues!

Reviewed by Cursor Bugbot for commit 8967649.

cursor Bot reviewed Jun 10, 2026
Comment thread apps/realtime/src/database/preflight.ts

greptile-apps Bot reviewed Jun 10, 2026
Comment thread apps/realtime/src/database/preflight.ts
Comment thread apps/realtime/src/database/preflight.test.ts

vercel Bot temporarily deployed to Preview June 10, 2026 04:56 Inactive

cursor Bot reviewed Jun 10, 2026

waleedlatif1 merged commit 272bad9 into staging Jun 10, 2026
14 checks passed

waleedlatif1 deleted the fix/realtime-schema-startup-check branch June 10, 2026 05:03

waleedlatif1 merged 2 commits into staging from fix/realtime-schema-startup-check
