fix(aft-bridge): prevent bridge death-spiral when multiple windows share a bridge#100
fix(aft-bridge): prevent bridge death-spiral when multiple windows share a bridge#100herjarsa wants to merge 1 commit into
Conversation
…are a bridge The Rust aft process is single-threaded — with N OpenCode windows sharing one bridge per project, requests queue up and time out sequentially. Two changes prevent the resulting crash-loop: 1. Raise BRIDGE_HANG_TIMEOUT_THRESHOLD from 2→20 so a brief backlog doesn't trigger SIGKILL. 2. Add isAlive() to the keepWarm predicate — if the child Rust process is still running (just busy), NEVER kill the bridge. Pending requests are rejected with 'bridge busy, retry' instead of triggering a full restart cascade that kills all sessions. 3. Change pool idle timeout from Infinity to 30 min — bridges for unused projects auto-cleanup, freeing Rust processes, LSP servers, and indexes.
| const keepWarm = | ||
| childActiveSinceRequest || consecutiveTimeouts < BRIDGE_HANG_TIMEOUT_THRESHOLD; | ||
| childActiveSinceRequest || consecutiveTimeouts < BRIDGE_HANG_TIMEOUT_THRESHOLD || this.isAlive(); |
There was a problem hiding this comment.
isAlive() makes BRIDGE_HANG_TIMEOUT_THRESHOLD dead code and removes deadlock recovery
Because this.isAlive() is OR'd last, keepWarm is unconditionally true whenever the child process is running — regardless of how many consecutive timeouts have accumulated. This has two consequences:
-
BRIDGE_HANG_TIMEOUT_THRESHOLD = 20is now unreachable dead code for the kill path.handleTimeoutcan only be reached when the child is already dead (isAlive() = false), but at that pointthis.process.kill("SIGKILL")is a no-op and cleanup would have already happened via theexitevent handler. -
A process that has genuinely deadlocked — alive per
isAlive()but permanently unresponsive (e.g. stuck in a Mutex lock, OOM-paging, or blocked on a system call) — will generate "bridge busy, retry" rejections forever with no recovery path. The old threshold was the escape hatch for exactly this case.
A narrower guard that preserves the kill-when-genuinely-hung semantic: childActiveSinceRequest || (this.isAlive() && consecutiveTimeouts < BRIDGE_HANG_TIMEOUT_THRESHOLD) — keeps warm while alive and under threshold, but allows the bridge to be killed once a live process exceeds 20 silent timeouts with no stdout activity.
| import type { Logger, LogMeta } from "./logger.js"; | ||
|
|
||
| const DEFAULT_IDLE_TIMEOUT_MS = Infinity; // keep alive as long as the host is running | ||
| const DEFAULT_IDLE_TIMEOUT_MS = 30 * 60 * 1000; // 30 min instead of Infinity |
There was a problem hiding this comment.
Inline comment reads like a diff note rather than a standalone code explanation. A reader unfamiliar with the history won't know what it means "instead of".
| const DEFAULT_IDLE_TIMEOUT_MS = 30 * 60 * 1000; // 30 min instead of Infinity | |
| const DEFAULT_IDLE_TIMEOUT_MS = 30 * 60 * 1000; // auto-cleanup idle bridges after 30 min |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
|
Thanks for this, the diagnosis is sharp and the root cause you identified is exactly right: the Rust I am going to hold this particular change, though, because of one interaction in the
The More importantly: we do not think any Really appreciate you digging into the main loop to find this. The bottleneck is real and on our radar now. |
Problem
The Rust
aftprocess is single-threaded — it processes ONE request at a time. When N OpenCode windows share one bridge (same project), requests queue up and time out sequentially. WithBRIDGE_HANG_TIMEOUT_THRESHOLD=2, after just 2 consecutive timeouts the bridge is killed with SIGKILL, aborting ALL pending requests from ALL windows simultaneously. This creates a death-spiral: bridge restarts, all windows retry, queue fills up again, time out again, bridge killed again.Changes
1.
bridge.ts: RaiseBRIDGE_HANG_TIMEOUT_THRESHOLDfrom 2 → 20Before: after 2 consecutive timeouts (60s with 30s default), bridge is SIGKILLed.
After: 20 consecutive timeouts (600s) must elapse before bridge is considered hung.
2.
bridge.ts: Addthis.isAlive()to keepWarm predicateThe key fix. If the child Rust process is still running (just busy),
isAlive()returns true and the bridge is NEVER killed by timeout. Pending requests get "bridge busy, retry" instead of triggering a full restart cascade.3.
pool.ts: Change idle timeout fromInfinity→ 30 minBridges for unused projects auto-cleanup after 30 min, freeing Rust process, LSP servers, file watchers, and search indexes.
Root Cause
The Rust main loop in
crates/aft/src/main.rsusesrecv_timeout(250ms)and processes requests serially. When N windows share one bridge, timeouts cascade. The existing logic correctly detects this but the threshold was too aggressive (2) and theisAlive()check was missing — a busy-but-alive process should never be killed.Need help on this PR? Tag
/codesmithwith what you need. Autofix is disabled.Summary by cubic
Prevents bridge restart death-spirals when multiple windows share a project bridge. Busy Rust processes are kept alive, timeouts are more tolerant, and unused bridges auto-clean after 30 minutes.
bridge.ts: Addthis.isAlive()to the keep-warm check so a busy-but-running process isn’t killed; pending requests get “bridge busy, retry.”bridge.ts: RaiseBRIDGE_HANG_TIMEOUT_THRESHOLDfrom 2 to 20 to avoid premature SIGKILL during backlogs.pool.ts: Set default idle timeout to 30 minutes (wasInfinity) to free resources for unused projects.Written for commit 062377c. Summary will update on new commits.
Greptile Summary
This PR addresses a bridge death-spiral caused by multiple windows sharing a single Rust bridge process: consecutive timeouts triggered a SIGKILL that aborted all pending requests, which then flooded the restarted bridge and repeated the cycle.
bridge.ts: RaisesBRIDGE_HANG_TIMEOUT_THRESHOLDfrom 2 → 20 and adds|| this.isAlive()to thekeepWarmpredicate so a busy-but-alive child is never killed by the timeout path; however theisAlive()OR fully short-circuits the expression, making the threshold dead code and removing the only kill path for a genuinely deadlocked process.pool.ts: Replaces the previousInfinityidle timeout with 30 minutes, enabling automatic cleanup of bridges for unused projects; thehasPendingRequests()guard ensures active bridges are not evicted.Confidence Score: 3/5
Safe to merge for the death-spiral fix, but at the cost of removing the only automatic recovery mechanism for a live-but-deadlocked bridge process.
The
|| this.isAlive()addition unconditionally short-circuits thekeepWarmexpression whenever the child is running. The intended effect — protecting busy-but-alive bridges — is achieved, but so is the unintended effect: a process stuck in a permanent deadlock (alive, no stdout, forever) will produce 'bridge busy, retry' indefinitely becausehandleTimeoutcan now only be reached when the process is already dead. The threshold bump to 20 is also rendered unreachable for any live process, so that part of the description does not match the actual code path.packages/aft-bridge/src/bridge.ts — specifically the
keepWarmpredicate at line 635 and its interaction withhandleTimeout.Important Files Changed
isAlive()to the keep-warm predicate; howeverisAlive()short-circuits the entire OR chain, making the threshold dead code and removing the only recovery path for a deadlocked-but-alive process.Infinityto 30 min; the existing cleanup guard (hasPendingRequests()) prevents evicting active bridges, andlastUsedis refreshed on everygetBridge()call, so active multi-window projects are unaffected.Flowchart
%%{init: {'theme': 'neutral'}}%% flowchart TD A[Request timer fires\ntimeout] --> B{keepBridgeOnTimeout?} B -- yes --> C[Reject request\nno hang escalation] B -- no --> D[Increment consecutiveRequestTimeouts] D --> E{childActiveSinceRequest?} E -- yes --> F[keepWarm = true] E -- no --> G{consecutiveTimeouts < 20?} G -- yes --> F G -- no --> H{this.isAlive?\nnew check} H -- true --> F H -- false --> I[keepWarm = false] F --> J[Reject with 'bridge busy — retry'\nbridge stays warm] I --> K[handleTimeout: SIGKILL child\nreject all pending\nschedule restart] style H fill:#f9c,stroke:#c66 style K fill:#fcc,stroke:#c33 style J fill:#cfc,stroke:#3c3Reviews (1): Last reviewed commit: "fix(aft-bridge): prevent bridge death-sp..." | Re-trigger Greptile