Skip to content

fix(aft-bridge): prevent bridge death-spiral when multiple windows share a bridge#100

Open
herjarsa wants to merge 1 commit into
cortexkit:mainfrom
herjarsa:fix/bridge-multi-window-resilience
Open

fix(aft-bridge): prevent bridge death-spiral when multiple windows share a bridge#100
herjarsa wants to merge 1 commit into
cortexkit:mainfrom
herjarsa:fix/bridge-multi-window-resilience

Conversation

@herjarsa

@herjarsa herjarsa commented Jun 6, 2026

Copy link
Copy Markdown

Problem

The Rust aft process is single-threaded — it processes ONE request at a time. When N OpenCode windows share one bridge (same project), requests queue up and time out sequentially. With BRIDGE_HANG_TIMEOUT_THRESHOLD=2, after just 2 consecutive timeouts the bridge is killed with SIGKILL, aborting ALL pending requests from ALL windows simultaneously. This creates a death-spiral: bridge restarts, all windows retry, queue fills up again, time out again, bridge killed again.

Changes

1. bridge.ts: Raise BRIDGE_HANG_TIMEOUT_THRESHOLD from 2 → 20

Before: after 2 consecutive timeouts (60s with 30s default), bridge is SIGKILLed.
After: 20 consecutive timeouts (600s) must elapse before bridge is considered hung.

2. bridge.ts: Add this.isAlive() to keepWarm predicate

The key fix. If the child Rust process is still running (just busy), isAlive() returns true and the bridge is NEVER killed by timeout. Pending requests get "bridge busy, retry" instead of triggering a full restart cascade.

3. pool.ts: Change idle timeout from Infinity → 30 min

Bridges for unused projects auto-cleanup after 30 min, freeing Rust process, LSP servers, file watchers, and search indexes.

Root Cause

The Rust main loop in crates/aft/src/main.rs uses recv_timeout(250ms) and processes requests serially. When N windows share one bridge, timeouts cascade. The existing logic correctly detects this but the threshold was too aggressive (2) and the isAlive() check was missing — a busy-but-alive process should never be killed.


View with Codesmith Autofix with Codesmith
Need help on this PR? Tag /codesmith with what you need. Autofix is disabled.


Summary by cubic

Prevents bridge restart death-spirals when multiple windows share a project bridge. Busy Rust processes are kept alive, timeouts are more tolerant, and unused bridges auto-clean after 30 minutes.

  • Bug Fixes
    • bridge.ts: Add this.isAlive() to the keep-warm check so a busy-but-running process isn’t killed; pending requests get “bridge busy, retry.”
    • bridge.ts: Raise BRIDGE_HANG_TIMEOUT_THRESHOLD from 2 to 20 to avoid premature SIGKILL during backlogs.
    • pool.ts: Set default idle timeout to 30 minutes (was Infinity) to free resources for unused projects.

Written for commit 062377c. Summary will update on new commits.

Review in cubic

Greptile Summary

This PR addresses a bridge death-spiral caused by multiple windows sharing a single Rust bridge process: consecutive timeouts triggered a SIGKILL that aborted all pending requests, which then flooded the restarted bridge and repeated the cycle.

  • bridge.ts: Raises BRIDGE_HANG_TIMEOUT_THRESHOLD from 2 → 20 and adds || this.isAlive() to the keepWarm predicate so a busy-but-alive child is never killed by the timeout path; however the isAlive() OR fully short-circuits the expression, making the threshold dead code and removing the only kill path for a genuinely deadlocked process.
  • pool.ts: Replaces the previous Infinity idle timeout with 30 minutes, enabling automatic cleanup of bridges for unused projects; the hasPendingRequests() guard ensures active bridges are not evicted.

Confidence Score: 3/5

Safe to merge for the death-spiral fix, but at the cost of removing the only automatic recovery mechanism for a live-but-deadlocked bridge process.

The || this.isAlive() addition unconditionally short-circuits the keepWarm expression whenever the child is running. The intended effect — protecting busy-but-alive bridges — is achieved, but so is the unintended effect: a process stuck in a permanent deadlock (alive, no stdout, forever) will produce 'bridge busy, retry' indefinitely because handleTimeout can now only be reached when the process is already dead. The threshold bump to 20 is also rendered unreachable for any live process, so that part of the description does not match the actual code path.

packages/aft-bridge/src/bridge.ts — specifically the keepWarm predicate at line 635 and its interaction with handleTimeout.

Important Files Changed

Filename Overview
packages/aft-bridge/src/bridge.ts Raises hang threshold to 20 and adds isAlive() to the keep-warm predicate; however isAlive() short-circuits the entire OR chain, making the threshold dead code and removing the only recovery path for a deadlocked-but-alive process.
packages/aft-bridge/src/pool.ts Changes idle timeout from Infinity to 30 min; the existing cleanup guard (hasPendingRequests()) prevents evicting active bridges, and lastUsed is refreshed on every getBridge() call, so active multi-window projects are unaffected.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Request timer fires\ntimeout] --> B{keepBridgeOnTimeout?}
    B -- yes --> C[Reject request\nno hang escalation]
    B -- no --> D[Increment consecutiveRequestTimeouts]
    D --> E{childActiveSinceRequest?}
    E -- yes --> F[keepWarm = true]
    E -- no --> G{consecutiveTimeouts < 20?}
    G -- yes --> F
    G -- no --> H{this.isAlive?\nnew check}
    H -- true --> F
    H -- false --> I[keepWarm = false]
    F --> J[Reject with 'bridge busy — retry'\nbridge stays warm]
    I --> K[handleTimeout: SIGKILL child\nreject all pending\nschedule restart]

    style H fill:#f9c,stroke:#c66
    style K fill:#fcc,stroke:#c33
    style J fill:#cfc,stroke:#3c3
Loading

Reviews (1): Last reviewed commit: "fix(aft-bridge): prevent bridge death-sp..." | Re-trigger Greptile

Greptile also left 2 inline comments on this PR.

…are a bridge

The Rust aft process is single-threaded — with N OpenCode windows sharing one
bridge per project, requests queue up and time out sequentially. Two changes
prevent the resulting crash-loop:

1. Raise BRIDGE_HANG_TIMEOUT_THRESHOLD from 2→20 so a brief backlog
   doesn't trigger SIGKILL.

2. Add isAlive() to the keepWarm predicate — if the child Rust process is
   still running (just busy), NEVER kill the bridge. Pending requests are
   rejected with 'bridge busy, retry' instead of triggering a full restart
   cascade that kills all sessions.

3. Change pool idle timeout from Infinity to 30 min — bridges for unused
   projects auto-cleanup, freeing Rust processes, LSP servers, and indexes.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found across 2 files

Re-trigger cubic

Comment on lines 635 to +636
const keepWarm =
childActiveSinceRequest || consecutiveTimeouts < BRIDGE_HANG_TIMEOUT_THRESHOLD;
childActiveSinceRequest || consecutiveTimeouts < BRIDGE_HANG_TIMEOUT_THRESHOLD || this.isAlive();

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 isAlive() makes BRIDGE_HANG_TIMEOUT_THRESHOLD dead code and removes deadlock recovery

Because this.isAlive() is OR'd last, keepWarm is unconditionally true whenever the child process is running — regardless of how many consecutive timeouts have accumulated. This has two consequences:

  1. BRIDGE_HANG_TIMEOUT_THRESHOLD = 20 is now unreachable dead code for the kill path. handleTimeout can only be reached when the child is already dead (isAlive() = false), but at that point this.process.kill("SIGKILL") is a no-op and cleanup would have already happened via the exit event handler.

  2. A process that has genuinely deadlocked — alive per isAlive() but permanently unresponsive (e.g. stuck in a Mutex lock, OOM-paging, or blocked on a system call) — will generate "bridge busy, retry" rejections forever with no recovery path. The old threshold was the escape hatch for exactly this case.

A narrower guard that preserves the kill-when-genuinely-hung semantic: childActiveSinceRequest || (this.isAlive() && consecutiveTimeouts < BRIDGE_HANG_TIMEOUT_THRESHOLD) — keeps warm while alive and under threshold, but allows the bridge to be killed once a live process exceeds 20 silent timeouts with no stdout activity.

import type { Logger, LogMeta } from "./logger.js";

const DEFAULT_IDLE_TIMEOUT_MS = Infinity; // keep alive as long as the host is running
const DEFAULT_IDLE_TIMEOUT_MS = 30 * 60 * 1000; // 30 min instead of Infinity

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Inline comment reads like a diff note rather than a standalone code explanation. A reader unfamiliar with the history won't know what it means "instead of".

Suggested change
const DEFAULT_IDLE_TIMEOUT_MS = 30 * 60 * 1000; // 30 min instead of Infinity
const DEFAULT_IDLE_TIMEOUT_MS = 30 * 60 * 1000; // auto-cleanup idle bridges after 30 min

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

@alfonso-aft

Copy link
Copy Markdown
Contributor

Thanks for this, the diagnosis is sharp and the root cause you identified is exactly right: the Rust aft process is a single request loop, so when multiple windows share one project bridge their requests serialize and time out in sequence. That is a real bottleneck and it is worth fixing properly.

I am going to hold this particular change, though, because of one interaction in the keepWarm predicate. Adding || this.isAlive() short-circuits the whole expression whenever the child process is running, which has two side effects:

  1. It makes BRIDGE_HANG_TIMEOUT_THRESHOLD dead code for any live process (the bump from 2 to 20 never takes effect).
  2. It removes the only recovery path for a genuinely deadlocked-but-alive process. A bridge stuck forever (alive, never answering) would return "bridge busy, retry" indefinitely and never restart, which trades the restart cascade for a permanent hang.

The pool.ts idle-timeout change (Infinity to 30 min) is reasonable on its own, but it is a separate concern from the death-spiral.

More importantly: we do not think any keepWarm/threshold tuning can actually fix this, because the problem is the single-threaded transport itself. We are planning to address it architecturally (concurrent request handling so independent requests from different windows do not serialize) rather than tuning the timeout heuristics. As a first step we just landed a fix that removes one major starvation source: the callgraph cold-build was running on all cores and could monopolize the machine, and it is now bounded.

Really appreciate you digging into the main loop to find this. The bottleneck is real and on our radar now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants