fix(aft-bridge): prevent bridge death-spiral when multiple windows share a bridge by herjarsa · Pull Request #100 · cortexkit/aft

herjarsa · 2026-06-06T13:24:44Z

Problem

The Rust aft process is single-threaded — it processes ONE request at a time. When N OpenCode windows share one bridge (same project), requests queue up and time out sequentially. With BRIDGE_HANG_TIMEOUT_THRESHOLD=2, after just 2 consecutive timeouts the bridge is killed with SIGKILL, aborting ALL pending requests from ALL windows simultaneously. This creates a death-spiral: bridge restarts, all windows retry, queue fills up again, time out again, bridge killed again.

Changes

1. `bridge.ts`: Raise `BRIDGE_HANG_TIMEOUT_THRESHOLD` from 2 → 20

Before: after 2 consecutive timeouts (60s with 30s default), bridge is SIGKILLed.
After: 20 consecutive timeouts (600s) must elapse before bridge is considered hung.

2. `bridge.ts`: Add `this.isAlive()` to keepWarm predicate

The key fix. If the child Rust process is still running (just busy), isAlive() returns true and the bridge is NEVER killed by timeout. Pending requests get "bridge busy, retry" instead of triggering a full restart cascade.

3. `pool.ts`: Change idle timeout from `Infinity` → 30 min

Bridges for unused projects auto-cleanup after 30 min, freeing Rust process, LSP servers, file watchers, and search indexes.

Root Cause

The Rust main loop in crates/aft/src/main.rs uses recv_timeout(250ms) and processes requests serially. When N windows share one bridge, timeouts cascade. The existing logic correctly detects this but the threshold was too aggressive (2) and the isAlive() check was missing — a busy-but-alive process should never be killed.

^{Need help on this PR? Tag /codesmith with what you need. Autofix is disabled.}

Summary by cubic

Prevents bridge restart death-spirals when multiple windows share a project bridge. Busy Rust processes are kept alive, timeouts are more tolerant, and unused bridges auto-clean after 30 minutes.

Bug Fixes
- bridge.ts: Add this.isAlive() to the keep-warm check so a busy-but-running process isn’t killed; pending requests get “bridge busy, retry.”
- bridge.ts: Raise BRIDGE_HANG_TIMEOUT_THRESHOLD from 2 to 20 to avoid premature SIGKILL during backlogs.
- pool.ts: Set default idle timeout to 30 minutes (was Infinity) to free resources for unused projects.

^{Written for commit 062377c. Summary will update on new commits.}

Greptile Summary

This PR addresses a bridge death-spiral caused by multiple windows sharing a single Rust bridge process: consecutive timeouts triggered a SIGKILL that aborted all pending requests, which then flooded the restarted bridge and repeated the cycle.

bridge.ts: Raises BRIDGE_HANG_TIMEOUT_THRESHOLD from 2 → 20 and adds || this.isAlive() to the keepWarm predicate so a busy-but-alive child is never killed by the timeout path; however the isAlive() OR fully short-circuits the expression, making the threshold dead code and removing the only kill path for a genuinely deadlocked process.
pool.ts: Replaces the previous Infinity idle timeout with 30 minutes, enabling automatic cleanup of bridges for unused projects; the hasPendingRequests() guard ensures active bridges are not evicted.

Confidence Score: 3/5

Safe to merge for the death-spiral fix, but at the cost of removing the only automatic recovery mechanism for a live-but-deadlocked bridge process.

The || this.isAlive() addition unconditionally short-circuits the keepWarm expression whenever the child is running. The intended effect — protecting busy-but-alive bridges — is achieved, but so is the unintended effect: a process stuck in a permanent deadlock (alive, no stdout, forever) will produce 'bridge busy, retry' indefinitely because handleTimeout can now only be reached when the process is already dead. The threshold bump to 20 is also rendered unreachable for any live process, so that part of the description does not match the actual code path.

packages/aft-bridge/src/bridge.ts — specifically the keepWarm predicate at line 635 and its interaction with handleTimeout.

Important Files Changed

Filename	Overview
packages/aft-bridge/src/bridge.ts	Raises hang threshold to 20 and adds `isAlive()` to the keep-warm predicate; however `isAlive()` short-circuits the entire OR chain, making the threshold dead code and removing the only recovery path for a deadlocked-but-alive process.
packages/aft-bridge/src/pool.ts	Changes idle timeout from `Infinity` to 30 min; the existing cleanup guard (`hasPendingRequests()`) prevents evicting active bridges, and `lastUsed` is refreshed on every `getBridge()` call, so active multi-window projects are unaffected.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Request timer fires\ntimeout] --> B{keepBridgeOnTimeout?}
    B -- yes --> C[Reject request\nno hang escalation]
    B -- no --> D[Increment consecutiveRequestTimeouts]
    D --> E{childActiveSinceRequest?}
    E -- yes --> F[keepWarm = true]
    E -- no --> G{consecutiveTimeouts < 20?}
    G -- yes --> F
    G -- no --> H{this.isAlive?\nnew check}
    H -- true --> F
    H -- false --> I[keepWarm = false]
    F --> J[Reject with 'bridge busy — retry'\nbridge stays warm]
    I --> K[handleTimeout: SIGKILL child\nreject all pending\nschedule restart]

    style H fill:#f9c,stroke:#c66
    style K fill:#fcc,stroke:#c33
    style J fill:#cfc,stroke:#3c3

_{Reviews (1): Last reviewed commit: "fix(aft-bridge): prevent bridge death-sp..." | Re-trigger Greptile}

Greptile also left 2 inline comments on this PR.

…are a bridge The Rust aft process is single-threaded — with N OpenCode windows sharing one bridge per project, requests queue up and time out sequentially. Two changes prevent the resulting crash-loop: 1. Raise BRIDGE_HANG_TIMEOUT_THRESHOLD from 2→20 so a brief backlog doesn't trigger SIGKILL. 2. Add isAlive() to the keepWarm predicate — if the child Rust process is still running (just busy), NEVER kill the bridge. Pending requests are rejected with 'bridge busy, retry' instead of triggering a full restart cascade that kills all sessions. 3. Change pool idle timeout from Infinity to 30 min — bridges for unused projects auto-cleanup, freeing Rust processes, LSP servers, and indexes.

cubic-dev-ai

No issues found across 2 files

_{Re-trigger cubic}

greptile-apps · 2026-06-06T13:27:59Z

          const keepWarm =
-            childActiveSinceRequest || consecutiveTimeouts < BRIDGE_HANG_TIMEOUT_THRESHOLD;
+            childActiveSinceRequest || consecutiveTimeouts < BRIDGE_HANG_TIMEOUT_THRESHOLD || this.isAlive();


isAlive() makes BRIDGE_HANG_TIMEOUT_THRESHOLD dead code and removes deadlock recovery

Because this.isAlive() is OR'd last, keepWarm is unconditionally true whenever the child process is running — regardless of how many consecutive timeouts have accumulated. This has two consequences:

BRIDGE_HANG_TIMEOUT_THRESHOLD = 20 is now unreachable dead code for the kill path. handleTimeout can only be reached when the child is already dead (isAlive() = false), but at that point this.process.kill("SIGKILL") is a no-op and cleanup would have already happened via the exit event handler.

A process that has genuinely deadlocked — alive per isAlive() but permanently unresponsive (e.g. stuck in a Mutex lock, OOM-paging, or blocked on a system call) — will generate "bridge busy, retry" rejections forever with no recovery path. The old threshold was the escape hatch for exactly this case.

A narrower guard that preserves the kill-when-genuinely-hung semantic: childActiveSinceRequest || (this.isAlive() && consecutiveTimeouts < BRIDGE_HANG_TIMEOUT_THRESHOLD) — keeps warm while alive and under threshold, but allows the bridge to be killed once a live process exceeds 20 silent timeouts with no stdout activity.

greptile-apps · 2026-06-06T13:28:00Z

 import type { Logger, LogMeta } from "./logger.js";

-const DEFAULT_IDLE_TIMEOUT_MS = Infinity; // keep alive as long as the host is running
+const DEFAULT_IDLE_TIMEOUT_MS = 30 * 60 * 1000; // 30 min instead of Infinity


Inline comment reads like a diff note rather than a standalone code explanation. A reader unfamiliar with the history won't know what it means "instead of".

Suggested change

const DEFAULT_IDLE_TIMEOUT_MS = 30 * 60 * 1000; // 30 min instead of Infinity

const DEFAULT_IDLE_TIMEOUT_MS = 30 * 60 * 1000; // auto-cleanup idle bridges after 30 min

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

alfonso-aft · 2026-06-07T07:20:19Z

Thanks for this, the diagnosis is sharp and the root cause you identified is exactly right: the Rust aft process is a single request loop, so when multiple windows share one project bridge their requests serialize and time out in sequence. That is a real bottleneck and it is worth fixing properly.

I am going to hold this particular change, though, because of one interaction in the keepWarm predicate. Adding || this.isAlive() short-circuits the whole expression whenever the child process is running, which has two side effects:

It makes BRIDGE_HANG_TIMEOUT_THRESHOLD dead code for any live process (the bump from 2 to 20 never takes effect).
It removes the only recovery path for a genuinely deadlocked-but-alive process. A bridge stuck forever (alive, never answering) would return "bridge busy, retry" indefinitely and never restart, which trades the restart cascade for a permanent hang.

The pool.ts idle-timeout change (Infinity to 30 min) is reasonable on its own, but it is a separate concern from the death-spiral.

More importantly: we do not think any keepWarm/threshold tuning can actually fix this, because the problem is the single-threaded transport itself. We are planning to address it architecturally (concurrent request handling so independent requests from different windows do not serialize) rather than tuning the timeout heuristics. As a first step we just landed a fix that removes one major starvation source: the callgraph cold-build was running on all cores and could monopolize the machine, and it is now bounded.

Really appreciate you digging into the main loop to find this. The bottleneck is real and on our radar now.

cubic-dev-ai Bot reviewed Jun 6, 2026

View reviewed changes

greptile-apps Bot reviewed Jun 6, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(aft-bridge): prevent bridge death-spiral when multiple windows share a bridge#100

fix(aft-bridge): prevent bridge death-spiral when multiple windows share a bridge#100
herjarsa wants to merge 1 commit into
cortexkit:mainfrom
herjarsa:fix/bridge-multi-window-resilience

herjarsa commented Jun 6, 2026 •

edited by greptile-apps Bot

Loading

Uh oh!

cubic-dev-ai Bot left a comment

Uh oh!

greptile-apps Bot Jun 6, 2026

Uh oh!

greptile-apps Bot Jun 6, 2026

Uh oh!

alfonso-aft commented Jun 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	const DEFAULT_IDLE_TIMEOUT_MS = 30 * 60 * 1000; // 30 min instead of Infinity
	const DEFAULT_IDLE_TIMEOUT_MS = 30 * 60 * 1000; // auto-cleanup idle bridges after 30 min

Conversation

herjarsa commented Jun 6, 2026 • edited by greptile-apps Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Changes

1. bridge.ts: Raise BRIDGE_HANG_TIMEOUT_THRESHOLD from 2 → 20

2. bridge.ts: Add this.isAlive() to keepWarm predicate

3. pool.ts: Change idle timeout from Infinity → 30 min

Root Cause

Summary by cubic

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Flowchart

Uh oh!

cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

alfonso-aft commented Jun 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

herjarsa commented Jun 6, 2026 •

edited by greptile-apps Bot

Loading

1. `bridge.ts`: Raise `BRIDGE_HANG_TIMEOUT_THRESHOLD` from 2 → 20

2. `bridge.ts`: Add `this.isAlive()` to keepWarm predicate

3. `pool.ts`: Change idle timeout from `Infinity` → 30 min