Skip to content

FE-872: Install-failure classification — infra-vs-test split#218

Open
kostandinang wants to merge 6 commits into
ka/fe-871-brunch-detectfrom
ka/fe-872-dep-install-classification
Open

FE-872: Install-failure classification — infra-vs-test split#218
kostandinang wants to merge 6 commits into
ka/fe-871-brunch-detectfrom
ka/fe-872-dep-install-classification

Conversation

@kostandinang

@kostandinang kostandinang commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Stacks on FE-871. Third Arc-1 frontier -- the FE-843-deferred fail/infra test-outcome split. No install verb: the install action stays agent-native (bash + FE-843 testConventions, A98).

What?

Make a broken toolchain distinguishable from a failing test, report it honestly, pin the promoted tree as reproducible, and make that classification visible everywhere tests run.

  • Slice 1 -- classify (TestResult.failureKind, classifyTestFailure): a failed run carries 'infra' | 'test'. Classified conservatively -- only an unambiguous "runner isn't there" signal is infra (spawn ENOENT, shell command not found). Everything else, incl. a missing module, stays test (ambiguous with a legitimate TDD red -- mislabeling would silently skip a real failure). The tests-run report surfaces an aggregate failureKind.
  • Slice 2 -- react (net-compiler.ts): when a slice exhausts retries because tests never ran, the halt reads toolchain/install failure (tests never ran in N attempts) instead of the misdirecting retry exhaustion.
  • Slice 3 -- greenfield dep capture (promote-run.test.ts): the agent's manifest + lockfile are pinned as a promotion invariant via git ls-files -- turning promoteGreenfieldRun's incidental copy into an asserted reproducible-tree guarantee.
  • Slice 4 -- unify the seam (runVerification): collapse the three diverged test-execution paths onto one TestRunner + one verdict helper, so failureKind is visible at every site, not just the net path.

Why?

TestResult was a single { passed } boolean, so a failed npm install looked identical to a logic bug -- sending the code-writer to "fix the code" while the toolchain never installed, then halting with a cause that named the wrong thing. And without a pinned capture invariant, a promoted tree could silently drop the lockfile. app-runtime-probe / integration-oracle depend on both: infra separated from test, and deps reproducible in the promoted tree.

Slice 4 closes the gap that slices 1-2 left: classification was only half-wired. evaluate-done and verify-epic ran tests through a private, spawn-based runTest that returned a bare boolean, so the new failureKind never reached them -- only the net run-tests path saw it. The duplicate execution path had also drifted (spawn vs spawnSync). Now there is one execution seam (TestRunner.run) and one verification seam (runVerification) owning the >=1-and-all-pass verdict rule and the infra-dominates aggregate; a runner that throws is treated as infra. pi-actions.runTest and evaluateVerificationTargets are deleted.

Scope discipline

Not a bespoke re-install arc -- the loop already retries and the agent re-installs via bash on its next turn. The harness owns only what bash install can't give: the classification, the honest terminal cause, the capture invariant, and one path that carries them.

Deferred

Brownfield dep-delta capture over the CoW baseline is blocked on brownfield-promotion (no brownfield promote path yet).

Co-authored-by: Amp amp@ampcode.com

@kostandinang kostandinang changed the title FE-872: classify test-run failures as infra vs test (slice 1) FE-872: Install-failure classification — infra-vs-test split + honest halt reason Jun 15, 2026
@kostandinang kostandinang changed the title FE-872: Install-failure classification — infra-vs-test split + honest halt reason FE-872: Install-failure classification — infra-vs-test split, honest halt, unified runner seam Jun 16, 2026
@kostandinang kostandinang force-pushed the ka/fe-872-dep-install-classification branch from a87e942 to ff041c6 Compare June 16, 2026 10:00
@kostandinang kostandinang changed the title FE-872: Install-failure classification — infra-vs-test split, honest halt, unified runner seam FE-872: Install-failure classification — infra-vs-test split Jun 16, 2026
@kostandinang kostandinang force-pushed the ka/fe-871-brunch-detect branch from c949e53 to a1afc98 Compare June 16, 2026 12:53
@kostandinang kostandinang force-pushed the ka/fe-872-dep-install-classification branch 3 times, most recently from 2cec280 to fecc5c6 Compare June 16, 2026 13:38
@kostandinang kostandinang marked this pull request as ready for review June 16, 2026 14:18
@cursor

cursor Bot commented Jun 16, 2026

Copy link
Copy Markdown

PR Summary

Medium Risk
Changes cook-loop halt semantics and shared test execution paths; behavior is heavily unit/integration tested and classification is intentionally conservative to avoid skipping real test failures.

Overview
FE-872 separates a broken toolchain from a failing test so the cook loop stops blaming “retry exhaustion” when tests never ran, and promotion can assert a reproducible dependency tree.

Failed runs now carry failureKind: 'infra' | 'test' via classifyTestFailure (missing runner / command not foundinfra; missing modules and assertion failures stay test so TDD reds are not skipped). ToolchainTestRunner stamps that on failures; tests-run, eval-done, and epic-verified reports include an aggregate failureKind where infra dominates.

runVerification replaces the duplicated spawn-based paths in pi-actions (evaluateVerificationTargets / private runTest removed). The Petri net run-tests transition, evaluate-done, and verify-epic all use the same TestRunner injection (wired from cook-cli). When retries exhaust with failureKind === 'infra', the slice halt reason names toolchain/install failure (tests never ran…) instead of generic retry exhaustion.

Greenfield promotion gains a test that package.json and lockfile are tracked in git after promoteGreenfieldRun. memory/PLAN.md and SPEC.md record FE-872 progress, dogfood-spike outcomes, and partial validation of assumption A98.

Reviewed by Cursor Bugbot for commit c902500. Bugbot is set up for automated code reviews on this repo. Configure here.

Comment thread src/orchestrator/src/test-runner.ts Outdated
kostandinang and others added 6 commits June 17, 2026 00:43
TestResult gains a failureKind?: 'infra' | 'test' discriminant so a
broken toolchain (missing runner binary / deps never installed) is no
longer indistinguishable from a logic failure that should send the
code-writer to fix the code.

ToolchainTestRunner.run classifies a failed run via classifyTestFailure,
deliberately conservative: only an unambiguous "the runner itself isn't
there" signal (spawn ENOENT, or a shell command-not-found) is infra;
everything else is test, because a missing module is ambiguous with a
legitimate TDD red and mislabeling a real failure as infra would
silently skip it. The tests-run net report surfaces an aggregate
failureKind (infra dominates) so consumers don't rescan results.

Amp-Thread-ID: https://ampcode.com/threads/T-019ecb9a-9a08-733b-833d-76885fc8243a
Co-authored-by: Amp <amp@ampcode.com>
When a slice exhausts its retry budget because the tests never executed
(toolchain missing / deps not installed), the halt reason now reads
"toolchain/install failure" instead of the misdirecting "retry
exhaustion", using the failureKind classified in slice 1.

Deliberately not a bespoke re-install net arc: the loop already loops
back and the agent re-installs natively via bash on its next turn, so the
harness only owns the honest terminal cause. Completes acceptance 1
(classify + react).

Amp-Thread-ID: https://ampcode.com/threads/T-019ecb9a-9a08-733b-833d-76885fc8243a
Co-authored-by: Amp <amp@ampcode.com>
…ice 3)

promoteGreenfieldRun's blanket copy already lands the manifest + lockfile
the cook agent produced; this turns that incidental behavior into an
asserted promotion invariant (git ls-files), guaranteeing a reproducible
promoted tree. Closes acceptance 2 for greenfield; brownfield dep-delta
capture stays blocked on the brownfield-promotion frontier.

Amp-Thread-ID: https://ampcode.com/threads/T-019ecb9a-9a08-733b-833d-76885fc8243a
Co-authored-by: Amp <amp@ampcode.com>
Ran a real brownfield cook (hand-authored 2-slice plan + node:http app,
throwaway repo) to de-risk app-runtime-probe / integration-oracle before
building them.

Verdict: the chain works end-to-end (CoW worktree, clean-tree gate,
per-slice->__epic__ merge composed the wiring, TDD red/green, working
branch untouched). The agent wired the feature reachable and
self-authored a genuine boot-and-probe integration test — the orphan did
not reproduce. But reachability was agent-discretion, not enforced,
confirming the value of an independent integration-oracle. Two
refinements: app-runtime-probe should own the boot mechanism (the agent
had to invent a .js->.ts resolve hook); dep-install was unexercised
(zero-dep app). Bonus: the 'Cannot find module' TDD red was handled as a
test-red, not infra — validates FE-872 slice 1 live.

Marks dogfood-spike done; folds findings into app-runtime-probe and
integration-oracle; updates SPEC A98 to partially-validated.

Amp-Thread-ID: https://ampcode.com/threads/T-019ecb9a-9a08-733b-833d-76885fc8243a
Co-authored-by: Amp <amp@ampcode.com>
Collapse three diverged test-execution paths onto a single TestRunner seam
and one runVerification verdict helper (Design C). evaluate-done and
verify-epic previously used a private spawn-based runTest that returned a
bare boolean, so FE-872's infra-vs-test failureKind was only visible on the
net run-tests path.

- add VerificationOutcome/VerificationResult to types
- add runVerification (test-runner.ts): the one place the >=1-and-all-pass
  verdict rule and the infra-dominates aggregate live; a throwing runner is
  an infra failure
- delete pi-actions.runTest + evaluateVerificationTargets; thread TestRunner
  (and an injectable session factory for tests) through createPiActions
- net-compiler run-tests now calls runVerification, dropping its inline
  loop + aggregate
- evaluate-done and verify-epic now surface failureKind in their reports

Amp-Thread-ID: https://ampcode.com/threads/T-019ecb9a-9a08-733b-833d-76885fc8243a
Co-authored-by: Amp <amp@ampcode.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@kostandinang kostandinang force-pushed the ka/fe-872-dep-install-classification branch from 4ef65e5 to c902500 Compare June 16, 2026 23:45
@kostandinang kostandinang force-pushed the ka/fe-871-brunch-detect branch from f60b7ea to 625b1b5 Compare June 16, 2026 23:45

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Want reviews to match your repository better? Bugbot Learning can learn team-specific rules from PR activity. A team admin can enable Learning in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit c902500. Configure here.

const haltReason =
failureKind === 'infra'
? `Slice ${sliceId} toolchain/install failure (tests never ran in ${maxRetries + 1} attempts)`
: `Slice ${sliceId} retry exhaustion`;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Halt reason overclaims never ran

Medium Severity

When retries are exhausted, an aggregate failureKind of infra always produces a halt reason stating tests never ran in all attempts. That aggregate reflects only the latest runVerification outcome and treats any infra signal (including a single target or a thrown runner) as toolchain failure, so earlier attempts or sibling targets that did execute tests still get a “never ran” terminal message.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit c902500. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant