fix(adjudicator): refute exploitable verdicts with no evidence anchor + clarify runtime/secret evidence in the prompt #130
Merged
… + clarify runtime/secret evidence in the prompt
An internet-facing watcher-server Pod came back `exploitable` ("connects to
exposed secrets which are mounted into the pod…") — a false breach. Its evidence:
CVEs (none), no exposed secret baked into the image, runtime = three benign
NetworkConnections to its own DB/metrics. The 1B judge fabricated evidence by
treating benign connections as a live signal and conflating reaching a secret/…
objective with an exposed secret in the image. Correct verdict: refuted.
Add the symmetric backstop to guard_fabricated_cve: guard_unsupported_exploitable
downgrades an Exploitable verdict to Refuted ONLY when ALL THREE exploitation
anchors are absent — empty CVE list, no exposed-secret finding, and no
corroborating runtime behavior (Behavior::is_alert() or exec_class::notable_exec,
the engine's existing definition; benign Network/File/Library/SecretRead are NOT
corroborating). Any anchor present leaves the model's call untouched. Wired after
guard_fabricated_cve in model_call; exposed-secret presence read from the same
entry_findings source the prompt uses.
Also two surgical prompt clarifications: a workload's own activity (network
connections, file reads, library loads, reading its own mounted secrets) is NOT a
live signal — only an ALERT or hands-on-keyboard action is; and reaching a
secret/… objective is NOT an exposed secret baked into the image. This shifts the
verdict fingerprint, so entries re-judge once.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VtjoJttCvBY4dzCoE4f9vP
Why

An internet-facing `watcher-server` Pod came back `exploitable` with reason "connects to exposed secrets which are mounted into the pod…", a false breach. Its evidence: CVEs `(none)`, no exposed secret baked into the image, runtime behavior = three benign `NetworkConnection`s to its own DB/metrics; all objectives `[MOUNTED]` (own creds) or `[NETWORK] [same-ns]` (own DB). Correct verdict: `refuted`. The 1B judge fabricated evidence by (a) treating benign network connections as a live signal and (b) conflating reaching a `secret/…` objective with an exposed secret in the image.

There was already a `guard_fabricated_cve` backstop; this adds the symmetric zero-anchor one for unsupported `exploitable`.

The guard
`guard_unsupported_exploitable` (in `guards.rs`, mirroring `guard_fabricated_cve`'s shape via the shared `guard_exploitable` gate) downgrades an `Exploitable` verdict to `Refuted` ONLY when ALL THREE exploitation anchors are absent:

- the CVE list is empty,
- there is no exposed-secret finding, and
- there is no corroborating runtime behavior.

"Corroborating runtime behavior" reuses the engine's existing definition: `Behavior::is_alert()` (a critical Falco alert) OR `exec_class::notable_exec(&behavior).is_some()` (a notable shell/pkg-manager exec, JEF-117). Benign `NetworkConnection`/`FileRead`/`LibraryLoaded`/`SecretRead` are not corroborating and never anchor an exploitable.

Conservative by design: if any anchor is present (a CVE in the list, even reachability:not-observed; an exposed secret; or a corroborating behavior), the model's (debatable) call stands untouched. This is purely the zero-anchor safety net. Like the fabrication guard it only ever acts on `Exploitable`; the entry is re-judged next pass.

Exposed-secret presence is read from the same source the prompt uses: `entry_findings(graph, entry)` returns `(secret_lines, posture_lines)`; a non-empty `secret_lines` means a usable credential is baked into the image (posture/RBAC is not an anchor). Wired in `model_call.rs`, chained after `guard_fabricated_cve`.

Prompt clarifications
Two surgical additions (existing structure/wording preserved):

- A workload's own activity (network connections, file reads, library loads, reading its own mounted secrets) is NOT a live signal; only an ALERT or hands-on-keyboard action is.
- Reaching a `secret/…` objective (a Credential-Access OUTCOME in the reachable-objectives list) is NOT an exposed secret baked into the image; only a credential in the "Exposed secrets baked into this image" field is exploitation evidence.

Fingerprint shift: changing the prompt string deterministically shifts the verdict-cache fingerprint inputs at the prompt level, so entries re-judge once. Expected. No code-level snapshot pins the prompt text; the only test affected was the prompt-size bound (raised from 4,000 to 5,000 to account for the larger static template; the assertion still proves the untrusted-payload cap, since a megabyte title would blow past it by orders of magnitude).
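For reviewers who want the fingerprint shift and the size bound in concrete terms, here is a minimal sketch. It is not the engine's code: `verdict_fingerprint`, `build_prompt`, and the `max_title_chars` parameter are hypothetical names, and the real fingerprint may hash different inputs with a different hasher. It only illustrates why a changed template forces one re-judge and why a capped title keeps the prompt far below the bound.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical fingerprint: hashes the static prompt template together with
/// the entry's evidence lines. Any edit to the template changes the result,
/// so a cached verdict keyed on it is no longer found and the entry re-judges.
pub fn verdict_fingerprint(prompt_template: &str, evidence: &[&str]) -> u64 {
    let mut hasher = DefaultHasher::new();
    prompt_template.hash(&mut hasher);
    evidence.hash(&mut hasher);
    hasher.finish()
}

/// Hypothetical prompt builder: the untrusted title is truncated to
/// `max_title_chars` characters before it is spliced into the template,
/// so the prompt size is bounded no matter how large the title is.
pub fn build_prompt(template: &str, untrusted_title: &str, max_title_chars: usize) -> String {
    let title: String = untrusted_title.chars().take(max_title_chars).collect();
    format!("{template}\nTitle: {title}")
}
```

With unchanged evidence, the old and new templates give different fingerprints, while hashing the same template twice gives the same value. A one-megabyte title passed through `build_prompt` with a small cap still yields a prompt far shorter than 5,000 bytes, which is the property the size-bound test relies on.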
Tests

- `Exploitable` + empty CVEs + no exposed secret + only benign behaviors (the watcher case + misc benign) → `Refuted`.
- Non-`Exploitable` verdicts (`Refuted`/`Confirmed`/`Uncertain`) untouched.

All existing adjudicate tests kept green.
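The guard logic these tests exercise can be sketched roughly as follows. This is an illustration under assumptions, not the code in `guards.rs`: the `Verdict` and `Behavior` enums, the flat function signature, and the list of "notable" binaries are hypothetical stand-ins, and the real guard goes through the shared `guard_exploitable` gate and `exec_class::notable_exec`.

```rust
/// Hypothetical stand-in for the engine's verdict type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    Exploitable,
    Refuted,
    Confirmed,
    Uncertain,
}

/// Hypothetical stand-in for the engine's runtime behaviors.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Behavior {
    NetworkConnection,
    FileRead,
    LibraryLoaded,
    SecretRead,
    Alert,        // stands in for a critical Falco alert
    Exec(String), // a process exec, identified by binary name
}

impl Behavior {
    pub fn is_alert(&self) -> bool {
        matches!(self, Behavior::Alert)
    }
}

/// Assumed notion of a "notable" exec: a shell or package manager.
/// The binary list is illustrative only.
pub fn notable_exec(behavior: &Behavior) -> Option<&str> {
    match behavior {
        Behavior::Exec(bin) if ["sh", "bash", "apt", "apk"].contains(&bin.as_str()) => {
            Some(bin.as_str())
        }
        _ => None,
    }
}

/// Downgrades `Exploitable` to `Refuted` only when all three anchors are
/// absent; every other verdict and every anchored case passes through.
pub fn guard_unsupported_exploitable(
    verdict: Verdict,
    cves: &[String],
    secret_lines: &[String],
    behaviors: &[Behavior],
) -> Verdict {
    if verdict != Verdict::Exploitable {
        return verdict;
    }
    let corroborated = behaviors
        .iter()
        .any(|b| b.is_alert() || notable_exec(b).is_some());
    if cves.is_empty() && secret_lines.is_empty() && !corroborated {
        Verdict::Refuted
    } else {
        verdict
    }
}
```

In this sketch the watcher case (no CVEs, no secret lines, three `NetworkConnection` behaviors) comes back `Refuted`, while adding a single CVE, a secret line, an `Alert`, or a shell exec leaves `Exploitable` untouched.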
Gates (from `engine/`)

`cargo fmt` · `cargo build` · `cargo clippy --all-targets` (clean, warnings = errors) · `cargo test`: 353 passed, 0 failed, 1 ignored (the e2e test needing `PROTECTOR_E2E_MODEL`). File-size guard green.

Closes JEF-watcher-false-breach.
🤖 Generated with Claude Code