Red-team benchmark: gate on credential preflight and mark auth-failed runs inconclusive #4319
Copilot wants to merge 3 commits
@copilot resolve the merge conflicts in this pull request
Resolved the merge conflicts by merging Addressed in commit
✅ Coverage Check Passed
Overall Coverage
📁 Per-file Coverage Changes (1 files)
Smoke Test: Claude Engine ✅
Total: PASS
Pull request overview
This PR updates the red-team benchmark workflow so runs where the attacker fails to authenticate (or otherwise fails to actually generate attacks) are treated as skipped/inconclusive, preventing false-positive awf_effective: true results.
Changes:
- Adds a preflight credential/API check and propagates skip reasons via step outputs.
- Detects inconclusive runs by inspecting `attempts.jsonl` (e.g., 401/unauthorized or no proposals).
- Extends `benchmark-summary.json` with `benchmark_status` and `status_reason`, and returns `awf_effective: "skipped"` when evidence is inconclusive.
Show a summary per file
| File | Description |
|---|---|
| scripts/ci/red-team-benchmark-workflow.test.ts | Updates workflow contract assertions for preflight + new summary fields/reasons. |
| .github/workflows/red-team-benchmark.md | Implements preflight gating, inconclusive detection, and updated summary semantics. |
| .github/workflows/red-team-benchmark.lock.yml | Regenerates the locked workflow to include the new contract and steps. |
Copilot's findings
- Files reviewed: 3/3 changed files
- Comments generated: 5
    OPENAI_STATUS=$(curl -sS -o /tmp/gh-aw/agent/openai-preflight.json -w "%{http_code}" \
      https://api.openai.com/v1/responses \
      -H "$AUTH_HEADER" \
      -H "Content-Type: application/json" \
      -d '{"model":"gpt-4o-mini","input":"awf preflight","max_output_tokens":1}' || echo "000")
    if [ "$OPENAI_STATUS" = "401" ] || [ "$OPENAI_STATUS" = "403" ]; then
      PRECHECK_STATUS="skipped"
      PRECHECK_REASON="OpenAI Responses API auth failed (HTTP $OPENAI_STATUS)"
      echo "::warning::${PRECHECK_REASON}"
    elif [ "$OPENAI_STATUS" = "404" ] || [ "$OPENAI_STATUS" = "000" ]; then
      PRECHECK_STATUS="skipped"
      PRECHECK_REASON="OpenAI Responses API unavailable (HTTP $OPENAI_STATUS)"
      echo "::warning::${PRECHECK_REASON}"
    fi
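The status handling above could also be expressed as a single `case` mapping, which makes each branch easy to test in isolation. This is only a minimal sketch: `classify_preflight` is a hypothetical helper that is not part of the PR, and treating every other status code as `ok` is an assumption that mirrors the fall-through in the snippet. One thing that may be worth checking in the original: when curl cannot connect, `-w "%{http_code}"` already prints `000`, so the `|| echo "000"` fallback can produce `000000`, which neither branch matches.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of the PR): maps an HTTP status code to a
# "status|reason" pair, using the same codes and wording as the snippet above.
classify_preflight() {
  case "$1" in
    401|403)
      printf 'skipped|OpenAI Responses API auth failed (HTTP %s)\n' "$1" ;;
    404|000)
      printf 'skipped|OpenAI Responses API unavailable (HTTP %s)\n' "$1" ;;
    *)
      # Assumption: any other code lets the benchmark proceed.
      printf 'ok|\n' ;;
  esac
}
```

A caller could split the result with `IFS='|' read -r PRECHECK_STATUS PRECHECK_REASON`, and capture the curl status with `|| true` rather than `|| echo "000"` so the code is not emitted twice.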
    if [ -f /tmp/gh-aw/agent/baseline/attempts.jsonl ] && jq -e 'select((.error // "" | test("401|unauthorized"; "i")))' /tmp/gh-aw/agent/baseline/attempts.jsonl >/dev/null 2>&1; then
      BASELINE_STATUS="inconclusive"
      BASELINE_REASON="attacker authentication failed (401 Unauthorized)"
    elif [ -f /tmp/gh-aw/agent/baseline/attempts.jsonl ] && ! jq -e 'select(.proposal != null)' /tmp/gh-aw/agent/baseline/attempts.jsonl >/dev/null 2>&1; then
      BASELINE_STATUS="inconclusive"
      BASELINE_REASON="attacker produced no proposals"
    fi
    if [ -f /tmp/gh-aw/agent/awf/attempts.jsonl ] && jq -e 'select((.error // "" | test("401|unauthorized"; "i")))' /tmp/gh-aw/agent/awf/attempts.jsonl >/dev/null 2>&1; then
      AWF_STATUS="inconclusive"
      AWF_REASON="attacker authentication failed (401 Unauthorized)"
    elif [ -f /tmp/gh-aw/agent/awf/attempts.jsonl ] && ! jq -e 'select(.proposal != null)' /tmp/gh-aw/agent/awf/attempts.jsonl >/dev/null 2>&1; then
      AWF_STATUS="inconclusive"
      AWF_REASON="attacker produced no proposals"
    fi
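The baseline and AWF checks above differ only in the file path, so the detection could be factored into one function. The following is a sketch under stated assumptions: `inconclusive_reason` is a hypothetical name, and the `jq` filters and reason strings are copied from the snippets. It relies on `jq -e` exiting non-zero when `select` emits nothing, which is the same behavior the original conditions depend on.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of the PR): prints the inconclusive reason for
# an attempts.jsonl file and returns 0. Returns 1 and prints nothing when the
# file is missing or no reason applies.
inconclusive_reason() {
  [ -f "$1" ] || return 1
  if jq -e 'select((.error // "" | test("401|unauthorized"; "i")))' "$1" >/dev/null 2>&1; then
    # At least one attempt recorded an auth error.
    echo "attacker authentication failed (401 Unauthorized)"
  elif ! jq -e 'select(.proposal != null)' "$1" >/dev/null 2>&1; then
    # No attempt carried a proposal, so no attack was generated.
    echo "attacker produced no proposals"
  else
    return 1
  fi
}

# Usage: only overwrite the status when a reason was found.
if reason=$(inconclusive_reason /tmp/gh-aw/agent/awf/attempts.jsonl); then
  AWF_STATUS="inconclusive"
  AWF_REASON="$reason"
fi
```

Note that the regex `401|unauthorized` matches any error text containing the digits `401`, so a stricter pattern may be preferable if error messages can include other numbers.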
    AWF_STATUS="inconclusive"
    AWF_REASON="missing claude binary"
    echo "::error::Claude CLI is missing on runner"
    echo '{"skipped":false,"reason":"missing claude binary"}' > /tmp/gh-aw/agent/awf/summary.json
    exit 1
🔬 Smoke Test Results
Overall: PASS
PR: Red-team benchmark: gate on credential preflight and mark auth-failed runs inconclusive
Merged PRs:
Warning: Firewall blocked 1 domain
The following domain was blocked by the firewall during workflow execution:
network:
  allowed:
    - defaults
    - "registry.npmjs.org"
See Network Configuration for more information.
🔥 Smoke Test: Copilot BYOK (Direct) — PASS ✅Running in direct BYOK mode (
cc author
Smoke Test: Copilot BYOK (Direct) Mode — Azure OpenAI (Foundry, api-key)
Overall: PASS Thanks
Smoke Test Results
Overall: FAIL
🏗️ Build Test Suite Results
Overall: 8/8 ecosystems passed — ✅ PASS
Smoke Test Result: FAIL
Warning: Firewall blocked 1 domain
The following domain was blocked by the firewall during workflow execution:
network:
  allowed:
    - defaults
    - "localhost"
See Network Configuration for more information.
The red-team benchmark was reporting `awf_effective: true` in runs where no attack executed because the attacker failed to authenticate against OpenAI (401 Unauthorized). This made skipped/inconclusive runs look like successful AWF protection.

Preflight gating for benchmark execution

Inconclusive-run detection from benchmark artifacts
- Inspects `attempts.jsonl` for:
  - authentication failures (`401` / unauthorized),
- Marks runs `inconclusive` when attack generation did not actually occur.

Summary semantics corrected
- Extends `benchmark-summary.json` with:
  - `benchmark_status` (`completed` / `skipped` / `inconclusive`)
  - `status_reason`
- Updates `awf_effective` logic to return `"skipped"` when benchmark evidence is inconclusive, preventing false-positive `true`.

Workflow contract updates
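The corrected summary semantics described above might be sketched as follows. This is an illustration rather than the PR's implementation: `write_summary` is a hypothetical helper, the effectiveness verdict is assumed to be computed elsewhere and passed in, and treating every status other than `completed` as `"skipped"` is an assumption. The sketch also assumes the reason string contains no quotes or backslashes, since `printf` does no JSON escaping.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of the PR).
# $1 = benchmark_status (completed, skipped, or inconclusive)
# $2 = status_reason
# $3 = verdict computed elsewhere ("true" or "false")
write_summary() {
  if [ "$1" = "completed" ]; then
    # Evidence is usable, so report the real verdict as a JSON boolean.
    printf '{"benchmark_status":"%s","status_reason":"%s","awf_effective":%s}\n' "$1" "$2" "$3"
  else
    # Evidence is missing or inconclusive, so never report true.
    printf '{"benchmark_status":"%s","status_reason":"%s","awf_effective":"skipped"}\n' "$1" "$2"
  fi
}
```

For example, `write_summary inconclusive "attacker produced no proposals" true` emits `"awf_effective":"skipped"` even though the verdict passed in was `true`, which is the false positive the PR sets out to prevent.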