Red-team benchmark: gate on credential preflight and mark auth-failed runs inconclusive by Copilot · Pull Request #4319 · github/gh-aw-firewall

Copilot · 2026-06-04T14:37:37Z

The red-team benchmark was reporting awf_effective: true in runs where no attack executed because the attacker failed to authenticate against OpenAI (401 Unauthorized). This made skipped/inconclusive runs look like successful AWF protection.

Preflight gating for benchmark execution
- Added a preflight step to validate benchmark prerequisites before either the baseline or the AWF-protected run proceeds.
- Captures explicit skip reasons (missing API keys, OpenAI Responses API auth/access failure) and propagates them via workflow outputs.

Inconclusive-run detection from benchmark artifacts
- Added artifact-based checks on attempts.jsonl for:
  - attacker auth failures (401 / unauthorized),
  - no attacker proposals generated.
- Marks baseline/AWF runs as inconclusive when attack generation did not actually occur.

Summary semantics corrected
- Extended benchmark-summary.json with:
  - benchmark_status (completed / skipped / inconclusive)
  - status_reason
- Updated awf_effective logic to return "skipped" when benchmark evidence is inconclusive, preventing a false-positive true.

Workflow contract updates
- Updated red-team benchmark workflow test coverage to assert the new preflight and summary-status contract.
- Regenerated the corresponding lock workflow from source.

# Example of new summary contract
{
  "benchmark_status": "inconclusive",
  "status_reason": "attacker authentication failed (401 Unauthorized)",
  "awf_effective": "skipped"
}
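
A minimal sketch of the gating this contract implies, assuming a verdict is only reported when both runs completed. The function name gate_awf_effective and its three arguments are hypothetical. How the workflow derives the underlying true/false verdict is not shown in this description, so the sketch takes it as an input.

```shell
# Hypothetical helper: report a verdict only when both runs completed.
# Arguments: baseline status, AWF status, raw verdict ("true" or "false").
gate_awf_effective() {
  local baseline_status="$1" awf_status="$2" raw_verdict="$3"
  if [ "$baseline_status" = "completed" ] && [ "$awf_status" = "completed" ]; then
    printf '%s\n' "$raw_verdict"
  else
    # Either run was skipped or inconclusive, so no verdict is reported.
    printf 'skipped\n'
  fi
}
```

For example, `gate_awf_effective completed inconclusive true` prints `skipped`, which is the case this PR is meant to stop reporting as `true`.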

lpcox · 2026-06-05T18:14:54Z

@copilot resolve the merge conflicts in this pull request

Copilot · 2026-06-05T18:20:14Z

> @copilot resolve the merge conflicts in this pull request

Resolved the merge conflicts by merging origin/main into this branch and fixing the conflict in .github/workflows/red-team-benchmark.lock.yml.

Addressed in commit 03cbd25.

github-actions · 2026-06-05T19:32:02Z

✅ Coverage Check Passed

Overall Coverage

Metric	Base	PR	Delta
Lines	96.61%	96.65%	📈 +0.04%
Statements	96.51%	96.55%	📈 +0.04%
Functions	98.70%	98.70%	➡️ +0.00%
Branches	91.03%	91.07%	📈 +0.04%

📁 Per-file Coverage Changes (1 files)

File	Lines (Before → After)	Statements (Before → After)
`src/config-writer.ts`	89.3% → 90.9% (+1.65%)	89.3% → 90.9% (+1.65%)

Coverage comparison generated by scripts/ci/compare-coverage.ts

github-actions · 2026-06-05T19:33:43Z

Smoke Test: Claude Engine ✅

✅ GitHub API: 2 PR entries confirmed (fix: fail fast on unreachable cli-proxy to stop contribution-check token exhaustion #4376, fix: propagate config fields to all layers #4372)
✅ GitHub check: playwright_check=✅ PASS
✅ File verify: smoke-test-claude-27032417603.txt exists

Total: PASS

💥 [THE END] — Illustrated by Smoke Claude

Copilot

Pull request overview

This PR updates the red-team benchmark workflow so runs where the attacker fails to authenticate (or otherwise fails to actually generate attacks) are treated as skipped/inconclusive, preventing false-positive awf_effective: true results.

Changes:

- Adds a preflight credential/API check and propagates skip reasons via step outputs.
- Detects inconclusive runs by inspecting attempts.jsonl (e.g., 401/unauthorized or no proposals).
- Extends benchmark-summary.json with benchmark_status and status_reason, and returns awf_effective: "skipped" when evidence is inconclusive.

Show a summary per file

File	Description
scripts/ci/red-team-benchmark-workflow.test.ts	Updates workflow contract assertions for preflight + new summary fields/reasons.
.github/workflows/red-team-benchmark.md	Implements preflight gating, inconclusive detection, and updated summary semantics.
.github/workflows/red-team-benchmark.lock.yml	Regenerates the locked workflow to include the new contract and steps.

Copilot's findings

Files reviewed: 3/3 changed files
Comments generated: 5

+        OPENAI_STATUS=$(curl -sS -o /tmp/gh-aw/agent/openai-preflight.json -w "%{http_code}" \
+          https://api.openai.com/v1/responses \
+          -H "$AUTH_HEADER" \
+          -H "Content-Type: application/json" \
+          -d '{"model":"gpt-4o-mini","input":"awf preflight","max_output_tokens":1}' || echo "000")


+        if [ "$OPENAI_STATUS" = "401" ] || [ "$OPENAI_STATUS" = "403" ]; then
+          PRECHECK_STATUS="skipped"
+          PRECHECK_REASON="OpenAI Responses API auth failed (HTTP $OPENAI_STATUS)"
+          echo "::warning::${PRECHECK_REASON}"
+        elif [ "$OPENAI_STATUS" = "404" ] || [ "$OPENAI_STATUS" = "000" ]; then
+          PRECHECK_STATUS="skipped"
+          PRECHECK_REASON="OpenAI Responses API unavailable (HTTP $OPENAI_STATUS)"
+          echo "::warning::${PRECHECK_REASON}"
+        fi
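
The status branching above can be sketched as a small function that is testable without network access. The name classify_preflight, the tab-separated output, and the `ok` default for any other status code are assumptions made for illustration; the diff shown here does not say how other codes (for example 429 or 5xx) are handled. curl can itself print 000 through -w before the `|| echo "000"` fallback runs, so the sketch also accepts 000000. That extra case is an addition, not something the diff does.

```shell
# Hypothetical helper: map an HTTP status string to "<status><TAB><reason>".
classify_preflight() {
  case "$1" in
    401|403)
      printf 'skipped\tOpenAI Responses API auth failed (HTTP %s)\n' "$1" ;;
    404|000|000000)
      # 000 means curl got no HTTP response; 000000 is the doubled form
      # that appears if curl's -w output and the echo fallback both fire.
      printf 'skipped\tOpenAI Responses API unavailable (HTTP %s)\n' "$1" ;;
    *)
      # Assumption: every other code lets the benchmark proceed.
      printf 'ok\t\n' ;;
  esac
}
```

A caller could split the result with `cut -f1` for the status and `cut -f2` for the reason.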


+        if [ -f /tmp/gh-aw/agent/baseline/attempts.jsonl ] && jq -e 'select((.error // "" | test("401|unauthorized"; "i")))' /tmp/gh-aw/agent/baseline/attempts.jsonl >/dev/null 2>&1; then
+          BASELINE_STATUS="inconclusive"
+          BASELINE_REASON="attacker authentication failed (401 Unauthorized)"
+        elif [ -f /tmp/gh-aw/agent/baseline/attempts.jsonl ] && ! jq -e 'select(.proposal != null)' /tmp/gh-aw/agent/baseline/attempts.jsonl >/dev/null 2>&1; then
+          BASELINE_STATUS="inconclusive"
+          BASELINE_REASON="attacker produced no proposals"
+        fi


+        if [ -f /tmp/gh-aw/agent/awf/attempts.jsonl ] && jq -e 'select((.error // "" | test("401|unauthorized"; "i")))' /tmp/gh-aw/agent/awf/attempts.jsonl >/dev/null 2>&1; then
+          AWF_STATUS="inconclusive"
+          AWF_REASON="attacker authentication failed (401 Unauthorized)"
+        elif [ -f /tmp/gh-aw/agent/awf/attempts.jsonl ] && ! jq -e 'select(.proposal != null)' /tmp/gh-aw/agent/awf/attempts.jsonl >/dev/null 2>&1; then
+          AWF_STATUS="inconclusive"
+          AWF_REASON="attacker produced no proposals"
+        fi
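
The baseline and AWF checks above are the same logic applied to two paths, so they could be factored into one helper. This is a sketch only: the name attempts_status, the tab-separated output, and the `unchecked` result for a missing file are hypothetical (the diff leaves the status untouched in that case). It assumes jq is installed and slurps the file with -s so that `any` yields a single true/false for `-e` to turn into an exit code.

```shell
# Hypothetical helper: print "<status>[<TAB><reason>]" for one attempts.jsonl.
attempts_status() {
  local file="$1"
  if [ ! -f "$file" ]; then
    printf 'unchecked\n'   # assumption: a missing file is not classified
    return 0
  fi
  # Any record whose error text mentions 401 or "unauthorized" (case-insensitive).
  if jq -e -s 'any(.[]; (.error // "" | tostring) | test("401|unauthorized"; "i"))' \
       "$file" >/dev/null 2>&1; then
    printf 'inconclusive\tattacker authentication failed (401 Unauthorized)\n'
  # No record carries a non-null proposal, so no attack was generated.
  elif ! jq -e -s 'any(.[]; .proposal != null)' "$file" >/dev/null 2>&1; then
    printf 'inconclusive\tattacker produced no proposals\n'
  else
    printf 'completed\n'
  fi
}
```

Usage would look like `attempts_status /tmp/gh-aw/agent/baseline/attempts.jsonl`, with the same call repeated for the awf directory.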


+        AWF_STATUS="inconclusive"
+        AWF_REASON="missing claude binary"
        echo "::error::Claude CLI is missing on runner"
        echo '{"skipped":false,"reason":"missing claude binary"}' > /tmp/gh-aw/agent/awf/summary.json
        exit 1


github-actions · 2026-06-05T19:34:17Z

🔬 Smoke Test Results

Test	Result
GitHub MCP connectivity	✅
GitHub.com HTTP	✅ 200
File write/read	✅

Overall: PASS

PR: Red-team benchmark: gate on credential preflight and mark auth-failed runs inconclusive
Author: @Copilot · Assignees: @lpcox, @Copilot

📰 BREAKING: Report filed by Smoke Copilot

github-actions · 2026-06-05T19:34:30Z

Merged PRs:

Refactor WrapperConfig type composition: split CLI proxy and Docker host options out of ApiProxyOptions
fix: fail fast on unreachable cli-proxy to stop contribution-check token exhaustion
Build AWF smoke test
GitHub title check
File write and verify
Discussion comment
Status: PASS

Warning

Firewall blocked 1 domain

The following domain was blocked by the firewall during workflow execution:

registry.npmjs.org

To allow these domains, add them to the network.allowed list in your workflow frontmatter:

network:
  allowed:
    - defaults
    - "registry.npmjs.org"

See Network Configuration for more information.

🔮 The oracle has spoken through Smoke Codex

github-actions · 2026-06-05T19:35:24Z

🔥 Smoke Test: Copilot BYOK (Direct) — PASS ✅

Running in direct BYOK mode (COPILOT_PROVIDER_API_KEY) via api-proxy → api.githubcopilot.com

Test	Result
GitHub MCP — latest PR: ci: smoke-test Copilot BYOK against Azure OpenAI Foundry (api-key)	✅
GitHub.com connectivity (HTTP 200)	✅
File write/read	✅
BYOK inference	✅

cc author @Copilot · assignees @lpcox @Copilot

🔑 BYOK report filed by Smoke Copilot BYOK

github-actions · 2026-06-05T19:35:41Z

Smoke Test: Copilot BYOK (Direct) Mode — Azure OpenAI (Foundry, api-key)

PR Title: Red-team benchmark: gate on credential preflight and mark auth-failed runs inconclusive
GitHub MCP connectivity: ✅
GitHub.com HTTP connectivity: ✅
File write/read test: ✅
BYOK inference via api-proxy sidecar to Azure OpenAI: ✅
Running in direct BYOK mode (COPILOT_PROVIDER_API_KEY + COPILOT_PROVIDER_BASE_URL) via api-proxy → Azure OpenAI (Foundry, o4-mini-aw)

Overall: PASS

Thanks @Copilot and @lpcox!

🔑 BYOK (AOAI api-key) report filed by Smoke Copilot BYOK AOAI (api-key)

github-actions · 2026-06-05T19:35:51Z

Smoke Test Results

Redis PING: ❌ connection timed out (port 6379)
PostgreSQL pg_isready: ❌ no response (port 5432)
PostgreSQL SELECT 1: ❌ no response (port 5432)

host.docker.internal resolves to 172.17.0.1 but both service ports are unreachable (TCP timeout). Services may not be running or ports are firewalled from this runner context.

Overall: FAIL

🔌 Service connectivity validated by Smoke Services

github-actions · 2026-06-05T19:36:07Z

🏗️ Build Test Suite Results

Ecosystem	Project	Build/Install	Tests	Status
Bun	elysia	✅	1/1 passed	✅ PASS
Bun	hono	✅	1/1 passed	✅ PASS
C++	fmt	✅	N/A	✅ PASS
C++	json	✅	N/A	✅ PASS
Deno	oak	N/A	1/1 passed	✅ PASS
Deno	std	N/A	1/1 passed	✅ PASS
.NET	hello-world	✅	N/A	✅ PASS
.NET	json-parse	✅	N/A	✅ PASS
Go	color	✅	1/1 passed	✅ PASS
Go	env	✅	1/1 passed	✅ PASS
Go	uuid	✅	1/1 passed	✅ PASS
Java	gson	✅	1/1 passed	✅ PASS
Java	caffeine	✅	1/1 passed	✅ PASS
Node.js	clsx	✅	passed	✅ PASS
Node.js	execa	✅	passed	✅ PASS
Node.js	p-limit	✅	passed	✅ PASS
Rust	fd	✅	1/1 passed	✅ PASS
Rust	zoxide	✅	1/1 passed	✅ PASS

Overall: 8/8 ecosystems passed — ✅ PASS

Generated by Build Test Suite for issue #4319 · sonnet46 1.3M

github-actions · 2026-06-05T19:36:22Z

Smoke Test Result: FAIL

Warning

Firewall blocked 1 domain

The following domain was blocked by the firewall during workflow execution:

localhost

To allow these domains, add them to the network.allowed list in your workflow frontmatter:

network:
  allowed:
    - defaults
    - "localhost"

See Network Configuration for more information.

💎 Faceted by Smoke Gemini

Initial plan

90f3ac5

Copilot AI assigned Copilot and lpcox Jun 4, 2026

Copilot started work on behalf of lpcox June 4, 2026 14:52

Copilot AI linked an issue Jun 4, 2026 that may be closed by this pull request

[Red-Team Benchmark] AWF Red-Team Benchmark — 2026-06-02 — SKIPPED #4193

Closed

fix: mark red-team benchmark inconclusive on auth failures

9e5f270

Copilot AI changed the title from "[WIP] Fix authentication issue for AWF Red-Team Benchmark" to "Red-team benchmark: gate on credential preflight and mark auth-failed runs inconclusive" Jun 4, 2026

Copilot AI requested a review from lpcox June 4, 2026 15:03

Copilot finished work on behalf of lpcox June 4, 2026 15:03

Copilot started work on behalf of lpcox June 5, 2026 18:15

chore: merge origin/main and resolve red-team workflow conflict

03cbd25

Copilot finished work on behalf of lpcox June 5, 2026 18:20

lpcox marked this pull request as ready for review June 5, 2026 19:30

Copilot AI review requested due to automatic review settings June 5, 2026 19:30

Copilot started reviewing on behalf of lpcox June 5, 2026 19:30

github-actions Bot added the smoke-claude label Jun 5, 2026

Copilot AI reviewed Jun 5, 2026

github-actions Bot added the smoke-copilot label Jun 5, 2026

github-actions Bot added the smoke-codex label Jun 5, 2026

github-actions Bot added the smoke-copilot-byok label Jun 5, 2026

github-actions Bot added the smoke-copilot-byok-aoai-apikey label Jun 5, 2026

github-actions Bot added the build-test label Jun 5, 2026

Copilot wants to merge 3 commits into main from copilot/red-team-benchmark-fix

3 participants
