Red-team benchmark: gate on credential preflight and mark auth-failed runs inconclusive #4319

Open
Copilot wants to merge 3 commits into main from copilot/red-team-benchmark-fix

Contributor

Copilot AI commented Jun 4, 2026

The red-team benchmark was reporting awf_effective: true in runs where no attack executed because the attacker failed to authenticate against OpenAI (401 Unauthorized). This made skipped/inconclusive runs look like successful AWF protection.

  • Preflight gating for benchmark execution

    • Added a preflight step to validate benchmark prerequisites before either the baseline or the AWF-protected run proceeds.
    • Captures explicit skip reasons (missing API keys, OpenAI Responses API auth/access failure) and propagates them via workflow outputs.
  • Inconclusive-run detection from benchmark artifacts

    • Added artifact-based checks on attempts.jsonl for:
      • attacker auth failures (401 / unauthorized),
      • no attacker proposals generated.
    • Marks baseline/AWF runs as inconclusive when attack generation did not actually occur.
  • Summary semantics corrected

    • Extended benchmark-summary.json with:
      • benchmark_status (completed / skipped / inconclusive)
      • status_reason
    • Updated awf_effective logic to return "skipped" when benchmark evidence is inconclusive, preventing a false-positive true.
  • Workflow contract updates

    • Updated red-team benchmark workflow test coverage to assert the new preflight and summary-status contract.
    • Regenerated the corresponding lock workflow from source.
# Example of new summary contract
{
  "benchmark_status": "inconclusive",
  "status_reason": "attacker authentication failed (401 Unauthorized)",
  "awf_effective": "skipped"
}
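
As a rough illustration of the summary semantics described above, the following shell sketch shows one way the per-run statuses might be folded into the summary fields. The function name `summarize_benchmark`, its argument order, and the `ready` preflight value are hypothetical and are not taken from the PR; only the field names and the `"skipped"` value for `awf_effective` come from the description. `status_reason` is left out to keep the sketch short.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): combine the preflight, baseline, and
# AWF run statuses into benchmark_status and awf_effective.
# Usage: summarize_benchmark PRECHECK BASELINE AWF AWF_BLOCKED
#   PRECHECK     "skipped" or anything else (assumed: "ready")
#   BASELINE/AWF "inconclusive" or anything else (assumed: "completed")
#   AWF_BLOCKED  "true" or "false", only trusted when attacks really ran
summarize_benchmark() {
  local precheck="$1" baseline="$2" awf="$3" awf_blocked="$4"
  local status="completed" effective

  if [ "$precheck" = "skipped" ]; then
    status="skipped"
  elif [ "$baseline" = "inconclusive" ] || [ "$awf" = "inconclusive" ]; then
    status="inconclusive"
  fi

  if [ "$status" = "completed" ]; then
    effective="$awf_blocked"     # JSON boolean backed by real attack evidence
  else
    effective='"skipped"'        # JSON string: no verdict without evidence
  fi

  printf '{"benchmark_status":"%s","awf_effective":%s}\n' "$status" "$effective"
}

summarize_benchmark ready inconclusive completed true
# prints {"benchmark_status":"inconclusive","awf_effective":"skipped"}
```

The point of the sketch is that `awf_effective` only carries a boolean when both runs produced usable evidence; every other path yields the string `"skipped"`.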

Copilot AI changed the title [WIP] Fix authentication issue for AWF Red-Team Benchmark Red-team benchmark: gate on credential preflight and mark auth-failed runs inconclusive Jun 4, 2026
Copilot AI requested a review from lpcox June 4, 2026 15:03
Copilot finished work on behalf of lpcox June 4, 2026 15:03
Collaborator

lpcox commented Jun 5, 2026

@copilot resolve the merge conflicts in this pull request

Contributor Author

Copilot AI commented Jun 5, 2026

@copilot resolve the merge conflicts in this pull request

Resolved the merge conflicts by merging origin/main into this branch and fixing the conflict in .github/workflows/red-team-benchmark.lock.yml.

Addressed in commit 03cbd25.

Copilot finished work on behalf of lpcox June 5, 2026 18:20
@lpcox lpcox marked this pull request as ready for review June 5, 2026 19:30
Copilot AI review requested due to automatic review settings June 5, 2026 19:30
Contributor

github-actions Bot commented Jun 5, 2026

✅ Coverage Check Passed

Overall Coverage

| Metric | Base | PR | Delta |
|---|---|---|---|
| Lines | 96.61% | 96.65% | 📈 +0.04% |
| Statements | 96.51% | 96.55% | 📈 +0.04% |
| Functions | 98.70% | 98.70% | ➡️ +0.00% |
| Branches | 91.03% | 91.07% | 📈 +0.04% |
📁 Per-file Coverage Changes (1 files)
| File | Lines (Before → After) | Statements (Before → After) |
|---|---|---|
| src/config-writer.ts | 89.3% → 90.9% (+1.65%) | 89.3% → 90.9% (+1.65%) |

Coverage comparison generated by scripts/ci/compare-coverage.ts

Contributor

github-actions Bot commented Jun 5, 2026

Smoke Test: Claude Engine ✅

Total: PASS

💥 [THE END] — Illustrated by Smoke Claude

Contributor

Copilot AI left a comment

Pull request overview

This PR updates the red-team benchmark workflow so runs where the attacker fails to authenticate (or otherwise fails to actually generate attacks) are treated as skipped/inconclusive, preventing false-positive awf_effective: true results.

Changes:

  • Adds a preflight credential/API check and propagates skip reasons via step outputs.
  • Detects inconclusive runs by inspecting attempts.jsonl (e.g., 401/unauthorized or no proposals).
  • Extends benchmark-summary.json with benchmark_status and status_reason, and returns awf_effective: "skipped" when evidence is inconclusive.
Show a summary per file
| File | Description |
|---|---|
| scripts/ci/red-team-benchmark-workflow.test.ts | Updates workflow contract assertions for preflight + new summary fields/reasons. |
| .github/workflows/red-team-benchmark.md | Implements preflight gating, inconclusive detection, and updated summary semantics. |
| .github/workflows/red-team-benchmark.lock.yml | Regenerates the locked workflow to include the new contract and steps. |

Copilot's findings

  • Files reviewed: 3/3 changed files
  • Comments generated: 5

Comment on lines +149 to +153
OPENAI_STATUS=$(curl -sS -o /tmp/gh-aw/agent/openai-preflight.json -w "%{http_code}" \
  https://api.openai.com/v1/responses \
  -H "$AUTH_HEADER" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o-mini","input":"awf preflight","max_output_tokens":1}' || echo "000")
Comment on lines +154 to +162
if [ "$OPENAI_STATUS" = "401" ] || [ "$OPENAI_STATUS" = "403" ]; then
  PRECHECK_STATUS="skipped"
  PRECHECK_REASON="OpenAI Responses API auth failed (HTTP $OPENAI_STATUS)"
  echo "::warning::${PRECHECK_REASON}"
elif [ "$OPENAI_STATUS" = "404" ] || [ "$OPENAI_STATUS" = "000" ]; then
  PRECHECK_STATUS="skipped"
  PRECHECK_REASON="OpenAI Responses API unavailable (HTTP $OPENAI_STATUS)"
  echo "::warning::${PRECHECK_REASON}"
fi
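
For illustration only, the status-to-reason mapping in the snippet above could be pulled into a small function so it can be exercised without calling the API. This is a sketch under assumptions: the name `classify_preflight`, the `status|reason` output format, and the `ready` value for non-skipped codes are hypothetical and do not appear in the PR. It also assumes that a failed `curl` may already have written `000` through `-w` before the `|| echo "000"` fallback runs, so any value starting with `000` is normalized to `000`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): map an HTTP status code from the
# preflight request to "STATUS|REASON".
classify_preflight() {
  local code="$1"

  # Assumption: a transport failure can yield "000000" (curl's own 000 plus
  # the fallback echo), so collapse anything starting with 000.
  case "$code" in
    000*) code="000" ;;
  esac

  case "$code" in
    401|403)
      printf 'skipped|OpenAI Responses API auth failed (HTTP %s)\n' "$code" ;;
    404|000)
      printf 'skipped|OpenAI Responses API unavailable (HTTP %s)\n' "$code" ;;
    *)
      # Assumed default: any other code lets the benchmark proceed.
      printf 'ready|\n' ;;
  esac
}

# Usage: split the result with POSIX parameter expansion.
result=$(classify_preflight 401)
PRECHECK_STATUS=${result%%|*}
PRECHECK_REASON=${result#*|}
echo "$PRECHECK_STATUS: $PRECHECK_REASON"
# prints skipped: OpenAI Responses API auth failed (HTTP 401)
```

Like the original snippet, the sketch lets codes such as 429 or 5xx fall through to the default branch; whether those should also skip the benchmark is a design decision the PR text does not settle.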
Comment on lines +197 to +203
if [ -f /tmp/gh-aw/agent/baseline/attempts.jsonl ] && jq -e 'select((.error // "" | test("401|unauthorized"; "i")))' /tmp/gh-aw/agent/baseline/attempts.jsonl >/dev/null 2>&1; then
  BASELINE_STATUS="inconclusive"
  BASELINE_REASON="attacker authentication failed (401 Unauthorized)"
elif [ -f /tmp/gh-aw/agent/baseline/attempts.jsonl ] && ! jq -e 'select(.proposal != null)' /tmp/gh-aw/agent/baseline/attempts.jsonl >/dev/null 2>&1; then
  BASELINE_STATUS="inconclusive"
  BASELINE_REASON="attacker produced no proposals"
fi
Comment on lines +261 to +267
if [ -f /tmp/gh-aw/agent/awf/attempts.jsonl ] && jq -e 'select((.error // "" | test("401|unauthorized"; "i")))' /tmp/gh-aw/agent/awf/attempts.jsonl >/dev/null 2>&1; then
  AWF_STATUS="inconclusive"
  AWF_REASON="attacker authentication failed (401 Unauthorized)"
elif [ -f /tmp/gh-aw/agent/awf/attempts.jsonl ] && ! jq -e 'select(.proposal != null)' /tmp/gh-aw/agent/awf/attempts.jsonl >/dev/null 2>&1; then
  AWF_STATUS="inconclusive"
  AWF_REASON="attacker produced no proposals"
fi
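
The baseline and AWF checks above differ only in the file path and variable prefix, so they could plausibly share one helper. The sketch below is an assumption-laden illustration, not the PR's code: the name `classify_attempts`, the `status|reason` output format, and the `completed` default are hypothetical. It reuses the two `jq` filters from the snippets and requires `jq` on the PATH.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): inspect an attempts.jsonl file and
# print "STATUS|REASON". Requires jq.
classify_attempts() {
  local file="$1"

  # Like the original snippets, a missing file changes nothing; the
  # "completed" default here is an assumption.
  if [ ! -f "$file" ]; then
    printf 'completed|\n'
    return 0
  fi

  # jq -e exits 0 when at least one record matches the select (its last
  # output is then a non-null object) and non-zero when nothing is emitted.
  if jq -e 'select((.error // "") | test("401|unauthorized"; "i"))' "$file" >/dev/null 2>&1; then
    printf 'inconclusive|attacker authentication failed (401 Unauthorized)\n'
  elif ! jq -e 'select(.proposal != null)' "$file" >/dev/null 2>&1; then
    printf 'inconclusive|attacker produced no proposals\n'
  else
    printf 'completed|\n'
  fi
}

# Example (paths taken from the snippets above):
#   result=$(classify_attempts /tmp/gh-aw/agent/baseline/attempts.jsonl)
#   BASELINE_STATUS=${result%%|*}
#   BASELINE_REASON=${result#*|}
```

One caveat carried over from the original filters: if `jq` itself is unavailable, both invocations fail and the file is reported as having no proposals, so a caller may want to check for `jq` first.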
Comment on lines +231 to 235
AWF_STATUS="inconclusive"
AWF_REASON="missing claude binary"
echo "::error::Claude CLI is missing on runner"
echo '{"skipped":false,"reason":"missing claude binary"}' > /tmp/gh-aw/agent/awf/summary.json
exit 1
Contributor

github-actions Bot commented Jun 5, 2026

🔬 Smoke Test Results

Test Result
GitHub MCP connectivity
GitHub.com HTTP ✅ 200
File write/read

Overall: PASS

PR: Red-team benchmark: gate on credential preflight and mark auth-failed runs inconclusive
Author: @Copilot · Assignees: @lpcox, @Copilot

📰 BREAKING: Report filed by Smoke Copilot

Contributor

github-actions Bot commented Jun 5, 2026

Merged PRs:

  • Refactor WrapperConfig type composition: split CLI proxy and Docker host options out of ApiProxyOptions
  • fix: fail fast on unreachable cli-proxy to stop contribution-check token exhaustion
  • Build AWF smoke test
  • GitHub title check
  • File write and verify
  • Discussion comment
    Status: PASS

Warning

Firewall blocked 1 domain

The following domain was blocked by the firewall during workflow execution:

  • registry.npmjs.org

To allow these domains, add them to the network.allowed list in your workflow frontmatter:

network:
  allowed:
    - defaults
    - "registry.npmjs.org"

See Network Configuration for more information.

🔮 The oracle has spoken through Smoke Codex

Contributor

github-actions Bot commented Jun 5, 2026

🔥 Smoke Test: Copilot BYOK (Direct) — PASS

Running in direct BYOK mode (COPILOT_PROVIDER_API_KEY) via api-proxy → api.githubcopilot.com

Test Result
GitHub MCP — latest PR: ci: smoke-test Copilot BYOK against Azure OpenAI Foundry (api-key)
GitHub.com connectivity (HTTP 200)
File write/read
BYOK inference

cc author @Copilot · assignees @lpcox @Copilot

🔑 BYOK report filed by Smoke Copilot BYOK

Contributor

github-actions Bot commented Jun 5, 2026

Smoke Test: Copilot BYOK (Direct) Mode — Azure OpenAI (Foundry, api-key)

  • PR Title: Red-team benchmark: gate on credential preflight and mark auth-failed runs inconclusive
  • GitHub MCP connectivity: ✅
  • GitHub.com HTTP connectivity: ✅
  • File write/read test: ✅
  • BYOK inference via api-proxy sidecar to Azure OpenAI: ✅
  • Running in direct BYOK mode (COPILOT_PROVIDER_API_KEY + COPILOT_PROVIDER_BASE_URL) via api-proxy → Azure OpenAI (Foundry, o4-mini-aw)

Overall: PASS

Thanks @Copilot and @lpcox!

🔑 BYOK (AOAI api-key) report filed by Smoke Copilot BYOK AOAI (api-key)

Contributor

github-actions Bot commented Jun 5, 2026

Smoke Test Results

  • Redis PING: ❌ connection timed out (port 6379)
  • PostgreSQL pg_isready: ❌ no response (port 5432)
  • PostgreSQL SELECT 1: ❌ no response (port 5432)

host.docker.internal resolves to 172.17.0.1 but both service ports are unreachable (TCP timeout). Services may not be running or ports are firewalled from this runner context.

Overall: FAIL

🔌 Service connectivity validated by Smoke Services

Contributor

github-actions Bot commented Jun 5, 2026

🏗️ Build Test Suite Results

Ecosystem Project Build/Install Tests Status
Bun elysia 1/1 passed ✅ PASS
Bun hono 1/1 passed ✅ PASS
C++ fmt N/A ✅ PASS
C++ json N/A ✅ PASS
Deno oak N/A 1/1 passed ✅ PASS
Deno std N/A 1/1 passed ✅ PASS
.NET hello-world N/A ✅ PASS
.NET json-parse N/A ✅ PASS
Go color 1/1 passed ✅ PASS
Go env 1/1 passed ✅ PASS
Go uuid 1/1 passed ✅ PASS
Java gson 1/1 passed ✅ PASS
Java caffeine 1/1 passed ✅ PASS
Node.js clsx passed ✅ PASS
Node.js execa passed ✅ PASS
Node.js p-limit passed ✅ PASS
Rust fd 1/1 passed ✅ PASS
Rust zoxide 1/1 passed ✅ PASS

Overall: 8/8 ecosystems passed — ✅ PASS

Generated by Build Test Suite for issue #4319 · sonnet46 1.3M ·

Contributor

github-actions Bot commented Jun 5, 2026

Smoke Test Result: FAIL

Warning

Firewall blocked 1 domain

The following domain was blocked by the firewall during workflow execution:

  • localhost

To allow these domains, add them to the network.allowed list in your workflow frontmatter:

network:
  allowed:
    - defaults
    - "localhost"

See Network Configuration for more information.

💎 Faceted by Smoke Gemini

Development

Successfully merging this pull request may close these issues.

[Red-Team Benchmark] AWF Red-Team Benchmark — 2026-06-02 — SKIPPED

3 participants