Add /run-evals comment command#1948
Conversation
…orkflow plumbing) Adds the workflow-side plumbing so a PR comment can launch an eval-only run of a single recipe (no perf sweep): /run-evals <eval> <config-key> [conc] [master-config] - run-evals.yml: new issue_comment command (mirrors pr-comment-sweep.yml auth/ SHA-pin/reply); maps <eval> -> framework+task, infers nvidia/amd master config from the config-key HW token, builds 'test-config ... --evals-only', calls e2e-tests.yml with eval-framework/eval-task. - e2e-tests.yml: new eval-framework/eval-task inputs, threaded into the eval jobs. - benchmark-tmpl.yml / benchmark-multinode-tmpl.yml: matching inputs -> env (EVAL_FRAMEWORK / EVAL_TASKS_DIR). Inert by default (eval-framework defaults to lm-eval). The framework-dispatch code (run_eval EVAL_FRAMEWORK override + run_swebench_eval + swebench task/scorer) lives with the swebench PR and is checked out from the commented PR's head at runtime, so this can merge to main independently.
|
@adibarra isnt there eval-only tag that does this? |
| eval-framework: | ||
| description: "Eval framework (lm-eval | swebench). Overrides the recipe default." | ||
| required: false | ||
| type: string | ||
| default: "lm-eval" |
There was a problem hiding this comment.
🟡 The new eval-framework input is declared inconsistently across the three workflows: e2e-tests.yml (lines 24-28 and 53-57) uses default: "lm-eval" with description "Overrides the recipe default", while benchmark-tmpl.yml (lines 62-66) and benchmark-multinode-tmpl.yml (lines 94-98) use default: "" with description "Empty = recipe default." Since e2e-tests.yml unconditionally forwards inputs.eval-framework to the templates, callers that leave the field at default always send lm-eval downstream, so the templates' "Empty = recipe default" branch is unreachable through this entry point. Functionally harmless (consumer benchmark_lib.sh:974 does ${EVAL_FRAMEWORK:-lm-eval} so both empty and lm-eval resolve identically) — worth aligning the description and default across the three files for clarity.
Extended reasoning...
What the inconsistency is
The new eval-framework input is declared in three workflow files with two different defaults and two different descriptions:
| File | Default | Description |
|---|---|---|
.github/workflows/e2e-tests.yml (lines 24-28, 53-57) |
"lm-eval" |
"Eval framework (lm-eval | swebench). Overrides the recipe default." |
.github/workflows/benchmark-tmpl.yml (lines 62-66) |
"" |
"Eval framework (lm-eval | swebench). Empty = recipe default." |
.github/workflows/benchmark-multinode-tmpl.yml (lines 94-98) |
"" |
"Eval framework (lm-eval | swebench). Empty = recipe default." |
What the consumer actually does
There is no per-recipe default for the eval framework. In benchmarks/benchmark_lib.sh:974:
local framework="${EVAL_FRAMEWORK:-lm-eval}"So "recipe default" is really just a hard-coded lm-eval fallback. Neither description quite matches that: e2e-tests.yml suggests there is a recipe default being overridden, and the templates suggest empty resolves to a per-recipe default — but in reality empty and lm-eval resolve to the same thing.
Reachability of the templates' empty branch
Because e2e-tests.yml unconditionally forwards inputs.eval-framework to both templates (lines 183 and 319):
eval-framework: ${{ inputs.eval-framework }}any caller that uses the e2e-tests.yml defaults sends lm-eval (not "") downstream. The "Empty = recipe default" code path documented on the template is therefore unreachable via e2e-tests.yml unless a user explicitly passes eval-framework="" in workflow_dispatch. The new run-evals.yml similarly only ever forwards lm-eval or swebench, never empty.
Addressing the refutation
One reviewer argued this is intentional layered cascading: the templates expose a "use my default" knob via empty, and e2e-tests.yml is one layer above that always supplies a concrete value. That framing is defensible, but two things weaken it: (1) the templates' description mentions a "recipe default" that does not actually exist in the consumer — there is no per-recipe configuration of the eval framework anywhere — and (2) since lm-eval and "" resolve to the same downstream value, the layered cascade has no observable effect. The remaining issue is purely documentation: a reader looking at the templates would reasonably expect an empty value to dispatch to a per-recipe default, when in fact it just falls through to the same hardcoded lm-eval.
Impact
None at runtime. This is documentation/UX clarity only. The behavior is correct in all current call paths.
Suggested fix
Pick one of:
- Make
e2e-tests.ymlusedefault: ""and the same "Empty = recipe default" description as the templates, so callers that omit the input naturally exercise the template's empty branch. Then update the description on all three to reflect the actual behavior, e.g. "Empty = lm-eval". - Or, keep
e2e-tests.ymlasdefault: "lm-eval"and change the templates' description to "Empty = lm-eval (the default framework)", and acknowledge that nothing currently sends empty through this path.
Either way, aligning the three files removes the contradiction in what the field means.
Step-by-step proof
- Trigger
run-evals.ymlvia/run-evals gsm8k <config-key>(or rune2e-tests.ymldirectly via workflow_dispatch leaving the eval-framework field at default). run-evals.ymlsetseval-framework=lm-eval(from the case mapping) ore2e-tests.ymlfalls back to its declared default"lm-eval".e2e-tests.ymljobtest-sweep-evalsforwardseval-framework: ${{ inputs.eval-framework }}→"lm-eval"tobenchmark-tmpl.yml.benchmark-tmpl.ymlsetsEVAL_FRAMEWORK: "lm-eval"in the job env.benchmark_lib.sh:974evaluates${EVAL_FRAMEWORK:-lm-eval}→"lm-eval".- The templates' empty default (
"") is never observed; the "Empty = recipe default" docstring describes a code path that no caller exercises.
| - name: Reply with run link | ||
| if: ${{ github.event_name == 'issue_comment' && startsWith(github.event.comment.body, '/run-evals') && github.repository_owner == 'SemiAnalysisAI' }} | ||
| uses: actions/github-script@3a2844b7e9c422d3c10d287c895573f7108da1b3 # v9.0.0 | ||
| continue-on-error: true |
There was a problem hiding this comment.
🟡 On parse failures, the 'Reply with run link' step is skipped (its if: lacks always()), so legitimate users who typo an eval name, forget the config-key, or use an unrecognized hardware token get no PR comment back and must dig into Actions logs to find out why /run-evals did nothing. Severity: nit (UX-only). Consider if: ${{ always() && ... }} plus a branch on steps.parse.outcome so failures still produce a reply explaining the error.
Extended reasoning...
What the bug is. The 'Reply with run link' step at .github/workflows/run-evals.yml:138-141 uses an if: expression with no status function (success()/failure()/always()):\n\nyaml\nif: ${{ github.event_name == 'issue_comment' && startsWith(github.event.comment.body, '/run-evals') && github.repository_owner == 'SemiAnalysisAI' }}\n\n\nPer GitHub Actions semantics, when an if: omits any status function it gets an implicit success() && prepended. So this step only runs when every previous step in the job succeeded.\n\nThe triggering code path. The 'Parse PR comment' step runs under set -euo pipefail and explicitly exit 1s on five validation paths:\n\n1. no /run-evals line at comment start;\n2. missing eval_name or config_key;\n3. eval_name not in the case (gsm8k / gpqa / swebench_lite);\n4. conc not matching ^[1-9][0-9]*$;\n5. no recognized hardware token in config_key and no 4th-arg master override.\n\nAny of these flips the parse step to failure, and the implicit-success guard then skips the reply step. continue-on-error: true on the reply step doesn't help — that flag controls outbound failure propagation, not whether the step runs after an upstream failure.\n\nWhy it's user-visible here. /run-evals has substantially stricter parsing than the /sweep workflow it mirrors. /sweep effectively forwards the raw args, so its parse step almost never fails. /run-evals has 5 distinct exit-1 paths, all of which are realistic user mistakes:\n\n- /run-evals gsm8k (forgot config-key) → exits 1 with usage: to stderr.\n- /run-evals gsmK dsr1-fp4-b200-sglang (typo) → exits 1 with 'unknown eval'.\n- /run-evals gsm8k my-custom-config (no recognized hw token, no 4th-arg) → exits 1 with 'cannot infer platform'.\n\nIn every case the user sees nothing happen on the PR. They have to notice the failed run in the Actions tab and open the log to see what they got wrong.\n\nStep-by-step proof. Suppose a user comments /run-evals gsmK dsr1-fp4-b200-sglang 16:\n1. get-jobs job starts (top-level if passes — comment starts with /run-evals).\n2. 'Parse PR comment' executes; eval_name=gsmK falls through to the *) arm of the case, echoes 'unknown eval gsmK' to stderr, and exit 1s.\n3. The step's conclusion is 'failure'. Job status flips to failed for subsequent default-guard steps.\n4. 'Check author permissions' has its own if: without always() — skipped.\n5. 'Resolve immutable PR ref' — skipped.\n6. 'Reply with run link' — skipped (its if: reduces to implicit success() && (…) → false).\n7. approval and validate jobs are gated on needs.get-jobs.outputs.generator-args != '' / needs.get-jobs.result == 'success' and don't run.\n\nNet result on the PR: nothing. No comment, no failure indicator other than a red check in the Actions tab.\n\nAddressing the refutation. One refutation argued this mirrors pr-comment-sweep.yml's intentional pattern, so changing only /run-evals would diverge from the sibling. That's true in shape, but the asymmetry in validation surface makes the gap qualitatively different: pr-comment-sweep.yml's parse step has essentially no failure paths (the startsWith check already guards the job), whereas /run-evals introduces 5 realistic exit-1 paths. So while consistency is worth preserving for the auth/SHA-pin/reply skeleton, the strict parsing here introduces a new UX case that the sibling never has to handle. Flagging as a nit so the author can decide: add always() && + branch on steps.parse.outcome here only, or coordinate a matching update to pr-comment-sweep.yml. Either is consistent with the mirror intent — silently dropping user feedback is not.\n\nHow to fix. Add always() && to the reply step's if: and branch the message on the parse step's outcome, e.g.:\n\nyaml\n- name: Reply with run link\n if: ${{ always() && github.event_name == 'issue_comment' && startsWith(github.event.comment.body, '/run-evals') && github.repository_owner == 'SemiAnalysisAI' }}\n …\n script: |\n const parseOutcome = '${{ steps.parse.outcome }}';\n const body = parseOutcome === 'success'\n ? `@${author} Kicking off …`\n : `@${author} /run-evals failed to parse — see ${runUrl} for details.`;\n await github.rest.issues.createComment({ owner, repo, issue_number, body });\n\n\nLow-risk and keeps the success path identical.
| echo "valid evals: gsm8k | gpqa | swebench_lite" >&2 | ||
| exit 1 | ||
| fi | ||
|
|
||
| # Map <eval> -> (framework, task YAML). | ||
| case "$eval_name" in | ||
| gsm8k) framework="lm-eval"; task="utils/evals/gsm8k.yaml" ;; | ||
| gpqa|gpqa_diamond) framework="lm-eval"; task="utils/evals/gpqa_diamond.yaml" ;; | ||
| swebench|swebench_lite) framework="swebench"; task="utils/evals/swebench_lite.yaml" ;; | ||
| *) | ||
| echo "unknown eval '$eval_name' (valid: gsm8k, gpqa, swebench_lite)" >&2 |
There was a problem hiding this comment.
🟡 The usage error (line 58: gsm8k | gpqa | swebench_lite), the unknown-eval error (line 68: gsm8k, gpqa, swebench_lite), and the file-header comment (line 7: alias: swebench) all under-document the aliases the case statement actually accepts at lines 64-66: gpqa_diamond and swebench. Functional behavior is fine — /run-evals gpqa_diamond ... and /run-evals swebench ... both work — but users reading the help text won't discover these forms. Consider listing both names in the two error messages and adding gpqa_diamond alongside swebench in the header comment.
Extended reasoning...
The inconsistency
.github/workflows/run-evals.yml accepts five eval names in the case statement at lines 64-66:
case "$eval_name" in
gsm8k) framework="lm-eval"; task="utils/evals/gsm8k.yaml" ;;
gpqa|gpqa_diamond) framework="lm-eval"; task="utils/evals/gpqa_diamond.yaml" ;;
swebench|swebench_lite) framework="swebench"; task="utils/evals/swebench_lite.yaml" ;;But three pieces of help text describe a smaller set:
- Line 7 (header comment):
<eval> is one of: gsm8k | gpqa | swebench_lite (alias: swebench)— documents theswebenchalias but notgpqa_diamond. - Line 58 (usage error when args are missing):
valid evals: gsm8k | gpqa | swebench_lite— omits both aliases. - Line 68 (unknown-eval error in the case fallback):
unknown eval '...' (valid: gsm8k, gpqa, swebench_lite)— omits both aliases.
Step-by-step proof
- A user reads the header comment to learn the command syntax. They see
gsm8k | gpqa | swebench_lite (alias: swebench)and conclude there are three eval names plus one alias. - They run
/run-evals gpqa_diamond dsr1-fp4-b200-sglang 16. The case statement at line 65 matches viagpqa|gpqa_diamond, setsframework=lm-evalandtask=utils/evals/gpqa_diamond.yaml, and the run proceeds normally. - So the alias works — but the user only discovered it by reading the source, since none of the three pieces of help text mentioned it.
- Symmetrically, a different user who runs
/run-evals(no args) hits the usage error at line 58 and seesvalid evals: gsm8k | gpqa | swebench_lite— no hint thatswebench(alias) is also accepted.
Impact
Pure discoverability: the aliases are reachable only by reading the source. No functional bug — all five accepted forms work correctly. Worth a one-line cleanup since this file is brand new and the inconsistency will only grow as more evals/aliases are added.
Suggested fix
Update all three locations to list the same set, e.g. gsm8k | gpqa[_diamond] | swebench[_lite] (or spell them out fully). Keeping the case-statement order and the help-text order in sync prevents the next alias from drifting out of the docs again.
|
Going for a slightly different functionality here compared to the existing one |
- run-evals.yml: 'Reply with run link' now uses always() + branches on the parse step outcome, so a bad eval name / missing config-key / unrecognized HW token gets a helpful PR reply instead of silently doing nothing. - run-evals.yml: document the gpqa_diamond/swebench aliases in the header + both error messages (were only in the case statement). - e2e-tests/benchmark-tmpl/benchmark-multinode-tmpl: align the eval-framework input description across all three (drop the misleading 'recipe default' — the consumer falls back to a hardcoded lm-eval).
Adds the workflow-side plumbing so a PR comment can launch an eval-only run of a single recipe (no perf sweep):
/run-evals [conc] [master-config]
Inert by default (eval-framework defaults to lm-eval). The framework-dispatch code (run_eval EVAL_FRAMEWORK override + run_swebench_eval + swebench task/scorer) lives with the swebench PR and is checked out from the commented PR's head at runtime, so this can merge to main independently.