Skip to content

Add /run-evals comment command#1948

Open
adibarra wants to merge 2 commits into
mainfrom
feat/run-evals-command
Open

Add /run-evals comment command#1948
adibarra wants to merge 2 commits into
mainfrom
feat/run-evals-command

Conversation

@adibarra

Copy link
Copy Markdown
Collaborator

Adds the workflow-side plumbing so a PR comment can launch an eval-only run of a single recipe (no perf sweep):
/run-evals [conc] [master-config]

  • run-evals.yml: new issue_comment command (mirrors pr-comment-sweep.yml auth/ SHA-pin/reply); maps -> framework+task, infers nvidia/amd master config from the config-key HW token, builds 'test-config ... --evals-only', calls e2e-tests.yml with eval-framework/eval-task.
  • e2e-tests.yml: new eval-framework/eval-task inputs, threaded into the eval jobs.
  • benchmark-tmpl.yml / benchmark-multinode-tmpl.yml: matching inputs -> env (EVAL_FRAMEWORK / EVAL_TASKS_DIR).

Inert by default (eval-framework defaults to lm-eval). The framework-dispatch code (run_eval EVAL_FRAMEWORK override + run_swebench_eval + swebench task/scorer) lives with the swebench PR and is checked out from the commented PR's head at runtime, so this can merge to main independently.

…orkflow plumbing)

Adds the workflow-side plumbing so a PR comment can launch an eval-only run of a
single recipe (no perf sweep):
  /run-evals <eval> <config-key> [conc] [master-config]

- run-evals.yml: new issue_comment command (mirrors pr-comment-sweep.yml auth/
  SHA-pin/reply); maps <eval> -> framework+task, infers nvidia/amd master config
  from the config-key HW token, builds 'test-config ... --evals-only', calls
  e2e-tests.yml with eval-framework/eval-task.
- e2e-tests.yml: new eval-framework/eval-task inputs, threaded into the eval jobs.
- benchmark-tmpl.yml / benchmark-multinode-tmpl.yml: matching inputs -> env
  (EVAL_FRAMEWORK / EVAL_TASKS_DIR).

Inert by default (eval-framework defaults to lm-eval). The framework-dispatch
code (run_eval EVAL_FRAMEWORK override + run_swebench_eval + swebench task/scorer)
lives with the swebench PR and is checked out from the commented PR's head at
runtime, so this can merge to main independently.
@adibarra adibarra requested a review from a team June 27, 2026 02:51
@functionstackx

Copy link
Copy Markdown
Collaborator

@adibarra isnt there eval-only tag that does this?

Comment on lines +24 to +28
eval-framework:
description: "Eval framework (lm-eval | swebench). Overrides the recipe default."
required: false
type: string
default: "lm-eval"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The new eval-framework input is declared inconsistently across the three workflows: e2e-tests.yml (lines 24-28 and 53-57) uses default: "lm-eval" with description "Overrides the recipe default", while benchmark-tmpl.yml (lines 62-66) and benchmark-multinode-tmpl.yml (lines 94-98) use default: "" with description "Empty = recipe default." Since e2e-tests.yml unconditionally forwards inputs.eval-framework to the templates, callers that leave the field at default always send lm-eval downstream, so the templates' "Empty = recipe default" branch is unreachable through this entry point. Functionally harmless (consumer benchmark_lib.sh:974 does ${EVAL_FRAMEWORK:-lm-eval} so both empty and lm-eval resolve identically) — worth aligning the description and default across the three files for clarity.

Extended reasoning...

What the inconsistency is

The new eval-framework input is declared in three workflow files with two different defaults and two different descriptions:

File Default Description
.github/workflows/e2e-tests.yml (lines 24-28, 53-57) "lm-eval" "Eval framework (lm-eval | swebench). Overrides the recipe default."
.github/workflows/benchmark-tmpl.yml (lines 62-66) "" "Eval framework (lm-eval | swebench). Empty = recipe default."
.github/workflows/benchmark-multinode-tmpl.yml (lines 94-98) "" "Eval framework (lm-eval | swebench). Empty = recipe default."

What the consumer actually does

There is no per-recipe default for the eval framework. In benchmarks/benchmark_lib.sh:974:

local framework="${EVAL_FRAMEWORK:-lm-eval}"

So "recipe default" is really just a hard-coded lm-eval fallback. Neither description quite matches that: e2e-tests.yml suggests there is a recipe default being overridden, and the templates suggest empty resolves to a per-recipe default — but in reality empty and lm-eval resolve to the same thing.

Reachability of the templates' empty branch

Because e2e-tests.yml unconditionally forwards inputs.eval-framework to both templates (lines 183 and 319):

eval-framework: ${{ inputs.eval-framework }}

any caller that uses the e2e-tests.yml defaults sends lm-eval (not "") downstream. The "Empty = recipe default" code path documented on the template is therefore unreachable via e2e-tests.yml unless a user explicitly passes eval-framework="" in workflow_dispatch. The new run-evals.yml similarly only ever forwards lm-eval or swebench, never empty.

Addressing the refutation

One reviewer argued this is intentional layered cascading: the templates expose a "use my default" knob via empty, and e2e-tests.yml is one layer above that always supplies a concrete value. That framing is defensible, but two things weaken it: (1) the templates' description mentions a "recipe default" that does not actually exist in the consumer — there is no per-recipe configuration of the eval framework anywhere — and (2) since lm-eval and "" resolve to the same downstream value, the layered cascade has no observable effect. The remaining issue is purely documentation: a reader looking at the templates would reasonably expect an empty value to dispatch to a per-recipe default, when in fact it just falls through to the same hardcoded lm-eval.

Impact

None at runtime. This is documentation/UX clarity only. The behavior is correct in all current call paths.

Suggested fix

Pick one of:

  1. Make e2e-tests.yml use default: "" and the same "Empty = recipe default" description as the templates, so callers that omit the input naturally exercise the template's empty branch. Then update the description on all three to reflect the actual behavior, e.g. "Empty = lm-eval".
  2. Or, keep e2e-tests.yml as default: "lm-eval" and change the templates' description to "Empty = lm-eval (the default framework)", and acknowledge that nothing currently sends empty through this path.

Either way, aligning the three files removes the contradiction in what the field means.

Step-by-step proof

  1. Trigger run-evals.yml via /run-evals gsm8k <config-key> (or run e2e-tests.yml directly via workflow_dispatch leaving the eval-framework field at default).
  2. run-evals.yml sets eval-framework=lm-eval (from the case mapping) or e2e-tests.yml falls back to its declared default "lm-eval".
  3. e2e-tests.yml job test-sweep-evals forwards eval-framework: ${{ inputs.eval-framework }}"lm-eval" to benchmark-tmpl.yml.
  4. benchmark-tmpl.yml sets EVAL_FRAMEWORK: "lm-eval" in the job env.
  5. benchmark_lib.sh:974 evaluates ${EVAL_FRAMEWORK:-lm-eval}"lm-eval".
  6. The templates' empty default ("") is never observed; the "Empty = recipe default" docstring describes a code path that no caller exercises.

Comment on lines +138 to +141
- name: Reply with run link
if: ${{ github.event_name == 'issue_comment' && startsWith(github.event.comment.body, '/run-evals') && github.repository_owner == 'SemiAnalysisAI' }}
uses: actions/github-script@3a2844b7e9c422d3c10d287c895573f7108da1b3 # v9.0.0
continue-on-error: true

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 On parse failures, the 'Reply with run link' step is skipped (its if: lacks always()), so legitimate users who typo an eval name, forget the config-key, or use an unrecognized hardware token get no PR comment back and must dig into Actions logs to find out why /run-evals did nothing. Severity: nit (UX-only). Consider if: ${{ always() && ... }} plus a branch on steps.parse.outcome so failures still produce a reply explaining the error.

Extended reasoning...

What the bug is. The 'Reply with run link' step at .github/workflows/run-evals.yml:138-141 uses an if: expression with no status function (success()/failure()/always()):\n\nyaml\nif: ${{ github.event_name == 'issue_comment' && startsWith(github.event.comment.body, '/run-evals') && github.repository_owner == 'SemiAnalysisAI' }}\n\n\nPer GitHub Actions semantics, when an if: omits any status function it gets an implicit success() && prepended. So this step only runs when every previous step in the job succeeded.\n\nThe triggering code path. The 'Parse PR comment' step runs under set -euo pipefail and explicitly exit 1s on five validation paths:\n\n1. no /run-evals line at comment start;\n2. missing eval_name or config_key;\n3. eval_name not in the case (gsm8k / gpqa / swebench_lite);\n4. conc not matching ^[1-9][0-9]*$;\n5. no recognized hardware token in config_key and no 4th-arg master override.\n\nAny of these flips the parse step to failure, and the implicit-success guard then skips the reply step. continue-on-error: true on the reply step doesn't help — that flag controls outbound failure propagation, not whether the step runs after an upstream failure.\n\nWhy it's user-visible here. /run-evals has substantially stricter parsing than the /sweep workflow it mirrors. /sweep effectively forwards the raw args, so its parse step almost never fails. /run-evals has 5 distinct exit-1 paths, all of which are realistic user mistakes:\n\n- /run-evals gsm8k (forgot config-key) → exits 1 with usage: to stderr.\n- /run-evals gsmK dsr1-fp4-b200-sglang (typo) → exits 1 with 'unknown eval'.\n- /run-evals gsm8k my-custom-config (no recognized hw token, no 4th-arg) → exits 1 with 'cannot infer platform'.\n\nIn every case the user sees nothing happen on the PR. They have to notice the failed run in the Actions tab and open the log to see what they got wrong.\n\nStep-by-step proof. Suppose a user comments /run-evals gsmK dsr1-fp4-b200-sglang 16:\n1. get-jobs job starts (top-level if passes — comment starts with /run-evals).\n2. 'Parse PR comment' executes; eval_name=gsmK falls through to the *) arm of the case, echoes 'unknown eval gsmK' to stderr, and exit 1s.\n3. The step's conclusion is 'failure'. Job status flips to failed for subsequent default-guard steps.\n4. 'Check author permissions' has its own if: without always() — skipped.\n5. 'Resolve immutable PR ref' — skipped.\n6. 'Reply with run link' — skipped (its if: reduces to implicit success() && (…) → false).\n7. approval and validate jobs are gated on needs.get-jobs.outputs.generator-args != '' / needs.get-jobs.result == 'success' and don't run.\n\nNet result on the PR: nothing. No comment, no failure indicator other than a red check in the Actions tab.\n\nAddressing the refutation. One refutation argued this mirrors pr-comment-sweep.yml's intentional pattern, so changing only /run-evals would diverge from the sibling. That's true in shape, but the asymmetry in validation surface makes the gap qualitatively different: pr-comment-sweep.yml's parse step has essentially no failure paths (the startsWith check already guards the job), whereas /run-evals introduces 5 realistic exit-1 paths. So while consistency is worth preserving for the auth/SHA-pin/reply skeleton, the strict parsing here introduces a new UX case that the sibling never has to handle. Flagging as a nit so the author can decide: add always() && + branch on steps.parse.outcome here only, or coordinate a matching update to pr-comment-sweep.yml. Either is consistent with the mirror intent — silently dropping user feedback is not.\n\nHow to fix. Add always() && to the reply step's if: and branch the message on the parse step's outcome, e.g.:\n\nyaml\n- name: Reply with run link\n if: ${{ always() && github.event_name == 'issue_comment' && startsWith(github.event.comment.body, '/run-evals') && github.repository_owner == 'SemiAnalysisAI' }}\n …\n script: |\n const parseOutcome = '${{ steps.parse.outcome }}';\n const body = parseOutcome === 'success'\n ? `@${author} Kicking off …`\n : `@${author} /run-evals failed to parse — see ${runUrl} for details.`;\n await github.rest.issues.createComment({ owner, repo, issue_number, body });\n\n\nLow-risk and keeps the success path identical.

Comment thread .github/workflows/run-evals.yml Outdated
Comment on lines +58 to +68
echo "valid evals: gsm8k | gpqa | swebench_lite" >&2
exit 1
fi

# Map <eval> -> (framework, task YAML).
case "$eval_name" in
gsm8k) framework="lm-eval"; task="utils/evals/gsm8k.yaml" ;;
gpqa|gpqa_diamond) framework="lm-eval"; task="utils/evals/gpqa_diamond.yaml" ;;
swebench|swebench_lite) framework="swebench"; task="utils/evals/swebench_lite.yaml" ;;
*)
echo "unknown eval '$eval_name' (valid: gsm8k, gpqa, swebench_lite)" >&2

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The usage error (line 58: gsm8k | gpqa | swebench_lite), the unknown-eval error (line 68: gsm8k, gpqa, swebench_lite), and the file-header comment (line 7: alias: swebench) all under-document the aliases the case statement actually accepts at lines 64-66: gpqa_diamond and swebench. Functional behavior is fine — /run-evals gpqa_diamond ... and /run-evals swebench ... both work — but users reading the help text won't discover these forms. Consider listing both names in the two error messages and adding gpqa_diamond alongside swebench in the header comment.

Extended reasoning...

The inconsistency

.github/workflows/run-evals.yml accepts five eval names in the case statement at lines 64-66:

case "$eval_name" in
  gsm8k)                  framework="lm-eval";  task="utils/evals/gsm8k.yaml" ;;
  gpqa|gpqa_diamond)      framework="lm-eval";  task="utils/evals/gpqa_diamond.yaml" ;;
  swebench|swebench_lite) framework="swebench"; task="utils/evals/swebench_lite.yaml" ;;

But three pieces of help text describe a smaller set:

  • Line 7 (header comment): <eval> is one of: gsm8k | gpqa | swebench_lite (alias: swebench) — documents the swebench alias but not gpqa_diamond.
  • Line 58 (usage error when args are missing): valid evals: gsm8k | gpqa | swebench_lite — omits both aliases.
  • Line 68 (unknown-eval error in the case fallback): unknown eval '...' (valid: gsm8k, gpqa, swebench_lite) — omits both aliases.

Step-by-step proof

  1. A user reads the header comment to learn the command syntax. They see gsm8k | gpqa | swebench_lite (alias: swebench) and conclude there are three eval names plus one alias.
  2. They run /run-evals gpqa_diamond dsr1-fp4-b200-sglang 16. The case statement at line 65 matches via gpqa|gpqa_diamond, sets framework=lm-eval and task=utils/evals/gpqa_diamond.yaml, and the run proceeds normally.
  3. So the alias works — but the user only discovered it by reading the source, since none of the three pieces of help text mentioned it.
  4. Symmetrically, a different user who runs /run-evals (no args) hits the usage error at line 58 and sees valid evals: gsm8k | gpqa | swebench_lite — no hint that swebench (alias) is also accepted.

Impact

Pure discoverability: the aliases are reachable only by reading the source. No functional bug — all five accepted forms work correctly. Worth a one-line cleanup since this file is brand new and the inconsistency will only grow as more evals/aliases are added.

Suggested fix

Update all three locations to list the same set, e.g. gsm8k | gpqa[_diamond] | swebench[_lite] (or spell them out fully). Keeping the case-statement order and the help-text order in sync prevents the next alias from drifting out of the docs again.

@adibarra

Copy link
Copy Markdown
Collaborator Author

Going for a slightly different functionality here compared to the existing one

- run-evals.yml: 'Reply with run link' now uses always() + branches on the parse
  step outcome, so a bad eval name / missing config-key / unrecognized HW token
  gets a helpful PR reply instead of silently doing nothing.
- run-evals.yml: document the gpqa_diamond/swebench aliases in the header + both
  error messages (were only in the case statement).
- e2e-tests/benchmark-tmpl/benchmark-multinode-tmpl: align the eval-framework
  input description across all three (drop the misleading 'recipe default' — the
  consumer falls back to a hardcoded lm-eval).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

2 participants