feat(statistics): McNemar power + required-N (pre-registration for paired-binary A/B) by drewstone · Pull Request #256 · tangle-network/agent-eval

drewstone · 2026-06-19T17:41:03Z

Follow-up to #254 (which merged McNemar / paired risk-difference / Wilson / pass@k). That PR squash-merged before this commit landed on the branch, so the power calc didn't make it into main — re-homing it here.

Adds the paired-binary power pair, the analogue of requiredSampleSize/pairedMde (which only cover continuous effects):

mcnemarRequiredN({p10, p01, alpha?, power?}) — pairs needed for a target power, parametrised by the expected discordant-cell probabilities.
mcnemarPower({p10, p01, nPairs, alpha?}) — power at a given n (the inverse).

Lachin (1992) asymptotic normal model. Monte-Carlo validated against the exact mcnemar test: at the recommended N, empirical power lands just under the asymptotic target (≈0.76 vs 0.80) — the expected exact-vs-asymptotic conservatism, documented as "treat N as a lower bound"; null effect rejects at ≤alpha. 7 tests; stats suite green; tsc + biome clean.

…ired-binary A/B) Completes the paired-binary family: mcnemarRequiredN (pairs needed for a target power, parametrised by the expected discordant-cell probabilities p10/p01) and mcnemarPower (power at a given n) — the paired-binary analogue of requiredSampleSize/pairedMde, which only cover continuous effects. Lachin (1992) asymptotic normal model. Monte-Carlo validated against the exact mcnemar test: at the recommended N, empirical power lands just under the asymptotic target (0.76 vs 0.80) — the expected exact-vs-asymptotic conservatism, documented as 'treat N as a lower bound'; null effect rejects at <=alpha. 7 tests; full stats suite 93 green.

tangletools

✅ Auto-approved PR — `f95edfbe`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-19T17:41:10Z}

tangletools

🟢 Value Audit — sound


Verdict	sound
Concerns	0 (none)
Heuristic	0.0s
Duplication	0.0s
Interrogation	56.8s (2 bridge agents)
Total	56.8s

💰 Value — sound

Adds Lachin-asymptotic McNemar power + required-N functions (the paired-binary counterparts to existing continuous-power primitives), well-integrated into the statistics module with 7 passing tests — clean, coherent, no better approach exists.

What it does: Adds mcnemarRequiredN(p10, p01, alpha?, power?) and mcnemarPower(p10, p01, nPairs, alpha?) to src/statistics.ts — Lachin (1992) asymptotic power analysis and sample-size planning for McNemar's paired-binary test. mcnemarRequiredN returns the number of paired observations needed for a target power given expected discordant-cell probabilities; mcnemarPower returns the achieved power at a g
Goals it achieves: Closes the paired-binary gap in the power-analysis family. The existing requiredSampleSize/pairedMde only cover continuous effects (Cohen's d, paired deltas). McNemar test users — working with pass/fail paired outcomes in eval — had no way to plan sample size or compute achieved power. This adds the direct analogue, enabling pre-registration planning for paired-binary A/B tests (how many pairs
Assessment: Coherent and well-grounded. Both functions sit in the src/statistics.ts power-analysis section (line 663, directly after pairedMde), use the same zQuantile/normalCdf helpers, share the {alpha?, power?, twoSided?} options convention, follow the codebase's fail-loud pattern (throws on invalid p10/p01), and match the existing requiredSampleSize convention of returning Infinity for zero
Better / existing approach: none — this is the right approach. Searched src/ for any existing McNemar power/sample-size functions (mcnemar.*power, requiredN, nPairs, discordant.*power): found zero. The existing continuous-power primitives (requiredSampleSize for independent groups, pairedMde for paired continuous) operate on different parameter spaces (effect size in d-units vs discordant-cell probabilities). T
Model: opencode/deepseek/deepseek-v4-pro
Bridge attempts: 1

🎯 Usefulness — sound

Completes the paired-binary power-analysis pair (mcnemarRequiredN / mcnemarPower) following the existing requiredSampleSize / pairedMde pattern for continuous tests — the natural pre-registration companion to the already-shipped mcnemar test.

Integration: Wire-up is complete: both functions are exported from the barrel file (src/index.ts:363-364) and re-exported from the statistics module (src/statistics.ts:723,753). No in-repo caller exists yet, but mcnemar itself also has zero in-repo callers — this is substrate (src/statistics.ts:981), expected to be consumed by agent-runtime and agent-knowledge. Same pattern as requiredSampleSize / `pairedM
Fit with existing patterns: Precisely follows the established power-analysis pattern. The continuous-test pair (requiredSampleSize for two-sample, pairedMde for paired) lives at src/statistics.ts:671-707; the new functions fill the paired-binary gap at src/statistics.ts:723-773. Same opts shape (optional alpha/power/twoSided defaults), same Infinity/alpha semantics for zero-effect, same Math.ceil return for requiredN, sa
Real-world viability: Edge cases handled: impossible discordant probabilities throw (p10+p01 > 1, negative values), zero effect → Infinity (requiredN) or alpha (power), nPairs ≤ 0 → alpha. The Math.max(0, ...) guard prevents sqrt of negative in the Lachin denominator for floating-point edge cases. Monte-Carlo validated against exact McNemar (per PR body and test suite at tests/statistics.test.ts:711-748). The asympto
Model: opencode/deepseek/deepseek-v4-pro
Bridge attempts: 1

No concerns — sound change, no better or existing approach found. ✅

What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

Pass	What it asks
Heuristic	Vague title? Whitespace-only or cruft-bearing diff? (content signals only)
Duplication	Do added function/class names already exist elsewhere in the repo?
Value Audit	What does it do? What goal does it achieve? Is it good? Better architecture or already-exists?
Usefulness Audit	Does it integrate and fit? Will it hold up in real use and actually get used?

Findings are concerns, not blocks — the human reviewer decides what to do with them.

_{value-audit · 20260619T174433Z}

tangletools · 2026-06-19T17:48:42Z

✅ No Blockers — `f95edfbe`

Readiness 83/100 · Confidence 75/100 · 6 findings (6 low)

	deepseek	glm	aggregate
Readiness	89	83	83
Confidence	75	75	75
Correctness	89	83	83
Security	89	83	83
Testing	89	83	83
Architecture	89	83	83

Full multi-shot audit completed 3/3 planned shots over 3 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 3/3 planned shots over 3 changed files. Global verifier still owns final merge decision.

🟡 LOW Inconsistent error type for input validation — src/statistics.ts

Lines 732, 762: mcnemarRequiredN and mcnemarPower throw new Error(...) for invalid inputs (p10+p01 > 1, negative probs). The file's newer validation convention uses ValidationError (from ./errors) — see pairedTTest, wilcoxonSignedRank, corpusInterRaterAgreement, eProcess. The parent mcnemar (line 986) also uses Error, so this is consistent within the paired-binary family, but a caller doing instanceof ValidationError checks would miss these. Consider switching to ValidationError for consistency with the file's prevailing convention, or accept as-is s

🟡 LOW NaN inputs slip past range validation (no finiteness guard) — src/statistics.ts

Sibling requiredSampleSize guards with Number.isFinite (src/statistics.ts:678), but mcnemarRequiredN (:731-733) and mcnemarPower (:761-763) only check p10 < 0 || p01 < 0 || p10 + p01 > 1. Since NaN < 0 and NaN > 1 both evaluate to false, a NaN p10/p01 (or NaN alpha/power) passes validation and propagates as a NaN return rather than throwing — inconsistent with the 'fail loud, no silent NaN' doctrine in CLAUDE.md. Impact is low: callers feeding NaN already have a bug, and NaN propagates visibly. Fix: add if (!Number.isFinite(p10) || !Number.isFinite(p01)) throw new Error(...) before the range check in both functions; optionally validate alpha/power ∈ (0,1) too.

🟡 LOW Optional alpha / twoSided:false branches untested — tests/statistics.test.ts

Both mcnemarRequiredN and mcnemarPower accept alpha and twoSided options (src/statistics.ts:723-724, 753), but every test case uses the defaults (alpha=0.05, twoSided=true). A bug in the one-sided zAlpha path (zQuantile(1-alpha) vs 1-alpha/2) or in honoring a custom alpha would not be caught. Fix: add one assertion exercising twoSided:false (one-sided should require fewer pairs / yield higher power) and one with alpha=0.01.

🟡 LOW Variable naming is inverted relative to the values held — tests/statistics.test.ts

Line 718-720: const big = mcnemarRequiredN({ p10: 0.4, p01: 0.05 }) stores N for the big-effect case (≈19 pairs). const small = mcnemarRequiredN({ p10: 0.2, p01: 0.15 }) stores N for the small-effect case (≈1585 pairs). The names big/small match effect size but their values are the opposite (big holds small N, small holds big N). The assertion expect(small).toBeGreaterThan(big) is correct. Suggestion: rename to nBigEffect/nSmallEffect or nLarge/nSmall to match the N values directly.

🟡 LOW mcnemarPower validation asymmetric with requiredN — tests/statistics.test.ts

The throw test (line 745-748) covers p10+p01>1 only via mcnemarRequiredN and p10<0 only via mcnemarPower. The p01<0 and p10+p01>1 paths inside mcnemarPower, and p01<0 inside mcnemarRequiredN, share the same guard string but are not directly exercised. Minor; the regex assertion would catch a removed guard for the one case tested.

🟡 LOW nPairs<=0 early-return branch untested — tests/statistics.test.ts

mcnemarPower returns alpha when nPairs<=0 (src/statistics.ts), but no test covers it. A regression that threw or returned 0 on degenerate input would pass silently. Fix: add expect(mcnemarPower({p10:0.25,p01:0.05,nPairs:0})).toBe(0.05) or assert it matches alpha.

_{tangletools · 2026-06-19T17:48:40Z · trace}

…hardening (#258) Version-only bump of the trio (package.json + python pyproject + __init__). Ships everything merged since 0.93.0: - #253 data-plane reliability/concurrency/scale hardening - #254 paired-binary + coding-eval estimators (McNemar, risk-diff, Wilson, pass@k) - #255 trace-store append/load race + collision-free rollover - #256 McNemar power + required-N - #257 single source of truth for span-timestamp parsing

tangletools approved these changes Jun 19, 2026

View reviewed changes

tangletools reviewed Jun 19, 2026

View reviewed changes

drewstone merged commit 45e7cbf into main Jun 19, 2026
1 check passed

drewstone mentioned this pull request Jun 19, 2026

chore(release): 0.94.0 #258

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(statistics): McNemar power + required-N (pre-registration for paired-binary A/B)#256

feat(statistics): McNemar power + required-N (pre-registration for paired-binary A/B)#256
drewstone merged 1 commit into
mainfrom
feat/mcnemar-power

drewstone commented Jun 19, 2026

Uh oh!

tangletools left a comment

Uh oh!

tangletools left a comment

Uh oh!

tangletools commented Jun 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

drewstone commented Jun 19, 2026

Uh oh!

tangletools left a comment

Choose a reason for hiding this comment

✅ Auto-approved PR — f95edfbe

Uh oh!

tangletools left a comment

Choose a reason for hiding this comment

🟢 Value Audit — sound

💰 Value — sound

🎯 Usefulness — sound

Uh oh!

tangletools commented Jun 19, 2026

✅ No Blockers — f95edfbe

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

✅ Auto-approved PR — `f95edfbe`

✅ No Blockers — `f95edfbe`