feat(statistics): McNemar power + required-N (pre-registration for paired-binary A/B)#256
Conversation
…ired-binary A/B) Completes the paired-binary family: mcnemarRequiredN (pairs needed for a target power, parametrised by the expected discordant-cell probabilities p10/p01) and mcnemarPower (power at a given n) — the paired-binary analogue of requiredSampleSize/pairedMde, which only cover continuous effects. Lachin (1992) asymptotic normal model. Monte-Carlo validated against the exact mcnemar test: at the recommended N, empirical power lands just under the asymptotic target (0.76 vs 0.80) — the expected exact-vs-asymptotic conservatism, documented as 'treat N as a lower bound'; null effect rejects at <=alpha. 7 tests; full stats suite 93 green.
tangletools
left a comment
There was a problem hiding this comment.
✅ Auto-approved PR — f95edfbe
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-19T17:41:10Z
tangletools
left a comment
There was a problem hiding this comment.
🟢 Value Audit — sound
| Verdict | sound |
| Concerns | 0 (none) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 56.8s (2 bridge agents) |
| Total | 56.8s |
💰 Value — sound
Adds Lachin-asymptotic McNemar power + required-N functions (the paired-binary counterparts to existing continuous-power primitives), well-integrated into the statistics module with 7 passing tests — clean, coherent, no better approach exists.
- What it does: Adds
mcnemarRequiredN(p10, p01, alpha?, power?)andmcnemarPower(p10, p01, nPairs, alpha?)tosrc/statistics.ts— Lachin (1992) asymptotic power analysis and sample-size planning for McNemar's paired-binary test.mcnemarRequiredNreturns the number of paired observations needed for a target power given expected discordant-cell probabilities;mcnemarPowerreturns the achieved power at a g - Goals it achieves: Closes the paired-binary gap in the power-analysis family. The existing
requiredSampleSize/pairedMdeonly cover continuous effects (Cohen's d, paired deltas). McNemar test users — working with pass/fail paired outcomes in eval — had no way to plan sample size or compute achieved power. This adds the direct analogue, enabling pre-registration planning for paired-binary A/B tests (how many pairs - Assessment: Coherent and well-grounded. Both functions sit in the
src/statistics.tspower-analysis section (line 663, directly afterpairedMde), use the samezQuantile/normalCdfhelpers, share the{alpha?, power?, twoSided?}options convention, follow the codebase's fail-loud pattern (throws on invalid p10/p01), and match the existingrequiredSampleSizeconvention of returningInfinityfor zero - Better / existing approach: none — this is the right approach. Searched
src/for any existing McNemar power/sample-size functions (mcnemar.*power,requiredN,nPairs,discordant.*power): found zero. The existing continuous-power primitives (requiredSampleSizefor independent groups,pairedMdefor paired continuous) operate on different parameter spaces (effect size in d-units vs discordant-cell probabilities). T - Model: opencode/deepseek/deepseek-v4-pro
- Bridge attempts: 1
🎯 Usefulness — sound
Completes the paired-binary power-analysis pair (mcnemarRequiredN / mcnemarPower) following the existing requiredSampleSize / pairedMde pattern for continuous tests — the natural pre-registration companion to the already-shipped mcnemar test.
- Integration: Wire-up is complete: both functions are exported from the barrel file (src/index.ts:363-364) and re-exported from the statistics module (src/statistics.ts:723,753). No in-repo caller exists yet, but
mcnemaritself also has zero in-repo callers — this is substrate (src/statistics.ts:981), expected to be consumed by agent-runtime and agent-knowledge. Same pattern asrequiredSampleSize/ `pairedM - Fit with existing patterns: Precisely follows the established power-analysis pattern. The continuous-test pair (
requiredSampleSizefor two-sample,pairedMdefor paired) lives at src/statistics.ts:671-707; the new functions fill the paired-binary gap at src/statistics.ts:723-773. Same opts shape (optional alpha/power/twoSided defaults), same Infinity/alpha semantics for zero-effect, same Math.ceil return for requiredN, sa - Real-world viability: Edge cases handled: impossible discordant probabilities throw (p10+p01 > 1, negative values), zero effect → Infinity (requiredN) or alpha (power), nPairs ≤ 0 → alpha. The
Math.max(0, ...)guard prevents sqrt of negative in the Lachin denominator for floating-point edge cases. Monte-Carlo validated against exact McNemar (per PR body and test suite at tests/statistics.test.ts:711-748). The asympto - Model: opencode/deepseek/deepseek-v4-pro
- Bridge attempts: 1
No concerns — sound change, no better or existing approach found. ✅
What this audit checks
It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.
| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |
Findings are concerns, not blocks — the human reviewer decides what to do with them.
✅ No Blockers —
|
| deepseek | glm | aggregate | |
|---|---|---|---|
| Readiness | 89 | 83 | 83 |
| Confidence | 75 | 75 | 75 |
| Correctness | 89 | 83 | 83 |
| Security | 89 | 83 | 83 |
| Testing | 89 | 83 | 83 |
| Architecture | 89 | 83 | 83 |
Full multi-shot audit completed 3/3 planned shots over 3 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 3/3 planned shots over 3 changed files. Global verifier still owns final merge decision.
🟡 LOW Inconsistent error type for input validation — src/statistics.ts
Lines 732, 762: mcnemarRequiredN and mcnemarPower throw
new Error(...)for invalid inputs (p10+p01 > 1, negative probs). The file's newer validation convention usesValidationError(from ./errors) — see pairedTTest, wilcoxonSignedRank, corpusInterRaterAgreement, eProcess. The parentmcnemar(line 986) also usesError, so this is consistent within the paired-binary family, but a caller doinginstanceof ValidationErrorchecks would miss these. Consider switching to ValidationError for consistency with the file's prevailing convention, or accept as-is s
🟡 LOW NaN inputs slip past range validation (no finiteness guard) — src/statistics.ts
Sibling requiredSampleSize guards with Number.isFinite (src/statistics.ts:678), but mcnemarRequiredN (:731-733) and mcnemarPower (:761-763) only check
p10 < 0 || p01 < 0 || p10 + p01 > 1. SinceNaN < 0andNaN > 1both evaluate to false, a NaN p10/p01 (or NaN alpha/power) passes validation and propagates as a NaN return rather than throwing — inconsistent with the 'fail loud, no silent NaN' doctrine in CLAUDE.md. Impact is low: callers feeding NaN already have a bug, and NaN propagates visibly. Fix: addif (!Number.isFinite(p10) || !Number.isFinite(p01)) throw new Error(...)before the range check in both functions; optionally validate alpha/power ∈ (0,1) too.
🟡 LOW Optional alpha / twoSided:false branches untested — tests/statistics.test.ts
Both mcnemarRequiredN and mcnemarPower accept
alphaandtwoSidedoptions (src/statistics.ts:723-724, 753), but every test case uses the defaults (alpha=0.05, twoSided=true). A bug in the one-sided zAlpha path (zQuantile(1-alpha) vs 1-alpha/2) or in honoring a custom alpha would not be caught. Fix: add one assertion exercising twoSided:false (one-sided should require fewer pairs / yield higher power) and one with alpha=0.01.
🟡 LOW Variable naming is inverted relative to the values held — tests/statistics.test.ts
Line 718-720:
const big = mcnemarRequiredN({ p10: 0.4, p01: 0.05 })stores N for the big-effect case (≈19 pairs).const small = mcnemarRequiredN({ p10: 0.2, p01: 0.15 })stores N for the small-effect case (≈1585 pairs). The namesbig/smallmatch effect size but their values are the opposite (big holds small N, small holds big N). The assertionexpect(small).toBeGreaterThan(big)is correct. Suggestion: rename tonBigEffect/nSmallEffectornLarge/nSmallto match the N values directly.
🟡 LOW mcnemarPower validation asymmetric with requiredN — tests/statistics.test.ts
The throw test (line 745-748) covers p10+p01>1 only via mcnemarRequiredN and p10<0 only via mcnemarPower. The p01<0 and p10+p01>1 paths inside mcnemarPower, and p01<0 inside mcnemarRequiredN, share the same guard string but are not directly exercised. Minor; the regex assertion would catch a removed guard for the one case tested.
🟡 LOW nPairs<=0 early-return branch untested — tests/statistics.test.ts
mcnemarPower returns
alphawhen nPairs<=0 (src/statistics.ts), but no test covers it. A regression that threw or returned 0 on degenerate input would pass silently. Fix: add expect(mcnemarPower({p10:0.25,p01:0.05,nPairs:0})).toBe(0.05) or assert it matches alpha.
tangletools · 2026-06-19T17:48:40Z · trace
…hardening (#258) Version-only bump of the trio (package.json + python pyproject + __init__). Ships everything merged since 0.93.0: - #253 data-plane reliability/concurrency/scale hardening - #254 paired-binary + coding-eval estimators (McNemar, risk-diff, Wilson, pass@k) - #255 trace-store append/load race + collision-free rollover - #256 McNemar power + required-N - #257 single source of truth for span-timestamp parsing
Follow-up to #254 (which merged McNemar / paired risk-difference / Wilson / pass@k). That PR squash-merged before this commit landed on the branch, so the power calc didn't make it into main — re-homing it here.
Adds the paired-binary power pair, the analogue of
requiredSampleSize/pairedMde(which only cover continuous effects):mcnemarRequiredN({p10, p01, alpha?, power?})— pairs needed for a target power, parametrised by the expected discordant-cell probabilities.mcnemarPower({p10, p01, nPairs, alpha?})— power at a given n (the inverse).Lachin (1992) asymptotic normal model. Monte-Carlo validated against the exact
mcnemartest: at the recommended N, empirical power lands just under the asymptotic target (≈0.76 vs 0.80) — the expected exact-vs-asymptotic conservatism, documented as "treat N as a lower bound"; null effect rejects at ≤alpha. 7 tests; stats suite green; tsc + biome clean.