Skip to content

feat(statistics): McNemar power + required-N (pre-registration for paired-binary A/B)#256

Merged
drewstone merged 1 commit into
mainfrom
feat/mcnemar-power
Jun 19, 2026
Merged

feat(statistics): McNemar power + required-N (pre-registration for paired-binary A/B)#256
drewstone merged 1 commit into
mainfrom
feat/mcnemar-power

Conversation

@drewstone

Copy link
Copy Markdown
Contributor

Follow-up to #254 (which merged McNemar / paired risk-difference / Wilson / pass@k). That PR squash-merged before this commit landed on the branch, so the power calc didn't make it into main — re-homing it here.

Adds the paired-binary power pair, the analogue of requiredSampleSize/pairedMde (which only cover continuous effects):

  • mcnemarRequiredN({p10, p01, alpha?, power?}) — pairs needed for a target power, parametrised by the expected discordant-cell probabilities.
  • mcnemarPower({p10, p01, nPairs, alpha?}) — power at a given n (the inverse).

Lachin (1992) asymptotic normal model. Monte-Carlo validated against the exact mcnemar test: at the recommended N, empirical power lands just under the asymptotic target (≈0.76 vs 0.80) — the expected exact-vs-asymptotic conservatism, documented as "treat N as a lower bound"; null effect rejects at ≤alpha. 7 tests; stats suite green; tsc + biome clean.

…ired-binary A/B)

Completes the paired-binary family: mcnemarRequiredN (pairs needed for a target power, parametrised by the expected discordant-cell probabilities p10/p01) and mcnemarPower (power at a given n) — the paired-binary analogue of requiredSampleSize/pairedMde, which only cover continuous effects. Lachin (1992) asymptotic normal model.

Monte-Carlo validated against the exact mcnemar test: at the recommended N, empirical power lands just under the asymptotic target (0.76 vs 0.80) — the expected exact-vs-asymptotic conservatism, documented as 'treat N as a lower bound'; null effect rejects at <=alpha. 7 tests; full stats suite 93 green.

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Auto-approved PR — f95edfbe

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-19T17:41:10Z

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Value Audit — sound

Verdict sound
Concerns 0 (none)
Heuristic 0.0s
Duplication 0.0s
Interrogation 56.8s (2 bridge agents)
Total 56.8s

💰 Value — sound

Adds Lachin-asymptotic McNemar power + required-N functions (the paired-binary counterparts to existing continuous-power primitives), well-integrated into the statistics module with 7 passing tests — clean, coherent, no better approach exists.

  • What it does: Adds mcnemarRequiredN(p10, p01, alpha?, power?) and mcnemarPower(p10, p01, nPairs, alpha?) to src/statistics.ts — Lachin (1992) asymptotic power analysis and sample-size planning for McNemar's paired-binary test. mcnemarRequiredN returns the number of paired observations needed for a target power given expected discordant-cell probabilities; mcnemarPower returns the achieved power at a g
  • Goals it achieves: Closes the paired-binary gap in the power-analysis family. The existing requiredSampleSize/pairedMde only cover continuous effects (Cohen's d, paired deltas). McNemar test users — working with pass/fail paired outcomes in eval — had no way to plan sample size or compute achieved power. This adds the direct analogue, enabling pre-registration planning for paired-binary A/B tests (how many pairs
  • Assessment: Coherent and well-grounded. Both functions sit in the src/statistics.ts power-analysis section (line 663, directly after pairedMde), use the same zQuantile/normalCdf helpers, share the {alpha?, power?, twoSided?} options convention, follow the codebase's fail-loud pattern (throws on invalid p10/p01), and match the existing requiredSampleSize convention of returning Infinity for zero
  • Better / existing approach: none — this is the right approach. Searched src/ for any existing McNemar power/sample-size functions (mcnemar.*power, requiredN, nPairs, discordant.*power): found zero. The existing continuous-power primitives (requiredSampleSize for independent groups, pairedMde for paired continuous) operate on different parameter spaces (effect size in d-units vs discordant-cell probabilities). T
  • Model: opencode/deepseek/deepseek-v4-pro
  • Bridge attempts: 1

🎯 Usefulness — sound

Completes the paired-binary power-analysis pair (mcnemarRequiredN / mcnemarPower) following the existing requiredSampleSize / pairedMde pattern for continuous tests — the natural pre-registration companion to the already-shipped mcnemar test.

  • Integration: Wire-up is complete: both functions are exported from the barrel file (src/index.ts:363-364) and re-exported from the statistics module (src/statistics.ts:723,753). No in-repo caller exists yet, but mcnemar itself also has zero in-repo callers — this is substrate (src/statistics.ts:981), expected to be consumed by agent-runtime and agent-knowledge. Same pattern as requiredSampleSize / `pairedM
  • Fit with existing patterns: Precisely follows the established power-analysis pattern. The continuous-test pair (requiredSampleSize for two-sample, pairedMde for paired) lives at src/statistics.ts:671-707; the new functions fill the paired-binary gap at src/statistics.ts:723-773. Same opts shape (optional alpha/power/twoSided defaults), same Infinity/alpha semantics for zero-effect, same Math.ceil return for requiredN, sa
  • Real-world viability: Edge cases handled: impossible discordant probabilities throw (p10+p01 > 1, negative values), zero effect → Infinity (requiredN) or alpha (power), nPairs ≤ 0 → alpha. The Math.max(0, ...) guard prevents sqrt of negative in the Lachin denominator for floating-point edge cases. Monte-Carlo validated against exact McNemar (per PR body and test suite at tests/statistics.test.ts:711-748). The asympto
  • Model: opencode/deepseek/deepseek-v4-pro
  • Bridge attempts: 1

No concerns — sound change, no better or existing approach found. ✅


What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

Pass What it asks
Heuristic Vague title? Whitespace-only or cruft-bearing diff? (content signals only)
Duplication Do added function/class names already exist elsewhere in the repo?
Value Audit What does it do? What goal does it achieve? Is it good? Better architecture or already-exists?
Usefulness Audit Does it integrate and fit? Will it hold up in real use and actually get used?

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260619T174433Z

@tangletools

Copy link
Copy Markdown
Contributor

✅ No Blockers — f95edfbe

Readiness 83/100 · Confidence 75/100 · 6 findings (6 low)

deepseek glm aggregate
Readiness 89 83 83
Confidence 75 75 75
Correctness 89 83 83
Security 89 83 83
Testing 89 83 83
Architecture 89 83 83

Full multi-shot audit completed 3/3 planned shots over 3 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 3/3 planned shots over 3 changed files. Global verifier still owns final merge decision.

🟡 LOW Inconsistent error type for input validation — src/statistics.ts

Lines 732, 762: mcnemarRequiredN and mcnemarPower throw new Error(...) for invalid inputs (p10+p01 > 1, negative probs). The file's newer validation convention uses ValidationError (from ./errors) — see pairedTTest, wilcoxonSignedRank, corpusInterRaterAgreement, eProcess. The parent mcnemar (line 986) also uses Error, so this is consistent within the paired-binary family, but a caller doing instanceof ValidationError checks would miss these. Consider switching to ValidationError for consistency with the file's prevailing convention, or accept as-is s

🟡 LOW NaN inputs slip past range validation (no finiteness guard) — src/statistics.ts

Sibling requiredSampleSize guards with Number.isFinite (src/statistics.ts:678), but mcnemarRequiredN (:731-733) and mcnemarPower (:761-763) only check p10 < 0 || p01 < 0 || p10 + p01 > 1. Since NaN < 0 and NaN > 1 both evaluate to false, a NaN p10/p01 (or NaN alpha/power) passes validation and propagates as a NaN return rather than throwing — inconsistent with the 'fail loud, no silent NaN' doctrine in CLAUDE.md. Impact is low: callers feeding NaN already have a bug, and NaN propagates visibly. Fix: add if (!Number.isFinite(p10) || !Number.isFinite(p01)) throw new Error(...) before the range check in both functions; optionally validate alpha/power ∈ (0,1) too.

🟡 LOW Optional alpha / twoSided:false branches untested — tests/statistics.test.ts

Both mcnemarRequiredN and mcnemarPower accept alpha and twoSided options (src/statistics.ts:723-724, 753), but every test case uses the defaults (alpha=0.05, twoSided=true). A bug in the one-sided zAlpha path (zQuantile(1-alpha) vs 1-alpha/2) or in honoring a custom alpha would not be caught. Fix: add one assertion exercising twoSided:false (one-sided should require fewer pairs / yield higher power) and one with alpha=0.01.

🟡 LOW Variable naming is inverted relative to the values held — tests/statistics.test.ts

Line 718-720: const big = mcnemarRequiredN({ p10: 0.4, p01: 0.05 }) stores N for the big-effect case (≈19 pairs). const small = mcnemarRequiredN({ p10: 0.2, p01: 0.15 }) stores N for the small-effect case (≈1585 pairs). The names big/small match effect size but their values are the opposite (big holds small N, small holds big N). The assertion expect(small).toBeGreaterThan(big) is correct. Suggestion: rename to nBigEffect/nSmallEffect or nLarge/nSmall to match the N values directly.

🟡 LOW mcnemarPower validation asymmetric with requiredN — tests/statistics.test.ts

The throw test (line 745-748) covers p10+p01>1 only via mcnemarRequiredN and p10<0 only via mcnemarPower. The p01<0 and p10+p01>1 paths inside mcnemarPower, and p01<0 inside mcnemarRequiredN, share the same guard string but are not directly exercised. Minor; the regex assertion would catch a removed guard for the one case tested.

🟡 LOW nPairs<=0 early-return branch untested — tests/statistics.test.ts

mcnemarPower returns alpha when nPairs<=0 (src/statistics.ts), but no test covers it. A regression that threw or returned 0 on degenerate input would pass silently. Fix: add expect(mcnemarPower({p10:0.25,p01:0.05,nPairs:0})).toBe(0.05) or assert it matches alpha.


tangletools · 2026-06-19T17:48:40Z · trace

@drewstone drewstone merged commit 45e7cbf into main Jun 19, 2026
1 check passed
@drewstone drewstone mentioned this pull request Jun 19, 2026
drewstone added a commit that referenced this pull request Jun 19, 2026
…hardening (#258)

Version-only bump of the trio (package.json + python pyproject + __init__).
Ships everything merged since 0.93.0:
- #253 data-plane reliability/concurrency/scale hardening
- #254 paired-binary + coding-eval estimators (McNemar, risk-diff, Wilson, pass@k)
- #255 trace-store append/load race + collision-free rollover
- #256 McNemar power + required-N
- #257 single source of truth for span-timestamp parsing
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants