Add rapidfuzz-based word-level pass (default) by borgholt · Pull Request #21 · corticph/error-align

Lasse Borgholt (borgholt) · 2026-06-22T13:20:51Z

Summary

The word-level pass anchors token-level matches before beam search, so only the ambiguous spans between anchors get beam-searched. The existing pass (align_with_word_level_pass) builds the full backtrace graph and uses only matches common to all optimal paths (get_unambiguous_node_matches) — safe, but expensive (it materializes the whole DP backtrace graph).

This PR adds an alternative pass that takes the matches from a single optimal Levenshtein alignment via rapidfuzz, and makes it the default. The catch: rapidfuzz only emits the non-match operations (insert/delete/replace), so matches are inferred as the complement of the edited token indices (the same trick already used in baselines/rapidfuzz_word_alignment.py).
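The complement inference described above can be sketched in a few lines. This is an illustration only, not the PR's `get_rapidfuzz_match_indices`: the function name `match_indices_from_editops` is hypothetical, and the edit operations are passed in as plain `(tag, src_pos, dest_pos)` tuples so the block runs without rapidfuzz installed. With rapidfuzz available, such tuples could presumably be obtained from `rapidfuzz.distance.Levenshtein.editops(ref, hyp).as_list()`, with the reference as the source sequence.

```python
def match_indices_from_editops(ref_len, hyp_len, editops):
    """Infer matched token pairs as the complement of the edited indices.

    `editops` is a list of (tag, src_pos, dest_pos) tuples where the source
    is the reference and the destination is the hypothesis (an assumption of
    this sketch). Returns (hyp_idx, ref_idx) pairs in increasing order.
    """
    # Reference positions consumed by an edit: replaced or deleted tokens.
    edited_ref = {src for tag, src, _ in editops if tag in ("replace", "delete")}
    # Hypothesis positions produced by an edit: replaced or inserted tokens.
    edited_hyp = {dst for tag, _, dst in editops if tag in ("replace", "insert")}

    free_ref = [i for i in range(ref_len) if i not in edited_ref]
    free_hyp = [j for j in range(hyp_len) if j not in edited_hyp]

    # Every untouched reference token must pair with one untouched hypothesis token.
    if len(free_ref) != len(free_hyp):
        raise ValueError("edit operations are inconsistent with the sequence lengths")

    # An alignment is monotonic, so the k-th free hyp index pairs with the k-th free ref index.
    return list(zip(free_hyp, free_ref))


# ref = ["the", "cat", "sat"], hyp = ["the", "big", "cat", "sat"]: one insertion at hyp[1].
insert_case = match_indices_from_editops(3, 4, [("insert", 1, 1)])
# ref = ["a", "b", "c"], hyp = ["a", "c"]: one deletion of ref[1].
delete_case = match_indices_from_editops(3, 2, [("delete", 1, 1)])
```

The insert and delete cases are the interesting ones, because the reference and hypothesis indices drift apart after the edit and the pairs are no longer of the form `(k, k)`.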

Selectable via a new word_level_method arg on error_align():

"rapidfuzz" (default) — matches from one optimal Levenshtein alignment.
"unambiguous" — the existing graph-based pass.

Tradeoff

rapidfuzz fixes anchors from one optimal alignment rather than only the provably-unambiguous ones, so it commits to more split points (faster, smaller beam-search spans) but may anchor a match the full graph pass would have left for beam search. Match semantics are identical in both paths (exact equality of normalized tokens), so simple cases agree.
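A toy case makes the tradeoff concrete. The sketch below is not the library's graph code: it enumerates every optimal Levenshtein alignment by brute force (exponential, so only suitable for tiny inputs) and intersects their matches. The names `optimal_match_sets` and `unambiguous_matches` are hypothetical.

```python
def optimal_match_sets(ref, hyp):
    """Return one frozenset of (hyp_idx, ref_idx) matches per optimal alignment."""
    n, m = len(ref), len(hyp)
    # dist[i][j] is the edit distance between the suffixes ref[i:] and hyp[j:].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n:
                dist[i][j] = m - j
            elif j == m:
                dist[i][j] = n - i
            else:
                diagonal = dist[i + 1][j + 1] + (ref[i] != hyp[j])
                dist[i][j] = min(diagonal, dist[i + 1][j] + 1, dist[i][j + 1] + 1)

    def walk(i, j):
        # Follow every step that stays on an optimal path.
        if i == n and j == m:
            yield frozenset()
            return
        if i < n and j < m and dist[i][j] == dist[i + 1][j + 1] + (ref[i] != hyp[j]):
            for rest in walk(i + 1, j + 1):
                if ref[i] == hyp[j]:
                    yield rest | {(j, i)}  # exact match
                else:
                    yield rest  # substitution
        if i < n and dist[i][j] == dist[i + 1][j] + 1:
            yield from walk(i + 1, j)  # deletion of ref[i]
        if j < m and dist[i][j] == dist[i][j + 1] + 1:
            yield from walk(i, j + 1)  # insertion of hyp[j]

    return list(walk(0, 0))


def unambiguous_matches(ref, hyp):
    """Matches shared by all optimal alignments."""
    return frozenset.intersection(*optimal_match_sets(ref, hyp))


# The reference "a" can pair with either "a" in the hypothesis at the same cost.
candidates = optimal_match_sets(["a"], ["a", "a"])
shared = unambiguous_matches(["a"], ["a", "a"])
```

Here `candidates` holds two alignments, one matching `hyp[0]` and one matching `hyp[1]`, while `shared` is empty. A pass that keeps only matches common to all optimal paths would leave this span to beam search, whereas a single-alignment pass commits to whichever of the two its alignment happened to pick.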

On a long earnings21 example the rapidfuzz pass ran ~34× faster (1.3 s vs 45 s).

Changes

Add align_with_rapidfuzz_word_level_pass + get_rapidfuzz_match_indices.
Extract the shared subspan-extraction loop into align_from_match_indices (used by both passes).
Dispatch on word_level_method in error_align(), with a clear error for unknown values.
Promote rapidfuzz to a core runtime dependency (was evaluation-only, Python-3.12-gated).
Bump version to 0.1.0b10.
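The shared subspan-extraction step can be pictured as cutting both token sequences at the anchors and handing only the gaps to beam search. The following is a rough sketch under that reading, not the body of `align_from_match_indices`; the helper name `gaps_between_anchors` and its return shape are assumptions made for illustration.

```python
def gaps_between_anchors(ref, hyp, matches):
    """Return (ref_span, hyp_span) pairs for the regions between anchors.

    `matches` is a list of (hyp_idx, ref_idx) pairs, strictly increasing in
    both coordinates.
    """
    spans = []
    hyp_start = ref_start = 0
    # A sentinel one past the end of both sequences closes the trailing gap.
    for hyp_idx, ref_idx in [*matches, (len(hyp), len(ref))]:
        if hyp_idx > hyp_start or ref_idx > ref_start:
            spans.append((ref[ref_start:ref_idx], hyp[hyp_start:hyp_idx]))
        # Resume right after the anchor on both sides.
        hyp_start, ref_start = hyp_idx + 1, ref_idx + 1
    return spans


ref = "the cat sat on mat".split()
hyp = "the bat sat on the mat".split()
anchors = [(0, 0), (2, 2), (3, 3), (5, 4)]
gaps = gaps_between_anchors(ref, hyp, anchors)
```

For this input only two short gaps remain, `cat` versus `bat` and an inserted `the`, which is why more anchors mean smaller beam-search spans.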

Tests

test_error_align parametrized over both methods (same expected op sequence).
Unit test for get_rapidfuzz_match_indices (complement inference, (hyp_idx, ref_idx) ordering).
Test that an unknown word_level_method raises ValueError.

All 17 tests pass (with typeguard); ruff clean.

🤖 Generated with Claude Code

The word-level pass anchors token-level matches before beam search so that only ambiguous spans between anchors are beam-searched. The existing pass builds the full backtrace graph and uses only matches common to all optimal paths (get_unambiguous_node_matches), which is safe but expensive.

Add an alternative pass that takes the matches from a single optimal Levenshtein alignment via rapidfuzz. rapidfuzz only emits non-match ops, so matches are inferred as the complement of the edited token indices. This is selectable via the new word_level_method arg ("rapidfuzz" default, "unambiguous" for the graph-based pass).

- Promote rapidfuzz to a core runtime dependency (was evaluation-only).
- Extract the shared subspan-extraction loop into align_from_match_indices.
- Bump version to 0.1.0b10.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

codecov · 2026-06-22T13:22:13Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 50.47%. Comparing base (902d623) to head (01f521a).

Additional details and impacted files

@@             Coverage Diff             @@
##             main      #21       +/-   ##
===========================================
- Coverage   93.13%   50.47%   -42.67%     
===========================================
  Files           9       14        +5     
  Lines         641     1369      +728     
  Branches      104      229      +125     
===========================================
+ Hits          597      691       +94     
- Misses         17      651      +634     
  Partials       27       27


Copilot

Pull request overview

This PR makes the word-level anchoring pass faster by defaulting to a rapidfuzz-derived single optimal Levenshtein alignment (instead of materializing the full DP backtrace graph), and adds a new word_level_method switch on error_align() to select between the two approaches.

Changes:

Add a rapidfuzz-based word-level pass (align_with_rapidfuzz_word_level_pass) and helper get_rapidfuzz_match_indices.
Extract shared “anchor spans then beam-search gaps” logic into align_from_match_indices and reuse it from both word-level passes.
Promote rapidfuzz to a core dependency, bump version, and expand tests to cover both methods + unknown method errors.

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.

File	Description
`src/error_align/error_align.py`	Adds `word_level_method` dispatch, rapidfuzz-based anchoring, and shared anchor-to-subspan extraction.
`tests/test_default.py`	Parametrizes `test_error_align` over both methods and adds unit/validation tests for the new API.
`pyproject.toml`	Bumps version and moves `rapidfuzz` into core runtime dependencies.
`poetry.lock`	Updates lockfile to reflect rapidfuzz moving from optional to main dependency.


Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- align_from_match_indices: reorder the loop-tail unpacking to hyp-first to match the rest of the function (no behavior change).
- Parametrize test_get_rapidfuzz_match_indices over replace/insert/delete so the complement logic is covered when ref/hyp lengths differ.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


The baselines/ modules (POWER, etc.) are optional and largely untested. They only started counting toward coverage when rapidfuzz became a core dependency (#21), since that made error_align.baselines importable in CI — dropping reported coverage from ~93% to ~50%. The codecov.yml ignore glob ("src/error_align/baselines/*") never matched their installed path ("error_align/baselines/..."), so they slipped through. Omit them at the coverage layer (source-independent) and fix the codecov ignore glob to match any baselines path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Fix coverage double-counting between src/ and installed package

  CI installs the package non-editably (to compile the C++ extension), so imports resolve to site-packages while --cov=src separately measured the untouched src/ tree at 0%, doubling the denominator and roughly halving reported coverage (~93% -> ~50%). Measure a single source (--cov=error_align) and add [tool.coverage.paths] to merge the src/ and site-packages copies into one logical path.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Omit research baselines from coverage measurement

  The baselines/ modules (POWER, etc.) are optional and largely untested. They only started counting toward coverage when rapidfuzz became a core dependency (#21), since that made error_align.baselines importable in CI, dropping reported coverage from ~93% to ~50%. The codecov.yml ignore glob ("src/error_align/baselines/*") never matched their installed path ("error_align/baselines/..."), so they slipped through. Omit them at the coverage layer (source-independent) and fix the codecov ignore glob to match any baselines path.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Consolidate coverage config in .coveragerc (address Copilot review)

  .coveragerc takes precedence over pyproject.toml, so the [tool.coverage.*] config added earlier was silently ignored, leaving two configs that could diverge. Remove it and keep .coveragerc as the single source of truth. Also fix its omit glob: src/error_align/baselines/* only matched the editable layout, so in CI (site-packages) baselines were not omitted by coverage.py and were excluded only by the codecov ignore. Use */baselines/* to match both.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Update citation to ICASSP 2026 publication

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Restore title casing in citation title

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
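The glob mismatch behind the coverage drop is easy to reproduce in isolation. The sketch below uses the standard library's `fnmatch` as a stand-in for shell-style pattern matching; coverage.py and Codecov each apply their own matching rules, so this only illustrates the prefix problem, not their exact behavior. The file name `power.py` and the virtualenv path are made up for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical paths for one baselines module in the two layouts.
editable_path = "src/error_align/baselines/power.py"
installed_path = "/venv/lib/python3.12/site-packages/error_align/baselines/power.py"

old_glob = "src/error_align/baselines/*"  # anchored to the source-tree prefix
new_glob = "*/baselines/*"                # matches any directory prefix

results = {
    ("old", "editable"): fnmatchcase(editable_path, old_glob),
    ("old", "installed"): fnmatchcase(installed_path, old_glob),
    ("new", "editable"): fnmatchcase(editable_path, new_glob),
    ("new", "installed"): fnmatchcase(installed_path, new_glob),
}

for (glob_name, layout), matched in results.items():
    print(f"{glob_name:>3} glob vs {layout:<9} path: {matched}")
```

Only the old glob against the installed path fails to match, which corresponds to the CI situation described in the commit messages above.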

Copilot AI review requested due to automatic review settings June 22, 2026 13:20

Copilot started reviewing on behalf of Lasse Borgholt (borgholt) June 22, 2026 13:21

Copilot AI reviewed Jun 22, 2026


Comment thread src/error_align/error_align.py

Comment thread tests/test_default.py Outdated

Lasse Borgholt (borgholt) added the minor label Jun 22, 2026

Lasse Borgholt (borgholt) self-assigned this Jun 22, 2026

Lasse Borgholt (borgholt) and others added 6 commits June 22, 2026 15:26

Document rapidfuzz word-level pass in README

592fa37

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Note rapidfuzz pass is most effectful for longer examples

8f1df6e

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Reword README note for clarity

71b00f8

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Clarify rapidfuzz speedup comes mainly from cheaper anchor computation

5dfccc6

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Quantify rapidfuzz speedup (~30x) on longer examples

c5ea899

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Lasse Borgholt (borgholt) requested a review from Jakob Drachmann Havtorn (JakobHavtorn) June 22, 2026 13:48

Lasse Borgholt (borgholt) and others added 2 commits June 22, 2026 15:52

Tweak dataset name wording in README (Earnings-21)

b2951a5

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Remove emoji from README update note

01f521a

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Jakob Drachmann Havtorn (JakobHavtorn) approved these changes Jun 22, 2026


Lasse Borgholt (borgholt) merged commit 9c5a1d4 into main Jun 23, 2026
9 checks passed

Lasse Borgholt (borgholt) deleted the rapidfuzz-word-level-backtrace branch June 23, 2026 07:16

Lasse Borgholt (borgholt) mentioned this pull request Jun 23, 2026

Fix coverage double-counting (src/ vs installed package) #22

Merged

Lasse Borgholt (borgholt) merged 9 commits into main from rapidfuzz-word-level-backtrace
