Add rapidfuzz-based word-level pass (default) #21
Merged
The word-level pass anchors token-level matches before beam search so that
only ambiguous spans between anchors are beam-searched. The existing pass
builds the full backtrace graph and uses only matches common to all optimal
paths (get_unambiguous_node_matches), which is safe but expensive.
Add an alternative pass that takes the matches from a single optimal
Levenshtein alignment via rapidfuzz. rapidfuzz only emits non-match ops, so
matches are inferred as the complement of the edited token indices. This is
selectable via the new word_level_method arg ("rapidfuzz" default,
"unambiguous" for the graph-based pass).
- Promote rapidfuzz to a core runtime dependency (was evaluation-only).
- Extract the shared subspan-extraction loop into align_from_match_indices.
- Bump version to 0.1.0b10.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
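The complement inference described in the commit message can be sketched as follows. This is a minimal illustration, not the code from this PR: the function name `infer_match_indices` is hypothetical, the ops are hand-written tuples in the `(tag, src_pos, dest_pos)` shape that rapidfuzz edit ops use, and it assumes ref is the source sequence and hyp the destination.

```python
from typing import Iterable, List, Tuple

# An edit op in the (tag, src_pos, dest_pos) shape; hand-written here, not produced by rapidfuzz.
EditOp = Tuple[str, int, int]


def infer_match_indices(
    ops: Iterable[EditOp], ref_len: int, hyp_len: int
) -> List[Tuple[int, int]]:
    """Return (hyp_idx, ref_idx) pairs for the tokens no edit op touched.

    Assumption: ref is the source sequence and hyp is the destination.
    """
    edited_ref, edited_hyp = set(), set()
    for tag, src_pos, dest_pos in ops:
        if tag in ("replace", "delete"):
            edited_ref.add(src_pos)   # this ref token is consumed by an edit
        if tag in ("replace", "insert"):
            edited_hyp.add(dest_pos)  # this hyp token is produced by an edit
    free_ref = [i for i in range(ref_len) if i not in edited_ref]
    free_hyp = [j for j in range(hyp_len) if j not in edited_hyp]
    # In a valid alignment the untouched tokens pair up one-to-one, in order.
    if len(free_ref) != len(free_hyp):
        raise ValueError("edit ops are inconsistent with the sequence lengths")
    return list(zip(free_hyp, free_ref))


ref = ["the", "cat", "sat"]
hyp = ["the", "dog", "sat", "down"]
ops = [("replace", 1, 1), ("insert", 3, 3)]  # cat -> dog, then append "down"
matches = infer_match_indices(ops, len(ref), len(hyp))
print(matches)  # [(0, 0), (2, 2)]
```

Because the untouched indices are paired in order, the result stays correct when ref and hyp differ in length, which is the case the insert and delete ops exercise.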
Codecov Report
✅ All modified and coverable lines are covered by tests.

| Coverage Diff | main | #21 | +/- |
|---|---|---|---|
| Coverage | 93.13% | 50.47% | -42.67% |
| Files | 9 | 14 | +5 |
| Lines | 641 | 1369 | +728 |
| Branches | 104 | 229 | +125 |
| Hits | 597 | 691 | +94 |
| Misses | 17 | 651 | +634 |
| Partials | 27 | 27 | |
Pull request overview
This PR makes the word-level anchoring pass faster by defaulting to a rapidfuzz-derived single optimal Levenshtein alignment (instead of materializing the full DP backtrace graph), and adds a new word_level_method switch on error_align() to select between the two approaches.
Changes:
- Add a rapidfuzz-based word-level pass (align_with_rapidfuzz_word_level_pass) and helper get_rapidfuzz_match_indices.
- Extract shared “anchor spans then beam-search gaps” logic into align_from_match_indices and reuse it from both word-level passes.
- Promote rapidfuzz to a core dependency, bump version, and expand tests to cover both methods + unknown method errors.
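The "anchor spans, then beam-search the gaps" idea can be sketched as below. This is an illustration under assumptions, not the body of align_from_match_indices: the name `gaps_between_anchors` and the half-open span representation are hypothetical, and in the real code each gap would be handed to the beam search rather than just returned.

```python
from typing import Iterator, List, Tuple

Span = Tuple[int, int]  # half-open [start, end) token range


def gaps_between_anchors(
    matches: List[Tuple[int, int]], hyp_len: int, ref_len: int
) -> Iterator[Tuple[Span, Span]]:
    """Yield the (hyp_span, ref_span) gaps left between consecutive anchors.

    `matches` holds (hyp_idx, ref_idx) pairs sorted in increasing order.
    """
    prev_hyp, prev_ref = -1, -1
    # A sentinel anchor just past the end flushes the trailing gap.
    for hyp_idx, ref_idx in matches + [(hyp_len, ref_len)]:
        hyp_span = (prev_hyp + 1, hyp_idx)
        ref_span = (prev_ref + 1, ref_idx)
        # Skip gaps that are empty on both sides (adjacent anchors).
        if hyp_span[0] < hyp_span[1] or ref_span[0] < ref_span[1]:
            yield hyp_span, ref_span
        prev_hyp, prev_ref = hyp_idx, ref_idx


# Anchors (0, 0) and (2, 2) in a 4-token hyp and a 3-token ref.
gaps = list(gaps_between_anchors([(0, 0), (2, 2)], hyp_len=4, ref_len=3))
print(gaps)  # [((1, 2), (1, 2)), ((3, 4), (3, 3))]
```

More anchors mean more, shorter gaps, which is why the choice of anchoring method directly affects how much work is left for the beam search.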
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/error_align/error_align.py | Adds word_level_method dispatch, rapidfuzz-based anchoring, and shared anchor-to-subspan extraction. |
| tests/test_default.py | Parametrizes test_error_align over both methods and adds unit/validation tests for the new API. |
| pyproject.toml | Bumps version and moves rapidfuzz into core runtime dependencies. |
| poetry.lock | Updates lockfile to reflect rapidfuzz moving from optional to main dependency. |
- align_from_match_indices: reorder the loop-tail unpacking to hyp-first to match the rest of the function (no behavior change).
- Parametrize test_get_rapidfuzz_match_indices over replace/insert/delete so the complement logic is covered when ref/hyp lengths differ.
Jakob Drachmann Havtorn (JakobHavtorn) approved these changes Jun 22, 2026
Lasse Borgholt (borgholt) added a commit that referenced this pull request Jun 23, 2026

The baselines/ modules (POWER, etc.) are optional and largely untested. They only started counting toward coverage when rapidfuzz became a core dependency (#21), since that made error_align.baselines importable in CI, dropping reported coverage from ~93% to ~50%. The codecov.yml ignore glob ("src/error_align/baselines/*") never matched their installed path ("error_align/baselines/..."), so they slipped through. Omit them at the coverage layer (source-independent) and fix the codecov ignore glob to match any baselines path.
Lasse Borgholt (borgholt) added a commit that referenced this pull request Jun 23, 2026

* Fix coverage double-counting between src/ and installed package

  CI installs the package non-editably (to compile the C++ extension), so imports resolve to site-packages while --cov=src separately measured the untouched src/ tree at 0%, doubling the denominator and roughly halving reported coverage (~93% -> ~50%). Measure a single source (--cov=error_align) and add [tool.coverage.paths] to merge the src/ and site-packages copies into one logical path.

* Omit research baselines from coverage measurement

  The baselines/ modules (POWER, etc.) are optional and largely untested. They only started counting toward coverage when rapidfuzz became a core dependency (#21), since that made error_align.baselines importable in CI, dropping reported coverage from ~93% to ~50%. The codecov.yml ignore glob ("src/error_align/baselines/*") never matched their installed path ("error_align/baselines/..."), so they slipped through. Omit them at the coverage layer (source-independent) and fix the codecov ignore glob to match any baselines path.

* Consolidate coverage config in .coveragerc (address Copilot review)

  .coveragerc takes precedence over pyproject.toml, so the [tool.coverage.*] config added earlier was silently ignored, leaving two configs that could diverge. Remove it and keep .coveragerc as the single source of truth. Also fix its omit glob: src/error_align/baselines/* only matched the editable layout, so in CI (site-packages) baselines were not omitted by coverage.py and were excluded only by the codecov ignore. Use */baselines/* to match both.

* Update citation to ICASSP 2026 publication

* Restore title casing in citation title
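The glob mismatch behind the coverage drop can be approximated with the standard library. This is only a rough illustration: the two file paths are hypothetical, and `fnmatchcase` stands in for the matching rules of coverage.py and Codecov, which are not identical to it.

```python
from fnmatch import fnmatchcase

# Hypothetical paths: one from an editable checkout, one from a CI site-packages install.
editable = "src/error_align/baselines/power.py"
installed = "/venv/lib/python3.12/site-packages/error_align/baselines/power.py"

old_glob = "src/error_align/baselines/*"  # only anchored to the src/ layout
new_glob = "*/baselines/*"                # matches any baselines directory

results = {
    path: (fnmatchcase(path, old_glob), fnmatchcase(path, new_glob))
    for path in (editable, installed)
}
for path, (old_hit, new_hit) in results.items():
    print(f"{path}: old={old_hit} new={new_hit}")
```

With these inputs the old pattern matches only the editable path, while the broader pattern matches both layouts.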
Summary
The word-level pass anchors token-level matches before beam search, so only the ambiguous spans between anchors get beam-searched. The existing pass (align_with_word_level_pass) builds the full backtrace graph and uses only matches common to all optimal paths (get_unambiguous_node_matches). That is safe but expensive, because it materializes the whole DP backtrace graph.

This PR adds an alternative pass that takes the matches from a single optimal Levenshtein alignment via rapidfuzz, and makes it the default. The catch: rapidfuzz only emits the non-match operations (insert/delete/replace), so matches are inferred as the complement of the edited token indices (the same trick already used in baselines/rapidfuzz_word_alignment.py).

Selectable via a new word_level_method arg on error_align():
- "rapidfuzz" (default): matches from one optimal Levenshtein alignment.
- "unambiguous": the existing graph-based pass.

Tradeoff
rapidfuzz fixes anchors from one optimal alignment rather than only the provably-unambiguous ones, so it commits to more split points (faster, smaller beam-search spans) but may anchor a match the full graph pass would have left for beam search. Match semantics are identical in both paths (exact equality of normalized tokens), so simple cases agree.
On a long earnings21 example the rapidfuzz pass ran ~34× faster (1.3 s vs 45 s).
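To make the tradeoff concrete, here is a brute-force sketch that enumerates every optimal alignment of a tiny example and compares the anchors they all share with the anchors of one arbitrary optimal alignment. It is not the graph-based pass from this repository: `optimal_match_sets` is a hypothetical helper, exponential in the worst case, and meant only for toy inputs.

```python
from typing import List, Sequence, Set, Tuple

Match = Tuple[int, int]  # (hyp_idx, ref_idx)


def optimal_match_sets(ref: Sequence[str], hyp: Sequence[str]) -> List[Set[Match]]:
    """Enumerate the match set of every optimal Levenshtein alignment."""
    n, m = len(ref), len(hyp)
    # d[i][j] = edit distance between ref[:i] and hyp[:j].
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + sub)

    results: List[Set[Match]] = []

    def walk(i: int, j: int, matches: List[Match]) -> None:
        # Follow every backtrace edge that preserves the optimal cost.
        if i == 0 and j == 0:
            results.append(set(matches))
            return
        if i > 0 and d[i][j] == d[i - 1][j] + 1:  # delete ref[i-1]
            walk(i - 1, j, matches)
        if j > 0 and d[i][j] == d[i][j - 1] + 1:  # insert hyp[j-1]
            walk(i, j - 1, matches)
        if i > 0 and j > 0:
            equal = ref[i - 1] == hyp[j - 1]
            if d[i][j] == d[i - 1][j - 1] + (0 if equal else 1):
                extra = [(j - 1, i - 1)] if equal else []  # only equal tokens anchor
                walk(i - 1, j - 1, matches + extra)

    walk(n, m, [])
    return results


ref = ["we", "saw", "the", "the", "report"]
hyp = ["we", "saw", "the", "report"]
paths = optimal_match_sets(ref, hyp)
unambiguous = set.intersection(*paths)  # anchors every optimal path agrees on
one_path = paths[0]                     # what a single alignment commits to
print(sorted(unambiguous))              # [(0, 0), (1, 1), (3, 4)]
print(sorted(one_path - unambiguous))   # the extra anchor chosen for the repeated "the"
```

Here the repeated "the" gives two optimal alignments, so the intersection leaves that token for beam search, while a single alignment commits to one of the two pairings.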
Changes
- Add align_with_rapidfuzz_word_level_pass + get_rapidfuzz_match_indices.
- Extract align_from_match_indices (used by both passes).
- Add word_level_method in error_align(), with a clear error for unknown values.
- Promote rapidfuzz to a core runtime dependency (was evaluation-only, Python-3.12-gated).
- Bump version to 0.1.0b10.

Tests
- test_error_align parametrized over both methods (same expected op sequence).
- get_rapidfuzz_match_indices (complement inference, (hyp_idx, ref_idx) ordering).
- Unknown word_level_method raises ValueError.

All 17 tests pass (with typeguard); ruff clean.
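The word_level_method switch and its validation can be pictured roughly like this. It is a sketch only: `pick_word_level_pass` is a hypothetical helper that returns the name of the pass as a string, the error message wording is invented, and the real error_align() signature is not shown on this page.

```python
def pick_word_level_pass(word_level_method: str = "rapidfuzz") -> str:
    """Map a word_level_method value to the name of the pass it selects."""
    passes = {
        "rapidfuzz": "align_with_rapidfuzz_word_level_pass",  # new default
        "unambiguous": "align_with_word_level_pass",          # existing graph-based pass
    }
    if word_level_method not in passes:
        valid = ", ".join(repr(name) for name in passes)
        raise ValueError(
            f"Unknown word_level_method {word_level_method!r}; expected one of: {valid}"
        )
    return passes[word_level_method]


print(pick_word_level_pass())               # align_with_rapidfuzz_word_level_pass
print(pick_word_level_pass("unambiguous"))  # align_with_word_level_pass
```

Failing fast on an unrecognized value keeps a typo from silently falling back to the default method.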
🤖 Generated with Claude Code