
Add rapidfuzz-based word-level pass (default)#21

Merged
Lasse Borgholt (borgholt) merged 9 commits into main from rapidfuzz-word-level-backtrace
Jun 23, 2026
Conversation

@borgholt Lasse Borgholt (borgholt) commented Jun 22, 2026

Collaborator

Summary

The word-level pass anchors token-level matches before beam search, so only the ambiguous spans between anchors get beam-searched. The existing pass (align_with_word_level_pass) builds the full backtrace graph and uses only matches common to all optimal paths (get_unambiguous_node_matches) — safe, but expensive (it materializes the whole DP backtrace graph).

This PR adds an alternative pass that takes the matches from a single optimal Levenshtein alignment via rapidfuzz, and makes it the default. The catch: rapidfuzz only emits the non-match operations (insert/delete/replace), so matches are inferred as the complement of the edited token indices (the same trick already used in baselines/rapidfuzz_word_alignment.py).
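
The complement trick can be sketched in a few lines. This is a minimal illustration, not the PR's `get_rapidfuzz_match_indices`: the function name is hypothetical, and the edit operations are passed in as plain `(tag, src_pos, dest_pos)` tuples, which is assumed here to mirror the shape of the operations rapidfuzz reports. The reference is assumed to be the source sequence and the hypothesis the destination.

```python
def match_indices_from_editops(ops, ref_len, hyp_len):
    """Infer matched token pairs as the complement of the edited indices.

    ops: iterable of (tag, ref_pos, hyp_pos) with tag in
         {"replace", "insert", "delete"} (assumed shape, see lead-in).
    Returns a list of (hyp_idx, ref_idx) pairs in increasing order.
    """
    # A reference token is "edited" if it was replaced or deleted.
    edited_ref = {src for tag, src, _ in ops if tag in ("replace", "delete")}
    # A hypothesis token is "edited" if it was replaced or inserted.
    edited_hyp = {dst for tag, _, dst in ops if tag in ("replace", "insert")}

    # Everything that was not edited must have been matched.
    ref_matches = [i for i in range(ref_len) if i not in edited_ref]
    hyp_matches = [j for j in range(hyp_len) if j not in edited_hyp]
    if len(ref_matches) != len(hyp_matches):
        raise ValueError("edit operations are inconsistent with the lengths")

    # Alignments are monotone, so the k-th unedited hypothesis token
    # pairs with the k-th unedited reference token.
    return list(zip(hyp_matches, ref_matches))


# ref: the cat sat        hyp: the dog sat down
ops = [("replace", 1, 1), ("insert", 3, 3)]
print(match_indices_from_editops(ops, ref_len=3, hyp_len=4))  # [(0, 0), (2, 2)]
```

The pairing step relies only on the alignment being monotone, which is why no explicit "match" operation is needed.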

Selectable via a new word_level_method arg on error_align():

  • "rapidfuzz" (default) — matches from one optimal Levenshtein alignment.
  • "unambiguous" — the existing graph-based pass.
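
A rough sketch of what such a dispatch might look like, assuming a simple name-to-function table. The stub functions and `error_align_sketch` are hypothetical stand-ins; only the two method names, the `"rapidfuzz"` default, and the `ValueError` on unknown values come from this PR.

```python
def _rapidfuzz_pass_stub(ref, hyp):
    # Hypothetical stand-in for the rapidfuzz-based word-level pass.
    return ("rapidfuzz", ref, hyp)


def _unambiguous_pass_stub(ref, hyp):
    # Hypothetical stand-in for the existing graph-based pass.
    return ("unambiguous", ref, hyp)


WORD_LEVEL_PASSES = {
    "rapidfuzz": _rapidfuzz_pass_stub,
    "unambiguous": _unambiguous_pass_stub,
}


def error_align_sketch(ref, hyp, word_level_method="rapidfuzz"):
    """Select a word-level pass by name and fail loudly on unknown names."""
    try:
        word_level_pass = WORD_LEVEL_PASSES[word_level_method]
    except KeyError:
        valid = ", ".join(repr(name) for name in sorted(WORD_LEVEL_PASSES))
        raise ValueError(
            f"Unknown word_level_method {word_level_method!r}; expected one of: {valid}"
        ) from None
    return word_level_pass(ref, hyp)


print(error_align_sketch("the cat", "the dog")[0])                  # rapidfuzz
print(error_align_sketch("the cat", "the dog", "unambiguous")[0])  # unambiguous
```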

Tradeoff

rapidfuzz fixes anchors from one optimal alignment rather than only the provably-unambiguous ones, so it commits to more split points (faster, smaller beam-search spans) but may anchor a match the full graph pass would have left for beam search. Match semantics are identical in both paths (exact equality of normalized tokens), so simple cases agree.
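
To make the difference concrete, the brute-force sketch below enumerates every optimal Levenshtein alignment of two short token lists and compares "matches common to all optimal paths" with "matches from one optimal path". It is an illustration under simplifying assumptions (unit costs, exhaustive enumeration, hypothetical function name), not how either pass is implemented in the library.

```python
def all_optimal_match_sets(ref, hyp):
    """Return one frozenset of (hyp_idx, ref_idx) matches per optimal alignment."""
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between the suffixes ref[i:] and hyp[j:].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n:
                dist[i][j] = m - j
            elif j == m:
                dist[i][j] = n - i
            else:
                diag = dist[i + 1][j + 1] + (ref[i] != hyp[j])
                dist[i][j] = min(diag, dist[i + 1][j] + 1, dist[i][j + 1] + 1)

    def walk(i, j):
        # Follow every step that keeps the path optimal.
        if i == n and j == m:
            yield frozenset()
            return
        if i < n and j < m:
            cost = int(ref[i] != hyp[j])
            if dist[i][j] == dist[i + 1][j + 1] + cost:
                for rest in walk(i + 1, j + 1):
                    # Only exact-equality steps count as matches.
                    yield (rest | {(j, i)}) if cost == 0 else rest
        if i < n and dist[i][j] == dist[i + 1][j] + 1:
            yield from walk(i + 1, j)  # delete ref[i]
        if j < m and dist[i][j] == dist[i][j + 1] + 1:
            yield from walk(i, j + 1)  # insert hyp[j]

    return list(walk(0, 0))


# Ambiguous case: the single "a" in hyp can pair with either "a" in ref.
paths = all_optimal_match_sets(ref=["a", "a"], hyp=["a"])
unambiguous = frozenset.intersection(*paths)
print(len(paths))          # 2
print(sorted(unambiguous))  # []
```

Here an intersection-based pass leaves the span entirely to beam search, while a single-alignment pass anchors whichever pairing its one optimal path happens to contain. For an input with a unique optimal alignment (for example a single substitution), both approaches yield the same anchors.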

On a long earnings21 example the rapidfuzz pass ran ~34× faster (1.3 s vs 45 s).

Changes

  • Add align_with_rapidfuzz_word_level_pass + get_rapidfuzz_match_indices.
  • Extract the shared subspan-extraction loop into align_from_match_indices (used by both passes).
  • Dispatch on word_level_method in error_align(), with a clear error for unknown values.
  • Promote rapidfuzz to a core runtime dependency (was evaluation-only, Python-3.12-gated).
  • Bump version to 0.1.0b10.
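
The shared subspan-extraction step can be pictured roughly as follows. This is a hedged sketch of the idea only: `split_on_anchors` and its return format are hypothetical, and the real `align_from_match_indices` presumably hands the gap spans to beam search rather than returning them.

```python
def split_on_anchors(ref, hyp, matches):
    """Split two token lists into anchored matches and the gaps between them.

    matches: sorted list of (hyp_idx, ref_idx) anchor pairs.
    Returns a list of (kind, hyp_tokens, ref_tokens) with kind "match" or "gap".
    """
    segments = []
    hyp_start, ref_start = 0, 0
    for hyp_idx, ref_idx in matches:
        # Anything between the previous anchor and this one is an ambiguous
        # span that would still need a finer-grained search.
        if hyp_idx > hyp_start or ref_idx > ref_start:
            segments.append(("gap", hyp[hyp_start:hyp_idx], ref[ref_start:ref_idx]))
        segments.append(("match", [hyp[hyp_idx]], [ref[ref_idx]]))
        hyp_start, ref_start = hyp_idx + 1, ref_idx + 1
    # Trailing tokens after the last anchor form a final gap.
    if hyp_start < len(hyp) or ref_start < len(ref):
        segments.append(("gap", hyp[hyp_start:], ref[ref_start:]))
    return segments


ref = ["the", "cat", "sat"]
hyp = ["the", "dog", "sat", "down"]
for segment in split_on_anchors(ref, hyp, [(0, 0), (2, 2)]):
    print(segment)
# ('match', ['the'], ['the'])
# ('gap', ['dog'], ['cat'])
# ('match', ['sat'], ['sat'])
# ('gap', ['down'], [])
```

More anchors mean shorter gaps, which is where the speedup of the rapidfuzz pass would come from.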

Tests

  • test_error_align parametrized over both methods (same expected op sequence).
  • Unit test for get_rapidfuzz_match_indices (complement inference, (hyp_idx, ref_idx) ordering).
  • Test that an unknown word_level_method raises ValueError.

All 17 tests pass (with typeguard); ruff clean.

🤖 Generated with Claude Code

The word-level pass anchors token-level matches before beam search so that
only ambiguous spans between anchors are beam-searched. The existing pass
builds the full backtrace graph and uses only matches common to all optimal
paths (get_unambiguous_node_matches), which is safe but expensive.

Add an alternative pass that takes the matches from a single optimal
Levenshtein alignment via rapidfuzz. rapidfuzz only emits non-match ops, so
matches are inferred as the complement of the edited token indices. This is
selectable via the new word_level_method arg ("rapidfuzz" default,
"unambiguous" for the graph-based pass).

- Promote rapidfuzz to a core runtime dependency (was evaluation-only).
- Extract the shared subspan-extraction loop into align_from_match_indices.
- Bump version to 0.1.0b10.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 22, 2026 13:20
codecov Bot commented Jun 22, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 50.47%. Comparing base (902d623) to head (01f521a).

Additional details and impacted files
@@             Coverage Diff             @@
##             main      #21       +/-   ##
===========================================
- Coverage   93.13%   50.47%   -42.67%     
===========================================
  Files           9       14        +5     
  Lines         641     1369      +728     
  Branches      104      229      +125     
===========================================
+ Hits          597      691       +94     
- Misses         17      651      +634     
  Partials       27       27               

Copilot AI left a comment

Contributor

Pull request overview

This PR makes the word-level anchoring pass faster by defaulting to a rapidfuzz-derived single optimal Levenshtein alignment (instead of materializing the full DP backtrace graph), and adds a new word_level_method switch on error_align() to select between the two approaches.

Changes:

  • Add a rapidfuzz-based word-level pass (align_with_rapidfuzz_word_level_pass) and helper get_rapidfuzz_match_indices.
  • Extract shared “anchor spans then beam-search gaps” logic into align_from_match_indices and reuse it from both word-level passes.
  • Promote rapidfuzz to a core dependency, bump version, and expand tests to cover both methods + unknown method errors.

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/error_align/error_align.py | Adds word_level_method dispatch, rapidfuzz-based anchoring, and shared anchor-to-subspan extraction. |
| tests/test_default.py | Parametrizes test_error_align over both methods and adds unit/validation tests for the new API. |
| pyproject.toml | Bumps version and moves rapidfuzz into core runtime dependencies. |
| poetry.lock | Updates lockfile to reflect rapidfuzz moving from optional to main dependency. |

Comment thread src/error_align/error_align.py
Comment thread tests/test_default.py Outdated
- align_from_match_indices: reorder the loop-tail unpacking to hyp-first to
  match the rest of the function (no behavior change).
- Parametrize test_get_rapidfuzz_match_indices over replace/insert/delete so
  the complement logic is covered when ref/hyp lengths differ.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@borgholt Lasse Borgholt (borgholt) merged commit 9c5a1d4 into main Jun 23, 2026
9 checks passed
@borgholt Lasse Borgholt (borgholt) deleted the rapidfuzz-word-level-backtrace branch June 23, 2026 07:16
Lasse Borgholt (borgholt) added a commit that referenced this pull request Jun 23, 2026
The baselines/ modules (POWER, etc.) are optional and largely untested.
They only started counting toward coverage when rapidfuzz became a core
dependency (#21), since that made error_align.baselines importable in CI —
dropping reported coverage from ~93% to ~50%. The codecov.yml ignore glob
("src/error_align/baselines/*") never matched their installed path
("error_align/baselines/..."), so they slipped through.

Omit them at the coverage layer (source-independent) and fix the codecov
ignore glob to match any baselines path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Lasse Borgholt (borgholt) added a commit that referenced this pull request Jun 23, 2026
* Fix coverage double-counting between src/ and installed package

CI installs the package non-editably (to compile the C++ extension), so
imports resolve to site-packages while --cov=src separately measured the
untouched src/ tree at 0% — doubling the denominator and roughly halving
reported coverage (~93% -> ~50%).

Measure a single source (--cov=error_align) and add [tool.coverage.paths]
to merge the src/ and site-packages copies into one logical path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Omit research baselines from coverage measurement

The baselines/ modules (POWER, etc.) are optional and largely untested.
They only started counting toward coverage when rapidfuzz became a core
dependency (#21), since that made error_align.baselines importable in CI —
dropping reported coverage from ~93% to ~50%. The codecov.yml ignore glob
("src/error_align/baselines/*") never matched their installed path
("error_align/baselines/..."), so they slipped through.

Omit them at the coverage layer (source-independent) and fix the codecov
ignore glob to match any baselines path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Consolidate coverage config in .coveragerc (address Copilot review)

.coveragerc takes precedence over pyproject.toml, so the [tool.coverage.*]
config added earlier was silently ignored — two configs that could diverge.
Remove it and keep .coveragerc as the single source of truth.

Also fix its omit glob: src/error_align/baselines/* only matched the editable
layout, so in CI (site-packages) baselines were not omitted by coverage.py and
were excluded only by the codecov ignore. Use */baselines/* to match both.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
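
The glob mismatch described above can be reproduced with the standard library. This sketch uses `fnmatch` as an approximation of coverage.py's pattern matching, which may differ in details, and both file paths (including the `power.py` file name and the virtualenv location) are made up for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical paths for the same module in the two layouts.
editable_path = "src/error_align/baselines/power.py"
installed_path = "/venv/lib/python3.12/site-packages/error_align/baselines/power.py"

old_glob = "src/error_align/baselines/*"  # only anchored to the src/ layout
new_glob = "*/baselines/*"                # matches any baselines directory

for label, path in [("editable", editable_path), ("installed", installed_path)]:
    print(label, fnmatchcase(path, old_glob), fnmatchcase(path, new_glob))
# editable True True
# installed False True
```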

* Update citation to ICASSP 2026 publication

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Restore title casing in citation title

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>