60 changes: 58 additions & 2 deletions crates/casefold/README.md
@@ -9,6 +9,9 @@ multiple GiB/s — several × faster than a `HashMap` fold table — while using
form, as defined by the Unicode [CaseFolding.txt][cf] data file restricted to
the **simple** (1-to-1) folds (statuses `C` and `S`). Full multi-character
folds (`F`, e.g. `ß` → `ss`) and Turkic locale folds (`T`) are not supported.
The crate also provides [`index_fold`](#single-byte-index-fold), which projects
every character — ASCII or multibyte — onto a single byte, a handy primitive for
case-insensitive n-gram indexing.

[cf]: https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt

@@ -25,6 +28,56 @@ assert_eq!(simple_fold("Hello, WORLD!".to_string()), "hello, world!");
assert_eq!(simple_fold("ÜBER".to_string()), "über");
```

## Single-byte index fold

`index_fold(s: String) -> Vec<u8>` applies the **same** simple fold as
`simple_fold`, then collapses **every character to exactly one byte**:

- ASCII characters become their plain lowercased byte (high bit clear).
- Every multibyte character becomes `0x80 | (cp & 0x7F)` — the low 7 bits of its
*folded* code point, with the high bit set. The high bit is set
unconditionally, so even a multibyte character that folds to ASCII (e.g.
U+212A KELVIN SIGN → `k`) yields `0x80 | b'k'`, never a bare ASCII byte.

```rust
use casefold::index_fold;
assert_eq!(index_fold("Hi!".to_string()), b"hi!");
assert_eq!(index_fold("Ü".to_string()), &[0xFC]); // ü → 0x80 | (0xFC & 0x7F)
assert_eq!(index_fold("中".to_string()), &[0x80 | 0x2D]);
```

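The result is fixed-width (one byte per character). Because every non-ASCII
character becomes a lone byte in `0x80..=0xFF`, the output is in general **not**
valid UTF-8; only pure-ASCII input yields valid UTF-8. To fold a single code
point, use `index_fold_char(c: char) -> u8`, which returns the same byte
`index_fold` would emit for that character.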
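
The mapping rule can be modelled one character at a time. The following is a reference sketch only, not the crate's implementation: `toy_fold` is a hypothetical stand-in for the real simple fold and covers just a handful of code points, and `index_byte_model` / `index_fold_model` are illustrative names.

```rust
/// Hypothetical stand-in for the crate's simple fold.
/// Covers ASCII plus two sample code points; everything else is unchanged.
fn toy_fold(c: char) -> char {
    match c {
        'A'..='Z' => c.to_ascii_lowercase(),
        '\u{DC}' => '\u{FC}',   // LATIN CAPITAL U WITH DIAERESIS -> small
        '\u{212A}' => 'k',      // KELVIN SIGN -> ASCII k
        _ => c,
    }
}

/// One index byte per character, following the two rules in the list above.
fn index_byte_model(c: char) -> u8 {
    let folded = toy_fold(c) as u32;
    if c.is_ascii() {
        // ASCII input: plain folded byte, high bit clear.
        folded as u8
    } else {
        // Multibyte input: low 7 bits of the folded code point, high bit
        // always set, even when the fold result itself is ASCII.
        0x80 | (folded & 0x7F) as u8
    }
}

/// Whole-string model: exactly one output byte per input character.
fn index_fold_model(s: &str) -> Vec<u8> {
    s.chars().map(index_byte_model).collect()
}
```

The high-bit decision depends on the *input* character, not on the folded one, which is why KELVIN SIGN does not collapse onto the plain ASCII `k` byte in this model.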

### Why one byte per character?

This is a building block for **case-insensitive n-gram indexing**. When every
character — ASCII or not — is reduced to a single byte, a fixed *k*-gram is just
*k* contiguous bytes: byte n-grams are trivial to slice, hash, and store, they
are already case-folded so lookups are case-insensitive for free, and a document
of *n* characters yields exactly *n* index bytes. ASCII keeps its natural byte,
and multibyte scripts are projected onto the high half (`0x80–0xFF`) so they
never collide with ASCII.
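
As an illustration of how little machinery a byte n-gram index needs, here is a minimal trigram sketch. It assumes its input is the byte vector already produced by `index_fold`; the function name `build_trigram_index` and the `HashMap` posting-list layout are hypothetical choices for this example, not part of the crate.

```rust
use std::collections::HashMap;

/// Map every 3-byte window of the folded bytes to the offsets where it occurs.
/// Because there is one byte per character, byte offsets are also character
/// offsets in the original text.
fn build_trigram_index(index_bytes: &[u8]) -> HashMap<[u8; 3], Vec<usize>> {
    let mut postings = HashMap::new();
    for (pos, w) in index_bytes.windows(3).enumerate() {
        let key = [w[0], w[1], w[2]];
        postings.entry(key).or_insert_with(Vec::new).push(pos);
    }
    postings
}
```

A query is folded the same way, split into trigrams, and looked up key by key; no UTF-8 boundary handling is needed at any step.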

The projection is intentionally **lossy** — distinct code points that share the
same low 7 bits map to the same byte (most CJK, for instance, lands in a narrow
band). That is fine for an index: use `index_fold` as a cheap *candidate filter*
that never produces false negatives for a case-insensitive match, then verify
exact hits against the original text afterwards.
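
The filter-then-verify pattern can be sketched as follows. This is an illustrative model under stated assumptions: `toy_fold` is a hypothetical ASCII-only stand-in for the crate's simple fold, and `index_byte_model`, `candidates`, and `search` are made-up names. U+4E2D and U+4EAD are used because they share the low 7 bits `0x2D`.

```rust
/// Hypothetical stand-in for the crate's simple fold (ASCII only here).
fn toy_fold(c: char) -> char {
    c.to_ascii_lowercase()
}

/// One index byte per character: ASCII stays low, everything else goes high.
fn index_byte_model(c: char) -> u8 {
    let folded = toy_fold(c) as u32;
    if c.is_ascii() { folded as u8 } else { 0x80 | (folded & 0x7F) as u8 }
}

/// Cheap filter: offsets where the index bytes match. May contain false
/// positives, because distinct code points can share an index byte.
fn candidates(text_idx: &[u8], query_idx: &[u8]) -> Vec<usize> {
    if query_idx.is_empty() || query_idx.len() > text_idx.len() {
        return Vec::new();
    }
    text_idx
        .windows(query_idx.len())
        .enumerate()
        .filter(|(_, w)| *w == query_idx)
        .map(|(pos, _)| pos)
        .collect()
}

/// Full search: filter on index bytes, then verify against the original text.
fn search(text: &str, query: &str) -> Vec<usize> {
    let text_chars: Vec<char> = text.chars().collect();
    let query_chars: Vec<char> = query.chars().collect();
    let text_idx: Vec<u8> = text_chars.iter().map(|&c| index_byte_model(c)).collect();
    let query_idx: Vec<u8> = query_chars.iter().map(|&c| index_byte_model(c)).collect();
    candidates(&text_idx, &query_idx)
        .into_iter()
        .filter(|&pos| {
            // Exact comparison of folded characters removes the collisions.
            text_chars[pos..pos + query_chars.len()]
                .iter()
                .zip(&query_chars)
                .all(|(a, b)| toy_fold(*a) == toy_fold(*b))
        })
        .collect()
}
```

In this model, searching `"\u{4EAD} x \u{4E2D}"` for `"\u{4E2D}"` gives two candidates (offsets 0 and 4), and verification keeps only offset 4.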

Mechanically it reuses the whole fold table; the only addition is a per-run
7-bit `INDEX_DELTA`. By modular arithmetic the folded low 7 bits are
`((cp & 0x7F) + (delta & 0x7F)) mod 128`, so the fold is a single
`wrapping_add` — no UTF-8 reconstruction, no decode, no encode (the stray carry
bit is overwritten by the unconditional `0x80 |`). Because the output is never
longer than the input, it runs fully in place in the input's own buffer, and
pure-ASCII input is returned untouched. It shares `simple_fold`'s
auto-vectorized ASCII pass (~46 GiB/s). Since it emits one byte per character,
it runs *faster* than `simple_fold` on folding-heavy input (e.g. ~1.9 vs
~1.3 GiB/s on length-changing folds, ~1.1 vs ~0.9 GiB/s on mixed BMP), but a
little slower on pure-reject CJK/symbols, where every character still has to
be collapsed to a single byte.
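
The modular-arithmetic claim can be checked with a small sketch. The two functions below are hypothetical helpers written for this illustration (the crate's real code works on UTF-8 bytes and its own table layout); they only show that adding the 7-bit delta to the low 7 bits gives the same index byte as folding the full code point first.

```rust
/// Index byte via the 7-bit shortcut: add the delta reduced mod 128 to the
/// low 7 bits of the code point. Any carry lands in bit 7, which the
/// unconditional `0x80 |` sets anyway.
fn index_byte_via_delta(cp: u32, delta: i32) -> u8 {
    let low = (cp & 0x7F) as u8;
    let d7 = (delta & 0x7F) as u8; // two's-complement AND == delta mod 128
    0x80 | low.wrapping_add(d7)
}

/// Index byte the long way: fold the whole code point, then take 7 bits.
fn index_byte_direct(cp: u32, delta: i32) -> u8 {
    let folded = (cp as i32 + delta) as u32;
    0x80 | (folded & 0x7F) as u8
}
```

For U+00DC with delta +32 both functions give `0xFC`, and for KELVIN SIGN (U+212A, delta `0x6B - 0x212A`) both give `0xEB`, matching the examples earlier in this section.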

## Why does this crate exist?

Unicode 16.0 defines 1484 simple-fold mappings. Common ways to store them:
@@ -68,7 +121,7 @@ query:
`wrapping_add`, one 4-byte store — no decode, no encode. Writing fewer/more
bytes than were read handles length-changing folds (`K`→`k`, `Ⱥ`→`ⱥ`).

### Table layout (1776 B total)
### Table layout (2014 B total)

| Component | Bytes |
|-------------------------------------------------|------:|
@@ -78,8 +131,11 @@ query:
| `RUN_END_LOW[238 + 8]: u8` (clean scan key, `end & 0x3F`; +8 SWAR pad) | 246 |
| `RUN_START_STRIDE[238]: u8` (`start & 0x3F` \| stride bit) | 238 |
| `BYTE_DELTA[238]: u32` (little-endian fold delta per run) | 952 |
| **Total** | **1776** |
| `INDEX_DELTA[238]: u8` (7-bit per-run fold delta, `index_fold` only) | 238 |
| **Total** | **2014** |

The `simple_fold` path uses 1776 B of this; the 238 B `INDEX_DELTA` side table
powers [`index_fold`](#single-byte-index-fold) only.
(Splitting runs at byte-delta boundaries raises the run count from 227 to 238.)
The data file is parsed at build time by `build.rs`, which emits the packed
`static` tables to `OUT_DIR/table.rs`.
13 changes: 11 additions & 2 deletions crates/casefold/benchmarks/conversion.rs
@@ -1,7 +1,8 @@
//! Benchmarks for `casefold::simple_fold`, comparing it against several
//! baselines on representative inputs. Each input is run through six variants:
//! baselines on representative inputs. Each input is run through these variants:
//!
//! - `casefold::simple_fold` — the implementation under test.
//! - `casefold::index_fold` — the one-byte-per-character index fold.
//! - `HashMap::fold_into_bytes` — a HashMap-based case fold over raw UTF-8.
//! - `str::to_lowercase` — straightforward Unicode lowercasing baseline.
//! - `chars().flat_map(to_lowercase)` — the per-char flat-map variant.
@@ -13,7 +14,7 @@
//! cases (e.g. `Σ` final-sigma context, `İ` → `i\u{0307}`). These benchmarks
//! are about throughput on equivalent workloads, not output equality.

use casefold::{simple_fold, utf8_len};
use casefold::{index_fold, simple_fold, utf8_len};
use casefold_benchmarks::{hashmap_fold_utf8, reference_map_utf8, FoldHashMap};
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion, Throughput};
use std::hint::black_box;
@@ -156,6 +157,14 @@ fn bench_conversion(c: &mut Criterion, name: &str, input: &str) {
},
);

    group.bench_function(BenchmarkId::new("Casefold::index_fold", input.len()), |b| {
        b.iter_batched(
            || input.to_string(),
            |s| index_fold(black_box(s)),
            criterion::BatchSize::SmallInput,
        );
    });

let fold_map = reference_map_utf8();
group.bench_function(
BenchmarkId::new("HashMap::fold_into_bytes (UTF-8 u32)", input.len()),
15 changes: 14 additions & 1 deletion crates/casefold/build.rs
@@ -307,9 +307,21 @@ fn emit_tables(folds: &[Fold], runs: &[Run]) -> String {
.max()
.unwrap_or(0);

// Parallel 7-bit index deltas, one per run, for `index_fold`. The fold
// collapses each code point to `cp & 0x7F`; by modular arithmetic the folded
// low-7-bit value is `((cp & 0x7F) + (delta & 0x7F)) & 0x7F`, so storing the
// code-point delta reduced mod 128 lets `index_fold` derive the folded index
// byte with one `wrapping_add` + mask — no UTF-8 reconstruction. The high
// bit is added unconditionally at write time, so only 7 bits are stored.
let index_deltas: Vec<u8> = runs.iter().map(|r| (r.delta & 0x7F) as u8).collect();

// Sanity: size accounting (printed as build warnings for visibility).
let index_bytes = page_bitmap.len() * 8 + popcnt_samples.len() + page_offset.len();
let total = index_bytes + run_end_low.len() + run_start_stride.len() + byte_deltas.len() * 4;
    let total = index_bytes
        + run_end_low.len()
        + run_start_stride.len()
        + byte_deltas.len() * 4
        + index_deltas.len();
if env::var_os("CASEFOLD_BUILD_INFO").is_some() {
println!(
"cargo:warning=casefold table: {} fold entries, {} runs, {} populated pages, {} bytes total ({:.2} bits/entry), max |delta| = {}, max |byte_delta| = {}",
@@ -338,6 +350,7 @@ fn emit_tables(folds: &[Fold], runs: &[Run]) -> String {
emit_u8_array(&mut s, "RUN_END_LOW", &run_end_low);
emit_u8_array(&mut s, "RUN_START_STRIDE", &run_start_stride);
emit_u32_array(&mut s, "BYTE_DELTA", &byte_deltas);
emit_u8_array(&mut s, "INDEX_DELTA", &index_deltas);

s
}