github · aneubeck · Jun 10, 2026
@@ -9,6 +9,9 @@ multiple GiB/s — several × faster than a `HashMap` fold table — while using
 form, as defined by the Unicode [CaseFolding.txt][cf] data file restricted to
 the **simple** (1-to-1) folds (statuses `C` and `S`). Full multi-character
 folds (`F`, e.g. `ß` → `ss`) and Turkic locale folds (`T`) are not supported.
+The crate also provides [`index_fold`](#single-byte-index-fold), which projects
+every character — ASCII or multibyte — onto a single byte, a handy primitive for
+case-insensitive n-gram indexing.
 
 [cf]: https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt
 
@@ -25,6 +28,56 @@ assert_eq!(simple_fold("Hello, WORLD!".to_string()), "hello, world!");
 assert_eq!(simple_fold("ÜBER".to_string()), "über");
 ```
 
+## Single-byte index fold
+
+`index_fold(s: String) -> Vec<u8>` applies the **same** simple fold as
+`simple_fold`, then collapses **every character to exactly one byte**:
+
+- ASCII characters become their plain lowercased byte (high bit clear).
+- Every multibyte character becomes `0x80 | (cp & 0x7F)` — the low 7 bits of its
+  *folded* code point, with the high bit set. The high bit is set
+  unconditionally, so even a multibyte character that folds to ASCII (e.g.
+  U+212A KELVIN SIGN → `k`) yields `0x80 | b'k'`, never a bare ASCII byte.
+
+```rust
+use casefold::index_fold;
+assert_eq!(index_fold("Hi!".to_string()), b"hi!");
+assert_eq!(index_fold("Ü".to_string()), &[0xFC]);          // ü → 0x80 | (0xFC & 0x7F)
+assert_eq!(index_fold("中".to_string()), &[0x80 | 0x2D]);
+```
+
+The result is fixed-width (one byte per character) and is in general **not**
+valid UTF-8, hence the `Vec<u8>` return type. To fold a single code point, use
+`index_fold_char(c: char) -> u8`, which returns the byte `index_fold` emits for it.
+
+### Why one byte per character?
+
+This is a building block for **case-insensitive n-gram indexing**. When every
+character — ASCII or not — is reduced to a single byte, a fixed *k*-gram is just
+*k* contiguous bytes: byte n-grams are trivial to slice, hash, and store, they
+are already case-folded so lookups are case-insensitive for free, and a document
+of *n* characters yields exactly *n* index bytes. ASCII keeps its natural byte,
+and multibyte scripts are projected onto the high half (`0x80–0xFF`) so they
+never collide with ASCII.
+
+The projection is intentionally **lossy**: multibyte code points whose folded
+values share the same low 7 bits map to the same byte (all of CJK, for instance,
+is squeezed into the 128 values `0x80–0xFF`). That is fine for an index: use
+`index_fold` as a cheap *candidate filter* that never produces false negatives
+for a case-insensitive match, then verify each candidate against the original text.
+
+Mechanically it reuses the whole fold table; the only addition is a per-run
+7-bit `INDEX_DELTA`. By modular arithmetic the folded low 7 bits are
+`((cp & 0x7F) + (delta & 0x7F)) mod 128`, so the fold is a single
+`wrapping_add` — no UTF-8 reconstruction, no decode, no encode (the stray carry
+bit is overwritten by the unconditional `0x80 |`). Because the output is never
+longer than the input, it runs fully in place in the input's own buffer, and
+pure-ASCII input is returned untouched. It shares `simple_fold`'s
+auto-vectorized ASCII pass (~46 GiB/s) and, since it emits one byte per
+character, runs *faster* than `simple_fold` on folding-heavy input (e.g. ~1.9
+vs ~1.3 GiB/s on length-changing folds, ~1.1 vs ~0.9 GiB/s on mixed BMP) and a
+little slower on pure-reject CJK/symbols due to character collapsing.
+
 ## Why does this crate exist?
 
 Unicode 16.0 defines 1484 simple-fold mappings. Common ways to store them:
@@ -68,7 +121,7 @@ query:
    `wrapping_add`, one 4-byte store — no decode, no encode. Writing fewer/more
    bytes than were read handles length-changing folds (`K`→`k`, `Ⱥ`→`ⱥ`).
 
-### Table layout (1776 B total)
+### Table layout (2014 B total)
 
 | Component                                       | Bytes |
 |-------------------------------------------------|------:|
@@ -78,8 +131,11 @@ query:
 | `RUN_END_LOW[238 + 8]: u8` (clean scan key, `end & 0x3F`; +8 SWAR pad) | 246 |
 | `RUN_START_STRIDE[238]: u8` (`start & 0x3F` \| stride bit) | 238 |
 | `BYTE_DELTA[238]: u32` (little-endian fold delta per run) | 952 |
-| **Total**                                       | **1776** |
+| `INDEX_DELTA[238]: u8` (7-bit per-run fold delta, `index_fold` only) | 238 |
+| **Total**                                       | **2014** |
 
+The `simple_fold` path uses 1776 B of this; the 238 B `INDEX_DELTA` side table
+powers [`index_fold`](#single-byte-index-fold) only.
 (Splitting runs at byte-delta boundaries raises the run count from 227 to 238.)
 The data file is parsed at build time by `build.rs`, which emits the packed
 `static` tables to `OUT_DIR/table.rs`.
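
The projection rule and the candidate-filter idea from the README text above can be sketched in a few lines. This is a minimal illustration, not the crate's implementation: `project`, `trigram_set`, and `may_contain` are hypothetical helper names, the folded character is assumed to be supplied by the caller, and trigrams (k = 3) are an arbitrary choice.

```rust
use std::collections::HashSet;

/// Hypothetical helper: project one character onto its index byte, given the
/// original character and its simple-folded form (however that was obtained).
fn project(original: char, folded: char) -> u8 {
    let low7 = (folded as u32 & 0x7F) as u8;
    if original.is_ascii() {
        low7 // ASCII folds to ASCII, so this is the lowercased byte itself
    } else {
        0x80 | low7 // multibyte input always gets the high bit set
    }
}

/// Collect every 3-byte window of an index-folded document.
fn trigram_set(index: &[u8]) -> HashSet<[u8; 3]> {
    index.windows(3).map(|w| [w[0], w[1], w[2]]).collect()
}

/// Candidate filter: `false` means the query cannot occur in the document;
/// `true` means "maybe", and the hit still has to be verified against the
/// original text. Queries shorter than 3 bytes have no trigrams and
/// therefore always pass.
fn may_contain(doc: &HashSet<[u8; 3]>, query_index: &[u8]) -> bool {
    query_index
        .windows(3)
        .all(|w| doc.contains(&[w[0], w[1], w[2]]))
}
```

Because both the document and the query go through the same fold before the trigrams are taken, a case-insensitive match always survives the filter; only the lossy projection can let extra candidates through.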

@@ -1,7 +1,8 @@
 //! Benchmarks for `casefold::simple_fold`, comparing it against several
-//! baselines on representative inputs. Each input is run through six variants:
+//! baselines on representative inputs. Each input is run through these variants:
 //!
 //! - `casefold::simple_fold` — the implementation under test.
+//! - `casefold::index_fold` — the one-byte-per-character index fold.
 //! - `HashMap::fold_into_bytes` — a HashMap-based case fold over raw UTF-8.
 //! - `str::to_lowercase` — straightforward Unicode lowercasing baseline.
 //! - `chars().flat_map(to_lowercase)` — the per-char flat-map variant.
@@ -13,7 +14,7 @@
 //! cases (e.g. `Σ` final-sigma context, `İ` → `i\u{0307}`). These benchmarks
 //! are about throughput on equivalent workloads, not output equality.
 
-use casefold::{simple_fold, utf8_len};
+use casefold::{index_fold, simple_fold, utf8_len};
 use casefold_benchmarks::{hashmap_fold_utf8, reference_map_utf8, FoldHashMap};
 use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion, Throughput};
 use std::hint::black_box;
@@ -156,6 +157,14 @@ fn bench_conversion(c: &mut Criterion, name: &str, input: &str) {
         },
     );
 
+    group.bench_function(BenchmarkId::new("Casefold::index_fold", input.len()), |b| {
+        b.iter_batched(
+            || input.to_string(),
+            |s| index_fold(black_box(s)),
+            criterion::BatchSize::SmallInput,
+        );
+    });
+
     let fold_map = reference_map_utf8();
     group.bench_function(
         BenchmarkId::new("HashMap::fold_into_bytes (UTF-8 u32)", input.len()),

@@ -307,9 +307,21 @@ fn emit_tables(folds: &[Fold], runs: &[Run]) -> String {
         .max()
         .unwrap_or(0);
 
+    // Parallel 7-bit index deltas, one per run, for `index_fold`. The fold
+    // collapses each code point to `cp & 0x7F`; by modular arithmetic the folded
+    // low-7-bit value is `((cp & 0x7F) + (delta & 0x7F)) & 0x7F`, so storing the
+    // code-point delta reduced mod 128 lets `index_fold` derive the folded index
+    // byte with one `wrapping_add` + mask — no UTF-8 reconstruction. The high
+    // bit is added unconditionally at write time, so only 7 bits are stored.
+    let index_deltas: Vec<u8> = runs.iter().map(|r| (r.delta & 0x7F) as u8).collect();
+
     // Sanity: size accounting (printed as build warnings for visibility).
     let index_bytes = page_bitmap.len() * 8 + popcnt_samples.len() + page_offset.len();
-    let total = index_bytes + run_end_low.len() + run_start_stride.len() + byte_deltas.len() * 4;
+    let total = index_bytes
+        + run_end_low.len()
+        + run_start_stride.len()
+        + byte_deltas.len() * 4
+        + index_deltas.len();
     if env::var_os("CASEFOLD_BUILD_INFO").is_some() {
         println!(
             "cargo:warning=casefold table: {} fold entries, {} runs, {} populated pages, {} bytes total ({:.2} bits/entry), max |delta| = {}, max |byte_delta| = {}",
@@ -338,6 +350,7 @@ fn emit_tables(folds: &[Fold], runs: &[Run]) -> String {
     emit_u8_array(&mut s, "RUN_END_LOW", &run_end_low);
     emit_u8_array(&mut s, "RUN_START_STRIDE", &run_start_stride);
     emit_u32_array(&mut s, "BYTE_DELTA", &byte_deltas);
+    emit_u8_array(&mut s, "INDEX_DELTA", &index_deltas);
 
     s
 }
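
The modular-arithmetic claim in the `index_deltas` comment above can be checked with a small sketch. The function names are hypothetical and `delta` is assumed to be a signed code-point difference (`i32`); the real types in `build.rs` may differ.

```rust
/// Reference: low 7 bits of the folded code point, computed on full values.
fn low7_reference(cp: u32, delta: i32) -> u8 {
    ((cp as i64 + delta as i64) & 0x7F) as u8
}

/// Same value using only the stored 7-bit delta (`delta & 0x7F`).
fn low7_fast(cp: u32, delta: i32) -> u8 {
    let index_delta = (delta & 0x7F) as u8; // what the side table would hold
    let low7 = (cp & 0x7F) as u8;
    low7.wrapping_add(index_delta) & 0x7F
}

/// Index byte for a multibyte character. No explicit mask is needed here:
/// the sum of two 7-bit values can only carry into bit 7, and the
/// unconditional `0x80 |` sets that bit anyway.
fn index_byte(cp: u32, delta: i32) -> u8 {
    let index_delta = (delta & 0x7F) as u8;
    0x80 | ((cp & 0x7F) as u8).wrapping_add(index_delta)
}
```

For example, U+212A KELVIN SIGN folds to `k` (U+006B): the stored delta is `(0x6B - 0x212A) & 0x7F = 0x41`, and `0x2A + 0x41 = 0x6B`, so the index byte is `0x80 | b'k'`.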