diff --git a/crates/casefold/README.md b/crates/casefold/README.md
index 211fe24..6487f58 100644
--- a/crates/casefold/README.md
+++ b/crates/casefold/README.md
@@ -9,6 +9,9 @@ multiple GiB/s — several × faster than a `HashMap` fold table — while using
 form, as defined by the Unicode [CaseFolding.txt][cf] data file restricted to
 the **simple** (1-to-1) folds (statuses `C` and `S`). Full multi-character
 folds (`F`, e.g. `ß` → `ss`) and Turkic locale folds (`T`) are not supported.
+The crate also provides [`index_fold`](#single-byte-index-fold), which projects
+every character — ASCII or multibyte — onto a single byte, a handy primitive for
+case-insensitive n-gram indexing.
 
 [cf]: https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt
 
@@ -25,6 +28,56 @@ assert_eq!(simple_fold("Hello, WORLD!".to_string()), "hello, world!");
 assert_eq!(simple_fold("ÜBER".to_string()), "über");
 ```
 
+## Single-byte index fold
+
+`index_fold(s: String) -> Vec<u8>` applies the **same** simple fold as
+`simple_fold`, then collapses **every character to exactly one byte**:
+
+- ASCII characters become their plain lowercased byte (high bit clear).
+- Every multibyte character becomes `0x80 | (cp & 0x7F)` — the low 7 bits of its
+  *folded* code point, with the high bit set. The high bit is set
+  unconditionally, so even a multibyte character that folds to ASCII (e.g.
+  U+212A KELVIN SIGN → `k`) yields `0x80 | b'k'`, never a bare ASCII byte.
+
+```rust
+use casefold::index_fold;
+assert_eq!(index_fold("Hi!".to_string()), b"hi!");
+assert_eq!(index_fold("Ü".to_string()), &[0xFC]);          // ü → 0x80 | (0xFC & 0x7F)
+assert_eq!(index_fold("中".to_string()), &[0x80 | 0x2D]);
+```
+
+The result is fixed-width (one byte per character) and is therefore **not**
+guaranteed to be valid UTF-8. To fold a single `char`, use
+`index_fold_char(c: char) -> u8`, which returns the byte `index_fold` emits for it.
+
+### Why one byte per character?
+
+This is a building block for **case-insensitive n-gram indexing**. When every
+character — ASCII or not — is reduced to a single byte, a fixed *k*-gram is just
+*k* contiguous bytes: byte n-grams are trivial to slice, hash, and store, they
+are already case-folded so lookups are case-insensitive for free, and a document
+of *n* characters yields exactly *n* index bytes. ASCII keeps its natural byte,
+and multibyte scripts are projected onto the high half (`0x80–0xFF`) so they
+never collide with ASCII.
+
+The projection is intentionally **lossy**: distinct code points whose folded
+forms share the same low 7 bits map to the same byte (all non-ASCII text shares
+just 128 values). That is fine for an index: use `index_fold` as a cheap
+*candidate filter* that never produces false negatives for a simple-fold
+case-insensitive match, then verify candidate hits against the original text.
+
+Mechanically it reuses the whole fold table; the only addition is a per-run
+7-bit `INDEX_DELTA`. By modular arithmetic the folded low 7 bits are
+`((cp & 0x7F) + (delta & 0x7F)) mod 128`, so the fold is a single
+`wrapping_add` — no UTF-8 reconstruction, no decode, no encode (the stray carry
+bit is overwritten by the unconditional `0x80 |`). Because the output is never
+longer than the input, it runs fully in place in the input's own buffer; pure-ASCII
+input is lowercased there and returned in the same allocation. It shares `simple_fold`'s
+auto-vectorized ASCII pass (~46 GiB/s) and, since it emits one byte per
+character, runs *faster* than `simple_fold` on folding-heavy input (e.g. ~1.9
+vs ~1.3 GiB/s on length-changing folds, ~1.1 vs ~0.9 GiB/s on mixed BMP) and a
+little slower on pure-reject CJK/symbols due to character collapsing.
+
 ## Why does this crate exist?
 
 Unicode 16.0 defines 1484 simple-fold mappings. Common ways to store them:
@@ -68,7 +121,7 @@ query:
    `wrapping_add`, one 4-byte store — no decode, no encode. Writing fewer/more
    bytes than were read handles length-changing folds (`K`→`k`, `Ⱥ`→`ⱥ`).
 
-### Table layout (1776 B total)
+### Table layout (2014 B total)
 
 | Component                                       | Bytes |
 |-------------------------------------------------|------:|
@@ -78,8 +131,11 @@ query:
 | `RUN_END_LOW[238 + 8]: u8` (clean scan key, `end & 0x3F`; +8 SWAR pad) | 246 |
 | `RUN_START_STRIDE[238]: u8` (`start & 0x3F` \| stride bit) | 238 |
 | `BYTE_DELTA[238]: u32` (little-endian fold delta per run) | 952 |
-| **Total**                                       | **1776** |
+| `INDEX_DELTA[238]: u8` (7-bit per-run fold delta, `index_fold` only) | 238 |
+| **Total**                                       | **2014** |
 
+The `simple_fold` path uses 1776 B of this; the 238 B `INDEX_DELTA` side table
+powers [`index_fold`](#single-byte-index-fold) only.
 (Splitting runs at byte-delta boundaries raises the run count from 227 to 238.)
 The data file is parsed at build time by `build.rs`, which emits the packed
 `static` tables to `OUT_DIR/table.rs`.
diff --git a/crates/casefold/benchmarks/conversion.rs b/crates/casefold/benchmarks/conversion.rs
index 8a73c76..de7cab4 100644
--- a/crates/casefold/benchmarks/conversion.rs
+++ b/crates/casefold/benchmarks/conversion.rs
@@ -1,7 +1,8 @@
 //! Benchmarks for `casefold::simple_fold`, comparing it against several
-//! baselines on representative inputs. Each input is run through six variants:
+//! baselines on representative inputs. Each input is run through these variants:
 //!
 //! - `casefold::simple_fold` — the implementation under test.
+//! - `casefold::index_fold` — the one-byte-per-character index fold.
 //! - `HashMap::fold_into_bytes` — a HashMap-based case fold over raw UTF-8.
 //! - `str::to_lowercase` — straightforward Unicode lowercasing baseline.
 //! - `chars().flat_map(to_lowercase)` — the per-char flat-map variant.
@@ -13,7 +14,7 @@
 //! cases (e.g. `Σ` final-sigma context, `İ` → `i\u{0307}`). These benchmarks
 //! are about throughput on equivalent workloads, not output equality.
 
-use casefold::{simple_fold, utf8_len};
+use casefold::{index_fold, simple_fold, utf8_len};
 use casefold_benchmarks::{hashmap_fold_utf8, reference_map_utf8, FoldHashMap};
 use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion, Throughput};
 use std::hint::black_box;
@@ -156,6 +157,14 @@ fn bench_conversion(c: &mut Criterion, name: &str, input: &str) {
         },
     );
 
+    group.bench_function(BenchmarkId::new("Casefold::index_fold", input.len()), |b| {
+        b.iter_batched(
+            || input.to_string(),
+            |s| index_fold(black_box(s)),
+            criterion::BatchSize::SmallInput,
+        );
+    });
+
     let fold_map = reference_map_utf8();
     group.bench_function(
         BenchmarkId::new("HashMap::fold_into_bytes (UTF-8 u32)", input.len()),
diff --git a/crates/casefold/build.rs b/crates/casefold/build.rs
index 17a62e7..be90b22 100644
--- a/crates/casefold/build.rs
+++ b/crates/casefold/build.rs
@@ -307,9 +307,21 @@ fn emit_tables(folds: &[Fold], runs: &[Run]) -> String {
         .max()
         .unwrap_or(0);
 
+    // Parallel 7-bit index deltas, one per run, for `index_fold`. The fold
+    // collapses each code point to `cp & 0x7F`; by modular arithmetic the folded
+    // low-7-bit value is `((cp & 0x7F) + (delta & 0x7F)) & 0x7F`, so storing the
+    // code-point delta reduced mod 128 lets `index_fold` derive the folded index
+    // byte with one `wrapping_add` and no UTF-8 reconstruction. The high bit is
+    // set unconditionally at write time, absorbing any carry, so only 7 bits are stored.
+    let index_deltas: Vec<u8> = runs.iter().map(|r| (r.delta & 0x7F) as u8).collect();
+
     // Sanity: size accounting (printed as build warnings for visibility).
     let index_bytes = page_bitmap.len() * 8 + popcnt_samples.len() + page_offset.len();
-    let total = index_bytes + run_end_low.len() + run_start_stride.len() + byte_deltas.len() * 4;
+    let total = index_bytes
+        + run_end_low.len()
+        + run_start_stride.len()
+        + byte_deltas.len() * 4
+        + index_deltas.len();
     if env::var_os("CASEFOLD_BUILD_INFO").is_some() {
         println!(
             "cargo:warning=casefold table: {} fold entries, {} runs, {} populated pages, {} bytes total ({:.2} bits/entry), max |delta| = {}, max |byte_delta| = {}",
@@ -338,6 +350,7 @@ fn emit_tables(folds: &[Fold], runs: &[Run]) -> String {
     emit_u8_array(&mut s, "RUN_END_LOW", &run_end_low);
     emit_u8_array(&mut s, "RUN_START_STRIDE", &run_start_stride);
     emit_u32_array(&mut s, "BYTE_DELTA", &byte_deltas);
+    emit_u8_array(&mut s, "INDEX_DELTA", &index_deltas);
 
     s
 }
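
Review note on the `INDEX_DELTA` comment above: the modular-arithmetic claim can be checked in isolation. The sketch below assumes the run delta is a signed code-point difference (`i32`); the helper names `index_delta`, `folded_low7`, and `identity_holds` are hypothetical and not part of the build script.

```rust
/// Reduce a signed code-point delta to 7 bits (mirrors `(r.delta & 0x7F) as u8`,
/// ASSUMING `delta` is an `i32`).
fn index_delta(delta: i32) -> u8 {
    (delta & 0x7F) as u8
}

/// Low 7 bits of the folded code point, computed from the source low 7 bits
/// alone. The explicit `& 0x7F` is kept here for clarity; `index_fold` can
/// skip it because the carry bit is overwritten by `0x80 |`.
fn folded_low7(cp: u32, delta: i32) -> u8 {
    let src_low7 = (cp & 0x7F) as u8;
    src_low7.wrapping_add(index_delta(delta)) & 0x7F
}

/// Compare against folding the full code point first, then masking.
fn identity_holds(cp: u32, delta: i32) -> bool {
    let folded_cp = cp.wrapping_add(delta as u32);
    folded_low7(cp, delta) == (folded_cp & 0x7F) as u8
}
```

For example, U+212A KELVIN SIGN has a negative delta to `k` (U+006B), and `folded_low7(0x212A, 0x6B - 0x212A)` still comes out as `b'k'`, because masking a two's-complement negative with `0x7F` is the same as reducing it mod 128.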
diff --git a/crates/casefold/src/index_fold.rs b/crates/casefold/src/index_fold.rs
new file mode 100644
index 0000000..fddb2be
--- /dev/null
+++ b/crates/casefold/src/index_fold.rs
@@ -0,0 +1,287 @@
+//! Compact one-byte-per-character *index* fold, built on the same paged-bitmap
+//! run table as [`simple_fold`](crate::simple_fold).
+
+use crate::table::*;
+use crate::{popcount_up_to, scan_end_low};
+
+/// Consumes `s` and returns its simple case-folded form as a compact byte
+/// *index*: each character is folded with the same simple (1-to-1) fold as
+/// [`simple_fold`](crate::simple_fold), then collapsed to **exactly one byte**
+/// per input character.
+///
+/// Single-byte (ASCII) characters are emitted as their plain lowercased byte
+/// (high bit clear). Every multibyte character is replaced by the single byte
+/// `0x80 | (cp & 0x7F)`: the low 7 bits of its *folded* code point with the high
+/// bit set. The high bit is set unconditionally, so a multibyte character that
+/// folds to ASCII (e.g. U+212A KELVIN SIGN → `k`) still yields a high-bit byte
+/// (`0x80 | b'k'`), not the bare ASCII byte.
+///
+/// The result has one byte per character and is therefore **not** guaranteed
+/// to be valid UTF-8. It is intended as a cheap, fixed-width key for
+/// case-insensitive indexing or hashing where collisions between code points
+/// sharing the same low 7 bits are acceptable.
+///
+/// Because every character collapses to exactly one byte, the output is never
+/// longer than the input; pure-ASCII input is folded in place (the input's heap
+/// buffer is reused, never reallocated), and once a multibyte character is hit the
+/// remainder is rewritten in place with a write cursor that never overtakes the
+/// read cursor, so no second buffer is ever allocated.
+///
+/// Like [`simple_fold`](crate::simple_fold), characters are never fully decoded
+/// and the fold needs no UTF-8 reconstruction: the page coordinates come from
+/// the lead/continuation bytes, and on a fold hit the folded low 7 bits are
+/// `(cp & 0x7F)` plus the run's 7-bit `INDEX_DELTA`, taken mod 128.
+///
+/// # Example
+///
+/// ```
+/// use casefold::index_fold;
+/// assert_eq!(index_fold("Hi!".to_string()), b"hi!");
+/// // U+212A KELVIN SIGN folds to ASCII 'k', but the high bit is still set:
+/// assert_eq!(index_fold("\u{212A}".to_string()), &[0x80 | b'k']);
+/// // 'Ü' (U+00DC) folds to 'ü' (U+00FC); 0x80 | (0xFC & 0x7F) == 0xFC:
+/// assert_eq!(index_fold("Ü".to_string()), &[0xFC]);
+/// ```
+pub fn index_fold(s: String) -> Vec<u8> {
+    let mut bytes = s.into_bytes();
+    // Tier 1 — vectorizable straight-through pass (identical to `fold_into_bytes`):
+    // lowercase every ASCII A..Z byte in place and OR all bytes together so a
+    // single sign-bit test tells us whether any multibyte sequence is present.
+    let mut high_bit_acc: u8 = 0;
+    for b in &mut bytes {
+        high_bit_acc |= *b;
+        let is_upper = b.wrapping_sub(b'A') < 26;
+        *b |= u8::from(is_upper) << 5;
+    }
+    if high_bit_acc & 0x80 == 0 {
+        // Pure ASCII: already folded in place, one byte per character.
+        return bytes;
+    }
+    // Tier 2 — collapse each character to one index byte, in place. The ASCII
+    // prefix (already lowercased above, one byte per char) is left untouched;
+    // from the first non-ASCII byte we rewrite with a `write` cursor that, since
+    // every character yields exactly one byte from its >= 1 source bytes, never
+    // overtakes `read`.
+    let first_non_ascii = bytes
+        .iter()
+        .position(|&b| b & 0x80 != 0)
+        .expect("a non-ASCII byte exists (the high-bit accumulator was set)");
+    let mut write = first_non_ascii;
+    let mut read = first_non_ascii;
+    while read < bytes.len() {
+        let lead = bytes[read];
+        // ASCII (already lowercased by tier 1): copy through as a single byte.
+        if lead & 0x80 == 0 {
+            bytes[write] = lead;
+            write += 1;
+            read += 1;
+            continue;
+        }
+        // Multibyte: recover the `PAGE_BITMAP` coordinates of `cp >> 6`
+        // directly as `(word_idx, bit_idx)` — the high part `cp >> 12` indexes
+        // the bitmap word, the next 6 bits `(cp >> 6) & 63` index the bit —
+        // without ever materializing the combined page number.
+        let (word_idx, bit_idx, c_len) = if lead < 0xE0 {
+            (0usize, (lead & 0x1F) as u32, 2usize)
+        } else if lead < 0xF0 {
+            ((lead & 0x0F) as usize, (bytes[read + 1] & 0x3F) as u32, 3)
+        } else {
+            (
+                (((lead & 0x07) as usize) << 6) | (bytes[read + 1] & 0x3F) as usize,
+                (bytes[read + 2] & 0x3F) as u32,
+                4,
+            )
+        };
+        let low_v = bytes[read + c_len - 1] & 0x3F;
+        // The source code point's low 7 bits, `cp & 0x7F`, as `((cp >> 6) & 1)
+        // << 6 | (cp & 0x3F)`: `bit_idx`'s low bit is `(cp >> 6) & 1`. We don't
+        // mask `bit_idx` to one bit — its higher bits land in output bit 7+,
+        // which the unconditional `0x80 |` at write time overwrites anyway.
+        let mut folded_index = ((bit_idx << 6) as u8) | low_v;
+        if word_idx < PAGE_BITMAP.len() && (PAGE_BITMAP[word_idx] >> bit_idx) & 1 != 0 {
+            let dense = popcount_up_to(word_idx, bit_idx) as usize;
+            let lo = PAGE_OFFSET[dense] as usize;
+            let n = PAGE_OFFSET[dense + 1] as usize - lo;
+            let off = scan_end_low(lo, n, low_v);
+            if off < n {
+                let ss = RUN_START_STRIDE[lo + off];
+                let start_low = ss & 0x3F;
+                let stride_bit = ss >> 6;
+                if low_v >= start_low && ((low_v - start_low) & stride_bit) == 0 {
+                    // Folding character: by modular arithmetic the folded low 7
+                    // bits are `((cp & 0x7F) + (delta & 0x7F)) mod 128`, so adding
+                    // the run's 7-bit `INDEX_DELTA` yields them directly — no UTF-8
+                    // reconstruction. The add may carry into bit 7, but that bit
+                    // is overwritten by `0x80 |` below, so no `& 0x7F` is needed.
+                    folded_index = folded_index.wrapping_add(INDEX_DELTA[lo + off]);
+                }
+            }
+        }
+        // `write <= read` here, and the source bytes this character needs were
+        // all read above, so storing the single index byte never clobbers
+        // bytes still to be read. The high bit always marks a multibyte origin.
+        bytes[write] = 0x80 | folded_index;
+        write += 1;
+        read += c_len;
+    }
+    bytes.truncate(write);
+    bytes
+}
+
+/// Folds a single `char` to its one-byte [`index_fold`] representation.
+///
+/// Equivalent to the per-character output of [`index_fold`]: an ASCII `char`
+/// yields its lowercased byte (high bit clear); any other `char` yields
+/// `0x80 | (cp & 0x7F)` of its *folded* code point (high bit set), including a
+/// multibyte `char` that folds to ASCII (e.g. U+212A KELVIN SIGN → `0x80 | b'k'`).
+///
+/// # Example
+///
+/// ```
+/// use casefold::index_fold_char;
+/// assert_eq!(index_fold_char('A'), b'a');
+/// assert_eq!(index_fold_char('Ü'), 0xFC); // ü → 0x80 | (0xFC & 0x7F)
+/// assert_eq!(index_fold_char('中'), 0x80 | 0x2D);
+/// ```
+pub fn index_fold_char(c: char) -> u8 {
+    let cp = c as u32;
+    if cp < 0x80 {
+        // ASCII: lowercase A..Z (a no-op otherwise), high bit stays clear.
+        let b = cp as u8;
+        let is_upper = b.wrapping_sub(b'A') < 26;
+        return b | (u8::from(is_upper) << 5);
+    }
+    // Multibyte: the `PAGE_BITMAP` coordinates of `cp >> 6` are `word_idx =
+    // cp >> 12` (the bitmap word) and `bit_idx = (cp >> 6) & 63` (the bit).
+    let word_idx = (cp >> 12) as usize;
+    let bit_idx = (cp >> 6) & 0x3F;
+    let low_v = (cp & 0x3F) as u8;
+    // `cp & 0x7F` is the source low 7 bits; a fold adds the run's 7-bit delta.
+    let mut folded_index = (cp & 0x7F) as u8;
+    if word_idx < PAGE_BITMAP.len() && (PAGE_BITMAP[word_idx] >> bit_idx) & 1 != 0 {
+        let dense = popcount_up_to(word_idx, bit_idx) as usize;
+        let lo = PAGE_OFFSET[dense] as usize;
+        let n = PAGE_OFFSET[dense + 1] as usize - lo;
+        let off = scan_end_low(lo, n, low_v);
+        if off < n {
+            let ss = RUN_START_STRIDE[lo + off];
+            let start_low = ss & 0x3F;
+            let stride_bit = ss >> 6;
+            if low_v >= start_low && ((low_v - start_low) & stride_bit) == 0 {
+                folded_index = folded_index.wrapping_add(INDEX_DELTA[lo + off]);
+            }
+        }
+    }
+    0x80 | folded_index
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+    use crate::test_support::reference;
+    use std::collections::HashMap;
+
+    /// Per-character index fold via the reference map: fold each char, then
+    /// collapse it to one byte the same way [`index_fold`] does. The high bit is
+    /// set for every multibyte (source `cp >= 0x80`) character, even one that
+    /// folds to ASCII.
+    fn index_fold_oracle(r: &HashMap<u32, u32>, s: &str) -> Vec<u8> {
+        let mut out = Vec::new();
+        for c in s.chars() {
+            let cp = c as u32;
+            let folded = r.get(&cp).copied().unwrap_or(cp);
+            if cp < 0x80 {
+                out.push(folded as u8);
+            } else {
+                out.push(0x80 | (folded & 0x7F) as u8);
+            }
+        }
+        out
+    }
+
+    #[test]
+    fn index_fold_ascii() {
+        assert_eq!(index_fold(String::new()), b"");
+        assert_eq!(index_fold("Hello, WORLD!".into()), b"hello, world!");
+        assert_eq!(index_fold("abc 123 XYZ".into()), b"abc 123 xyz");
+    }
+
+    #[test]
+    fn index_fold_reuses_buffer_for_ascii_input() {
+        // Pure-ASCII input is folded in place; the returned Vec must hold the
+        // exact same allocation as the input String.
+        let s = "MIXED case AsCiI 12345".to_string();
+        let original_ptr = s.as_ptr();
+        let out = index_fold(s);
+        assert_eq!(out, b"mixed case ascii 12345");
+        assert_eq!(out.as_ptr(), original_ptr);
+    }
+
+    #[test]
+    fn index_fold_multibyte_to_single_byte() {
+        // Ü (U+00DC) folds to ü (U+00FC); 0x80 | (0xFC & 0x7F) == 0xFC.
+        assert_eq!(index_fold("Ü".into()), vec![0xFC]);
+        // Length-preserving fold of three 2-byte chars to one byte each.
+        assert_eq!(
+            index_fold("ÄÖÜ".into()),
+            vec![0x80 | 0x64, 0x80 | 0x76, 0xFC]
+        );
+        // Fold to ASCII keeps the high bit set: U+212A KELVIN SIGN -> 'k'.
+        assert_eq!(
+            index_fold("\u{212A}elvin".into()),
+            vec![0x80 | b'k', b'e', b'l', b'v', b'i', b'n'],
+        );
+        // Growing fold U+023A -> U+2C65: 0x80 | (0x2C65 & 0x7F) == 0xE5.
+        assert_eq!(index_fold("\u{023A}".into()), vec![0x80 | 0x65]);
+        // Non-folding multibyte still collapses to its low 7 bits.
+        assert_eq!(index_fold("中".into()), vec![0x80 | 0x2D]);
+    }
+
+    #[test]
+    fn index_fold_matches_reference_map() {
+        let r = reference();
+        let input = "Quick BROWN Fox 🦊 ÜBER Größe ΣΟΦΙΑ \u{0130}\u{023A}漢";
+        assert_eq!(index_fold(input.to_string()), index_fold_oracle(&r, input));
+    }
+
+    #[test]
+    fn index_fold_matches_reference_map_exhaustive() {
+        // Drive every non-ASCII Unicode scalar value through the byte-oriented
+        // index path and cross-check against the reference fold map.
+        let r = reference();
+        let mut input = String::from("X");
+        for cp in 0x80..0x110000u32 {
+            if (0xD800..0xE000).contains(&cp) {
+                continue; // surrogates aren't valid chars
+            }
+            input.push(char::from_u32(cp).expect("cp is a valid non-surrogate char"));
+        }
+        let expected = index_fold_oracle(&r, &input);
+        assert_eq!(index_fold(input), expected);
+    }
+
+    #[test]
+    fn index_fold_char_examples() {
+        assert_eq!(index_fold_char('A'), b'a');
+        assert_eq!(index_fold_char('!'), b'!');
+        assert_eq!(index_fold_char('Ü'), 0xFC);
+        assert_eq!(index_fold_char('中'), 0x80 | 0x2D);
+        // Fold to ASCII keeps the high bit set.
+        assert_eq!(index_fold_char('\u{212A}'), 0x80 | b'k');
+    }
+
+    #[test]
+    fn index_fold_char_matches_index_fold_exhaustive() {
+        // Every code point's single-char `index_fold_char` must equal the lone
+        // byte `index_fold` produces for that character.
+        for cp in 0u32..0x110000 {
+            if (0xD800..0xE000).contains(&cp) {
+                continue; // surrogates aren't valid chars
+            }
+            let c = char::from_u32(cp).expect("cp is a valid non-surrogate char");
+            let folded = index_fold(c.to_string());
+            assert_eq!(folded.len(), 1, "cp {cp:#x} did not yield one byte");
+            assert_eq!(index_fold_char(c), folded[0], "cp {cp:#x}");
+        }
+    }
+}
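
Review note on the tier-2 loop above: the in-place rewrite relies on the write cursor never overtaking the read cursor. The sketch below isolates that idea under simplifying assumptions: it performs no case folding and no table lookup, and `collapse_in_place` is a hypothetical name, not a function in this crate.

```rust
/// Collapse valid UTF-8 to one byte per character, reusing the input buffer.
/// ASCII bytes pass through; every multibyte character becomes
/// `0x80 | (cp & 0x7F)` of its (unfolded) code point.
fn collapse_in_place(s: String) -> Vec<u8> {
    let mut bytes = s.into_bytes();
    let (mut read, mut write) = (0usize, 0usize);
    while read < bytes.len() {
        let lead = bytes[read];
        // Sequence length from the lead byte (the input is valid UTF-8).
        let len = if lead < 0x80 {
            1
        } else if lead < 0xE0 {
            2
        } else if lead < 0xF0 {
            3
        } else {
            4
        };
        let out = if len == 1 {
            lead
        } else {
            // Bits 0..=5 of `cp` are the payload of the last byte; bit 6 is
            // the lowest payload bit of the byte before it (the lead byte
            // for 2-byte sequences).
            let last = bytes[read + len - 1] & 0x3F;
            let prev = bytes[read + len - 2];
            0x80 | ((prev & 1) << 6) | last
        };
        // All source bytes of this character were read above, and
        // `write <= read`, so this store cannot clobber unread input.
        bytes[write] = out;
        write += 1;
        read += len;
    }
    bytes.truncate(write);
    bytes
}
```

Each character consumes at least one byte and emits exactly one, which is why a single buffer suffices. `index_fold` adds the ASCII fast path and the `INDEX_DELTA` lookup on top of this cursor scheme.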
diff --git a/crates/casefold/src/lib.rs b/crates/casefold/src/lib.rs
index f826fe1..31ccf3b 100644
--- a/crates/casefold/src/lib.rs
+++ b/crates/casefold/src/lib.rs
@@ -70,201 +70,12 @@ mod table {
     include!(concat!(env!("OUT_DIR"), "/table.rs"));
 }
 
-use table::*;
-
-/// Consumes `s` and returns its simple case-folded form as a `String`. The
-/// input's heap buffer is reused untouched whenever folding changes no bytes —
-/// that covers pure-ASCII / already-lowercase input (folded in place) *and*
-/// any input whose multibyte characters never fold (CJK, Hangul, Kana,
-/// Arabic, Hebrew, Indic, symbols, …). A fresh buffer is allocated only once
-/// an actual case fold is encountered; from there, unmodified spans are
-/// bulk-copied and folded characters are re-encoded in between.
-///
-/// Folds may shrink (e.g. U+212A KELVIN SIGN is 3 bytes but folds to `k` =
-/// 1 byte) or grow (e.g. U+023A `Ⱥ` is 2 bytes but folds to U+2C65 `ⱥ` =
-/// 3 bytes), so in-place rewriting isn't possible in general — but inputs that
-/// don't fold at all skip the second buffer entirely.
-///
-/// Only **simple** (1-to-1) folds are applied; multi-character folds such as
-/// `ß` → `ss` and Turkic locale folds are left unchanged.
-///
-/// # Example
-///
-/// ```
-/// use casefold::simple_fold;
-/// assert_eq!(simple_fold("Hello, WORLD!".to_string()), "hello, world!");
-/// assert_eq!(simple_fold("ÜBER".to_string()), "über");
-/// // Length-changing fold (U+212A KELVIN SIGN → U+006B, 3 bytes → 1 byte):
-/// assert_eq!(simple_fold("\u{212A}elvin".to_string()), "kelvin");
-/// ```
-pub fn simple_fold(s: String) -> String {
-    // SAFETY: `fold_into_bytes` only lowercases ASCII bytes in place and
-    // re-encodes whole characters through the fold table, so its output is
-    // always valid UTF-8 (see the exhaustive round-trip test).
-    unsafe { String::from_utf8_unchecked(fold_into_bytes(s)) }
-}
-
-/// Byte-level core of [`simple_fold`]. Returns the fold as a `Vec<u8>` that is
-/// always valid UTF-8; see [`simple_fold`] for the allocation behavior.
-fn fold_into_bytes(s: String) -> Vec<u8> {
-    let mut bytes = s.into_bytes();
-    // Tier 1 — full straight-through pass: lowercase every ASCII A..Z byte
-    // in place (a no-op on any non-ASCII byte, since `b.wrapping_sub(b'A')`
-    // is ≥ 26 for every byte outside 0x41..0x5A), and OR all bytes together
-    // so a single sign-bit test afterwards tells us whether the input
-    // contained any multibyte UTF-8 sequences. No early `break`, no
-    // input-dependent control flow — LLVM auto-vectorizes the loop.
-    let mut high_bit_acc: u8 = 0;
-    for b in &mut bytes {
-        high_bit_acc |= *b;
-        let is_upper = b.wrapping_sub(b'A') < 26;
-        *b |= u8::from(is_upper) << 5;
-    }
-    if high_bit_acc & 0x80 == 0 {
-        return bytes;
-    }
-    // Non-ASCII bytes are present. Locate the first one (SIMD-fast via
-    // `position`/memchr) and hand off to the UTF-8 path from there — the
-    // ASCII prefix is already lowercased and folding is idempotent on
-    // lower-case ASCII, so skipping it is purely an optimization.
-    let first_non_ascii = bytes
-        .iter()
-        .position(|&b| b & 0x80 != 0)
-        .expect("a non-ASCII byte exists (the high-bit accumulator was set)");
-    fold_non_ascii_tail(bytes, first_non_ascii)
-}
+mod index_fold;
+mod simple_fold;
+pub use index_fold::{index_fold, index_fold_char};
+pub use simple_fold::simple_fold;
 
-/// Tier 2 — copy-on-fold UTF-8 path. Scans the non-ASCII tail of the
-/// already-(ASCII-)lowercased `bytes` for the first character that actually
-/// folds to *different* bytes. Until one is found nothing is copied, so an
-/// input whose multibyte content never folds is returned in its original
-/// allocation untouched. Once a folding character is hit, a fresh buffer is
-/// allocated and the rest is built by bulk-copying each contiguous unmodified
-/// span and re-encoding the folded characters in between. The returned bytes
-/// are always valid UTF-8.
-///
-/// Characters are never fully decoded: the page index (`cp >> 6`) comes from
-/// the first one or two bytes for the `PAGE_BITMAP` reject, and on a page hit
-/// the remaining `cp & 0x3F` is read directly from the final byte to drive the
-/// within-page run search and byte-delta fold — no code-point reconstruction.
-fn fold_non_ascii_tail(bytes: Vec<u8>, start: usize) -> Vec<u8> {
-    let mut out: Vec<u8> = Vec::new();
-    let src = bytes.as_ptr();
-    // Raw write cursor into `out`'s buffer. Null until the first real fold
-    // allocates `out` (its pointer is then non-null), so `dst.is_null()` doubles
-    // as the "haven't started building the output yet" flag. We bypass the Vec
-    // push/reserve API: the buffer is reserved once for the worst case, so every
-    // copy/store below is unchecked.
-    let mut dst: *mut u8 = core::ptr::null_mut();
-    // `flushed` marks the start of the contiguous run of `bytes` that is
-    // already correct but not yet copied out.
-    let mut flushed = 0usize;
-    let mut read = start;
-    while read < bytes.len() {
-        // ASCII (already lowercased by pass 1) — unchanged, keep scanning.
-        let lead = bytes[read];
-        if lead & 0x80 == 0 {
-            read += 1;
-            continue;
-        }
-        // Page-precision reject probe (see the module docs).
-        let (page, c_len) = if lead < 0xE0 {
-            ((lead & 0x1F) as u32, 2usize)
-        } else if lead < 0xF0 {
-            (
-                (((lead & 0x0F) as u32) << 6) | (bytes[read + 1] & 0x3F) as u32,
-                3,
-            )
-        } else {
-            (
-                (((lead & 0x07) as u32) << 12)
-                    | (((bytes[read + 1] & 0x3F) as u32) << 6)
-                    | (bytes[read + 2] & 0x3F) as u32,
-                4,
-            )
-        };
-        let word_idx = (page >> 6) as usize;
-        if word_idx >= PAGE_BITMAP.len() || (PAGE_BITMAP[word_idx] >> (page & 63)) & 1 == 0 {
-            read += c_len;
-            continue;
-        }
-        let low_v = bytes[read + c_len - 1] & 0x3F;
-        let dense = popcount_up_to(page) as usize;
-        let lo = PAGE_OFFSET[dense] as usize;
-        let n = PAGE_OFFSET[dense + 1] as usize - lo;
-        let off = scan_end_low(lo, n, low_v);
-        let idx = if off < n {
-            // The scan guarantees `low_v <= end_low`; the run covers `low_v`
-            // iff `low_v >= start_low` (and, for stride 2, the offset is even).
-            // No code-point reconstruction — `low_v` is compared directly.
-            let ss = RUN_START_STRIDE[lo + off];
-            let start_low = ss & 0x3F;
-            let stride_bit = ss >> 6;
-            if low_v < start_low || ((low_v - start_low) & stride_bit) != 0 {
-                read += c_len;
-                continue;
-            }
-            lo + off
-        } else {
-            read += c_len;
-            continue;
-        };
-        // Load the character's bytes as a little-endian u32, mask off the lanes
-        // past it, add the run's constant byte delta. Over-reading 4 bytes is
-        // safe except within ≤3 bytes of the buffer end; the variable-length
-        // fallback there is far slower (a `memcpy` call per fold), so the fast
-        // path is worth the branch.
-        let raw = if read + 4 <= bytes.len() {
-            u32::from_le_bytes(bytes[read..read + 4].try_into().expect("4-byte slice"))
-        } else {
-            let mut w = [0u8; 4];
-            w[..c_len].copy_from_slice(&bytes[read..read + c_len]);
-            u32::from_le_bytes(w)
-        };
-        let word = raw & (u32::MAX >> ((4 - c_len) * 8));
-        let folded = word.wrapping_add(BYTE_DELTA[idx]);
-        let dest_len = utf8_len((folded & 0xFF) as u8);
-        if dst.is_null() {
-            // Reserve once for the worst case so the writes below never need a
-            // per-store capacity check. Output is at most 1.5× the input: the
-            // only folds that grow are U+023A/U+023E (2→3 bytes), so every 2
-            // input bytes yield ≤3 output bytes; `+ 4` covers the 4-byte
-            // over-store of the final character. The non-zero capacity makes
-            // `out.as_mut_ptr()` non-null, so `dst` is non-null from here on.
-            out = Vec::with_capacity(bytes.len() + bytes.len() / 2 + 4);
-            dst = out.as_mut_ptr();
-        }
-        // SAFETY: the buffer is reserved for the worst-case 1.5× output plus 4
-        // bytes of over-store headroom, so `dst` (the running output length)
-        // plus the 4-byte store stays in bounds for every iteration. `src` and
-        // `dst` are distinct allocations.
-        unsafe {
-            let run = read - flushed;
-            if run != 0 {
-                core::ptr::copy_nonoverlapping(src.add(flushed), dst, run);
-                dst = dst.add(run);
-            }
-            // Store a full 4-byte word, advance only by the real folded length.
-            dst.cast::<u32>().write_unaligned(folded.to_le());
-            dst = dst.add(dest_len);
-        }
-        read += c_len;
-        flushed = read;
-    }
-    if dst.is_null() {
-        // Nothing folded — return the original buffer with no extra copy.
-        return bytes;
-    }
-    // SAFETY: the trailing unmodified run fits in the reserved buffer; `dst`
-    // minus the base pointer is the total number of bytes written.
-    unsafe {
-        let tail = bytes.len() - flushed;
-        core::ptr::copy_nonoverlapping(src.add(flushed), dst, tail);
-        dst = dst.add(tail);
-        out.set_len(dst as usize - out.as_ptr() as usize);
-    }
-    out
-}
+use table::*;
 
 /// Number of bytes in the UTF-8 sequence whose lead byte is `lead`.
 ///
@@ -289,6 +100,7 @@ const fn table_size_bytes() -> usize {
         + RUN_END_LOW.len()
         + RUN_START_STRIDE.len()
         + BYTE_DELTA.len() * 4
+        + INDEX_DELTA.len()
 }
 
 // ---- Paged bitmap lookup ------------------------------------------------
@@ -310,13 +122,12 @@ const fn table_size_bytes() -> usize {
 //   RUN_START_STRIDE[i] = (start & PAGE_MASK) | ((stride - 1) << 6)
 //                                              (membership, vs `cp & 0x3F`)
 
-/// Number of populated pages strictly before `page`.
+/// Number of populated pages strictly before the page located at
+/// `PAGE_BITMAP[word_idx]` bit `bit_idx`.
 #[inline]
-fn popcount_up_to(page: u32) -> u32 {
-    let word_idx = (page / 64) as usize;
-    let bit_in_word = page % 64;
+fn popcount_up_to(word_idx: usize, bit_idx: u32) -> u32 {
     let base = POPCNT_SAMPLES[word_idx] as u32;
-    let partial = PAGE_BITMAP[word_idx] & ((1u64 << bit_in_word).wrapping_sub(1));
+    let partial = PAGE_BITMAP[word_idx] & ((1u64 << bit_idx).wrapping_sub(1));
     base + partial.count_ones()
 }
 
@@ -353,12 +164,13 @@ fn scan_end_low(lo: usize, n: usize, low_v: u8) -> usize {
 }
 
 #[cfg(test)]
-mod tests {
-    use super::*;
+pub(crate) mod test_support {
     use std::collections::HashMap;
     use std::fs;
 
-    fn reference() -> HashMap<u32, u32> {
+    /// Parse `data/CaseFolding.txt` into a simple-fold map (statuses `C` and
+    /// `S`), shared by the `simple_fold` and `index_fold` cross-check tests.
+    pub(crate) fn reference() -> HashMap<u32, u32> {
         let text = fs::read_to_string("data/CaseFolding.txt").expect("CaseFolding.txt");
         let mut out = HashMap::new();
         for raw in text.lines() {
@@ -390,18 +202,11 @@ mod tests {
         }
         out
     }
+}
 
-    /// Per-character fold via the reference map, used as the oracle for the
-    /// byte-oriented `fold_into_bytes` cross-checks below.
-    fn fold_oracle(r: &HashMap<u32, u32>, s: &str) -> Vec<u8> {
-        let mut out = String::new();
-        for c in s.chars() {
-            let cp = c as u32;
-            let folded = r.get(&cp).copied().unwrap_or(cp);
-            out.push(char::from_u32(folded).expect("reference fold is a valid char"));
-        }
-        out.into_bytes()
-    }
+#[cfg(test)]
+mod tests {
+    use super::*;
 
     #[test]
     fn table_is_compact() {
@@ -411,140 +216,4 @@ mod tests {
         eprintln!("table size: {sz} bytes for {NUM_FOLD_ENTRIES} entries");
         assert!(sz < 2400, "table size {sz} exceeds 2400 B budget");
     }
-
-    #[test]
-    fn fold_into_bytes_ascii() {
-        assert_eq!(fold_into_bytes(String::new()), b"");
-        assert_eq!(fold_into_bytes("Hello, WORLD!".into()), b"hello, world!");
-        assert_eq!(fold_into_bytes("abc 123 XYZ".into()), b"abc 123 xyz");
-    }
-
-    #[test]
-    fn simple_fold_returns_string() {
-        // Public `String` wrapper: ASCII, length-preserving, shrinking and
-        // growing folds all yield valid UTF-8.
-        assert_eq!(simple_fold("Hello, WORLD!".to_string()), "hello, world!");
-        assert_eq!(simple_fold("ÜBER Größe".to_string()), "über größe");
-        assert_eq!(simple_fold("\u{212A}elvin".to_string()), "kelvin");
-        assert_eq!(simple_fold("abc\u{023A}".to_string()), "abc\u{2C65}");
-        // Non-folding multibyte content is returned unchanged.
-        assert_eq!(simple_fold("漢字 שלום".to_string()), "漢字 שלום");
-    }
-
-    #[test]
-    fn fold_into_bytes_ascii_then_utf8_handoff() {
-        // ASCII prefix gets lowercased by the tier-1 loop, then control
-        // hands off to the tier-2 reallocating UTF-8 path at the first
-        // multibyte lead.
-        assert_eq!(
-            fold_into_bytes("MIXED Größe TEXT".into()),
-            "mixed größe text".as_bytes(),
-        );
-        // ASCII prefix, then a *shrinking* fold inside the tail.
-        assert_eq!(fold_into_bytes("LORD \u{212A}elvin".into()), b"lord kelvin",);
-        // ASCII prefix, then a *growing* fold.
-        assert_eq!(
-            fold_into_bytes("abc\u{023A}".into()),
-            "abc\u{2C65}".as_bytes(),
-        );
-    }
-
-    #[test]
-    fn fold_into_bytes_length_preserving_bmp() {
-        assert_eq!(fold_into_bytes("ÄÖÜ".into()), "äöü".as_bytes());
-        assert_eq!(fold_into_bytes("ΑΒΓ".into()), "αβγ".as_bytes());
-        assert_eq!(fold_into_bytes("漢字".into()), "漢字".as_bytes());
-    }
-
-    #[test]
-    fn fold_into_bytes_reuses_buffer_for_ascii_input() {
-        // Pure-ASCII inputs are lowercased in place — the returned Vec must
-        // hold the exact same allocation as the input String.
-        let s = "MIXED case AsCiI 12345".to_string();
-        let original_ptr = s.as_ptr();
-        let out = fold_into_bytes(s);
-        assert_eq!(out, b"mixed case ascii 12345");
-        assert_eq!(out.as_ptr(), original_ptr);
-    }
-
-    #[test]
-    fn fold_into_bytes_reuses_buffer_for_nonfolding_nonascii() {
-        // Non-ASCII content that never folds (CJK + Hebrew) plus ASCII upper
-        // case: the ASCII is lowercased in place and, because no multibyte
-        // character folds, the original allocation is handed back with no
-        // second buffer — same pointer as the input String.
-        let s = "HELLO 日本語 שלום WORLD".to_string();
-        let original_ptr = s.as_ptr();
-        let out = fold_into_bytes(s);
-        assert_eq!(out, "hello 日本語 שלום world".as_bytes());
-        assert_eq!(out.as_ptr(), original_ptr);
-    }
-
-    #[test]
-    fn fold_into_bytes_handles_shrinking_fold() {
-        // U+212A KELVIN SIGN (3 bytes) folds to U+006B 'k' (1 byte).
-        assert_eq!(fold_into_bytes("\u{212A}elvin".into()), b"kelvin");
-        // Shrink inside a longer string.
-        let out = fold_into_bytes("LORD \u{212A}elvin RULES".into());
-        assert_eq!(out, b"lord kelvin rules");
-        // U+2126 OHM SIGN (3 bytes) folds to U+03C9 'ω' (2 bytes).
-        assert_eq!(fold_into_bytes("\u{2126}".into()), "\u{03C9}".as_bytes());
-    }
-
-    #[test]
-    fn fold_into_bytes_handles_growing_fold() {
-        // The Unicode 16.0 simple-fold table has exactly two folds that
-        // grow in UTF-8 length (verified by scanning CaseFolding.txt):
-        // U+023A → U+2C65 and U+023E → U+2C66, both 2 B → 3 B.
-
-        // U+023A 'Ⱥ' is 2 bytes, folds to U+2C65 'ⱥ' = 3 bytes.
-        assert_eq!(fold_into_bytes("\u{023A}".into()), "\u{2C65}".as_bytes());
-        // U+023E 'Ⱦ' is 2 bytes, folds to U+2C66 'ⱦ' = 3 bytes.
-        assert_eq!(fold_into_bytes("\u{023E}".into()), "\u{2C66}".as_bytes());
-
-        // Each one mid-string, with mixed length-preserving context on both
-        // sides so that the bail-out path also copies a prefix that already
-        // contains a length-preserving rewrite.
-        let out = fold_into_bytes("ABC\u{023A}xyz".into());
-        assert_eq!(out, "abc\u{2C65}xyz".as_bytes());
-        let out = fold_into_bytes("ABC\u{023E}xyz".into());
-        assert_eq!(out, "abc\u{2C66}xyz".as_bytes());
-
-        // Both growing folds inside the same string: the second one occurs
-        // after we have already switched to the allocating buffer.
-        let out = fold_into_bytes("\u{023A}\u{023E}".into());
-        assert_eq!(out, "\u{2C65}\u{2C66}".as_bytes());
-
-        // Mixed: a length-preserving fold, then a shrinking fold, then both
-        // growing folds — exercises every branch in one input.
-        let out = fold_into_bytes("Ä\u{212A}\u{023A}\u{023E}".into());
-        assert_eq!(out, "ä\u{006B}\u{2C65}\u{2C66}".as_bytes());
-    }
-
-    #[test]
-    fn fold_into_bytes_matches_reference_map() {
-        // Cross-check against the reference fold map on a varied input.
-        let r = reference();
-        let input = "Quick BROWN Fox 🦊 ÜBER Größe ΣΟΦΙΑ \u{0130}\u{023A}漢";
-        assert_eq!(fold_into_bytes(input.to_string()), fold_oracle(&r, input));
-    }
-
-    #[test]
-    fn fold_into_bytes_matches_reference_map_exhaustive() {
-        // Drive every assigned code point through the byte-oriented fold path
-        // and cross-check against the reference fold map. This guarantees the
-        // UTF-8 lead-byte reject filter never skips a code point that actually
-        // folds (a false reject would corrupt output here). A leading 'X'
-        // forces the tier-2 UTF-8 tail to run from the very first char.
-        let r = reference();
-        let mut input = String::from("X");
-        for cp in 0x80..0x110000u32 {
-            if (0xD800..0xE000).contains(&cp) {
-                continue; // surrogates aren't valid chars
-            }
-            input.push(char::from_u32(cp).expect("cp is a valid non-surrogate char"));
-        }
-        let expected = fold_oracle(&r, &input);
-        assert_eq!(fold_into_bytes(input), expected);
-    }
 }
diff --git a/crates/casefold/src/simple_fold.rs b/crates/casefold/src/simple_fold.rs
new file mode 100644
index 0000000..81417de
--- /dev/null
+++ b/crates/casefold/src/simple_fold.rs
@@ -0,0 +1,352 @@
+//! Unicode simple case-folding to a `String`, built on the shared paged-bitmap
+//! run table.
+
+use crate::table::*;
+use crate::{popcount_up_to, scan_end_low, utf8_len};
+
+/// Consumes `s` and returns its simple case-folded form as a `String`. The
+/// input's heap buffer is reused whenever no multibyte character folds: that
+/// covers pure-ASCII input (lowercased in place) *and* any input whose
+/// multibyte characters never fold (CJK, Hangul, Kana, Arabic, Hebrew, Indic,
+/// symbols, …). A fresh buffer is allocated only once a multibyte character
+/// actually folds; from there, unmodified spans are bulk-copied and folded
+/// characters are re-encoded in between.
+///
+/// Folds may shrink (e.g. U+212A KELVIN SIGN is 3 bytes but folds to `k` =
+/// 1 byte) or grow (e.g. U+023A `Ⱥ` is 2 bytes but folds to U+2C65 `ⱥ` =
+/// 3 bytes), so in-place rewriting isn't possible in general — but inputs that
+/// don't fold at all skip the second buffer entirely.
+///
+/// Only **simple** (1-to-1) folds are applied; multi-character folds such as
+/// `ß` → `ss` and Turkic locale folds are left unchanged.
+///
+/// # Example
+///
+/// ```
+/// use casefold::simple_fold;
+/// assert_eq!(simple_fold("Hello, WORLD!".to_string()), "hello, world!");
+/// assert_eq!(simple_fold("ÜBER".to_string()), "über");
+/// // Length-changing fold (U+212A KELVIN SIGN → U+006B, 3 bytes → 1 byte):
+/// assert_eq!(simple_fold("\u{212A}elvin".to_string()), "kelvin");
+/// ```
+pub fn simple_fold(s: String) -> String {
+    // SAFETY: `fold_into_bytes` only lowercases ASCII bytes in place and
+    // re-encodes whole characters through the fold table, so its output is
+    // always valid UTF-8 (see the exhaustive reference cross-check test).
+    unsafe { String::from_utf8_unchecked(fold_into_bytes(s)) }
+}
+
+/// Byte-level core of [`simple_fold`]. Returns the fold as a `Vec<u8>` that is
+/// always valid UTF-8; see [`simple_fold`] for the allocation behavior.
+fn fold_into_bytes(s: String) -> Vec<u8> {
+    let mut bytes = s.into_bytes();
+    // Tier 1 — full straight-through pass: lowercase every ASCII A..Z byte
+    // in place (a no-op on any non-ASCII byte, since `b.wrapping_sub(b'A')`
+    // is ≥ 26 for every byte outside 0x41..=0x5A), and OR all bytes together
+    // so a single sign-bit test afterwards tells us whether the input
+    // contained any multibyte UTF-8 sequences. No early `break`, no
+    // input-dependent control flow — LLVM auto-vectorizes the loop.
+    let mut high_bit_acc: u8 = 0;
+    for b in &mut bytes {
+        high_bit_acc |= *b;
+        let is_upper = b.wrapping_sub(b'A') < 26;
+        *b |= u8::from(is_upper) << 5;
+    }
+    if high_bit_acc & 0x80 == 0 {
+        return bytes;
+    }
+    // Non-ASCII bytes are present. Locate the first one (SIMD-fast via
+    // `position`/memchr) and hand off to the UTF-8 path from there — the
+    // ASCII prefix is already lowercased and folding is idempotent on
+    // lower-case ASCII, so skipping it is purely an optimization.
+    let first_non_ascii = bytes
+        .iter()
+        .position(|&b| b & 0x80 != 0)
+        .expect("a non-ASCII byte exists (the high-bit accumulator was set)");
+    fold_non_ascii_tail(bytes, first_non_ascii)
+}
+
+/// Tier 2 — copy-on-fold UTF-8 path. Scans the non-ASCII tail of the
+/// already-(ASCII-)lowercased `bytes` for the first character that actually
+/// folds to *different* bytes. Until one is found nothing is copied, so an
+/// input whose multibyte content never folds is returned in its original
+/// allocation untouched. Once a folding character is hit, a fresh buffer is
+/// allocated and the rest is built by bulk-copying each contiguous unmodified
+/// span and re-encoding the folded characters in between. The returned bytes
+/// are always valid UTF-8.
+///
+/// Characters are never fully decoded: the page index (`cp >> 6`) comes from
+/// the first one or two bytes for the `PAGE_BITMAP` reject, and on a page hit
+/// the remaining `cp & 0x3F` is read directly from the final byte to drive the
+/// within-page run search and byte-delta fold — no code-point reconstruction.
+fn fold_non_ascii_tail(bytes: Vec<u8>, start: usize) -> Vec<u8> {
+    let mut out: Vec<u8> = Vec::new();
+    let src = bytes.as_ptr();
+    // Raw write cursor into `out`'s buffer. Null until the first real fold
+    // allocates `out` (its pointer is then non-null), so `dst.is_null()` doubles
+    // as the "haven't started building the output yet" flag. We bypass the Vec
+    // push/reserve API: the buffer is reserved once for the worst case, so every
+    // copy/store below is unchecked.
+    let mut dst: *mut u8 = core::ptr::null_mut();
+    // `flushed` marks the start of the contiguous run of `bytes` that is
+    // already correct but not yet copied out.
+    let mut flushed = 0usize;
+    let mut read = start;
+    while read < bytes.len() {
+        // ASCII (already lowercased by tier 1): unchanged, keep scanning.
+        let lead = bytes[read];
+        if lead & 0x80 == 0 {
+            read += 1;
+            continue;
+        }
+        // Page-precision reject probe (see the module docs). Recover the
+        // `PAGE_BITMAP` coordinates of `cp >> 6` directly as `(word_idx,
+        // bit_idx)` — `cp >> 12` indexes the bitmap word and `(cp >> 6) & 63`
+        // the bit — without materializing the combined page number.
+        let (word_idx, bit_idx, c_len) = if lead < 0xE0 {
+            (0usize, (lead & 0x1F) as u32, 2usize)
+        } else if lead < 0xF0 {
+            ((lead & 0x0F) as usize, (bytes[read + 1] & 0x3F) as u32, 3)
+        } else {
+            (
+                (((lead & 0x07) as usize) << 6) | (bytes[read + 1] & 0x3F) as usize,
+                (bytes[read + 2] & 0x3F) as u32,
+                4,
+            )
+        };
+        if word_idx >= PAGE_BITMAP.len() || (PAGE_BITMAP[word_idx] >> bit_idx) & 1 == 0 {
+            read += c_len;
+            continue;
+        }
+        let low_v = bytes[read + c_len - 1] & 0x3F;
+        let dense = popcount_up_to(word_idx, bit_idx) as usize;
+        let lo = PAGE_OFFSET[dense] as usize;
+        let n = PAGE_OFFSET[dense + 1] as usize - lo;
+        let off = scan_end_low(lo, n, low_v);
+        let idx = if off < n {
+            // The scan guarantees `low_v <= end_low`; the run covers `low_v`
+            // iff `low_v >= start_low` (and, for stride 2, the offset is even).
+            // No code-point reconstruction — `low_v` is compared directly.
+            let ss = RUN_START_STRIDE[lo + off];
+            let start_low = ss & 0x3F;
+            let stride_bit = ss >> 6;
+            if low_v < start_low || ((low_v - start_low) & stride_bit) != 0 {
+                read += c_len;
+                continue;
+            }
+            lo + off
+        } else {
+            read += c_len;
+            continue;
+        };
+        // Load the character's bytes as a little-endian u32, mask off the lanes
+        // past it, add the run's constant byte delta. Over-reading 4 bytes is
+        // safe except within ≤3 bytes of the buffer end; the variable-length
+        // fallback there is far slower (a `memcpy` call per fold), so the fast
+        // path is worth the branch.
+        let raw = if read + 4 <= bytes.len() {
+            u32::from_le_bytes(bytes[read..read + 4].try_into().expect("4-byte slice"))
+        } else {
+            let mut w = [0u8; 4];
+            w[..c_len].copy_from_slice(&bytes[read..read + c_len]);
+            u32::from_le_bytes(w)
+        };
+        let word = raw & (u32::MAX >> ((4 - c_len) * 8));
+        let folded = word.wrapping_add(BYTE_DELTA[idx]);
+        let dest_len = utf8_len((folded & 0xFF) as u8);
+        if dst.is_null() {
+            // Reserve once for the worst case so the writes below never need a
+            // per-store capacity check. Output is at most 1.5× the input: the
+            // only folds that grow are U+023A/U+023E (2→3 bytes), so every 2
+            // input bytes yield ≤3 output bytes; `+ 4` covers the 4-byte
+            // over-store of the final character. The non-zero capacity makes
+            // `out.as_mut_ptr()` non-null, so `dst` is non-null from here on.
+            out = Vec::with_capacity(bytes.len() + bytes.len() / 2 + 4);
+            dst = out.as_mut_ptr();
+        }
+        // SAFETY: the buffer is reserved for the worst-case 1.5× output plus 4
+        // bytes of over-store headroom, so the bytes written so far (`dst`
+        // minus the base pointer) plus the 4-byte store stay in bounds on
+        // every iteration. `src` and `dst` point into distinct allocations.
+        unsafe {
+            let run = read - flushed;
+            if run != 0 {
+                core::ptr::copy_nonoverlapping(src.add(flushed), dst, run);
+                dst = dst.add(run);
+            }
+            // Store a full 4-byte word, advance only by the real folded length.
+            dst.cast::<u32>().write_unaligned(folded.to_le());
+            dst = dst.add(dest_len);
+        }
+        read += c_len;
+        flushed = read;
+    }
+    if dst.is_null() {
+        // Nothing folded — return the original buffer with no extra copy.
+        return bytes;
+    }
+    // SAFETY: the trailing unmodified run fits in the reserved buffer; `dst`
+    // minus the base pointer is the total number of bytes written.
+    unsafe {
+        let tail = bytes.len() - flushed;
+        core::ptr::copy_nonoverlapping(src.add(flushed), dst, tail);
+        dst = dst.add(tail);
+        out.set_len(dst as usize - out.as_ptr() as usize);
+    }
+    out
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+    use crate::test_support::reference;
+    use std::collections::HashMap;
+
+    /// Per-character fold via the reference map, used as the oracle for the
+    /// byte-oriented `fold_into_bytes` cross-checks below.
+    fn fold_oracle(r: &HashMap<u32, u32>, s: &str) -> Vec<u8> {
+        let mut out = String::new();
+        for c in s.chars() {
+            let cp = c as u32;
+            let folded = r.get(&cp).copied().unwrap_or(cp);
+            out.push(char::from_u32(folded).expect("reference fold is a valid char"));
+        }
+        out.into_bytes()
+    }
+
+    #[test]
+    fn fold_into_bytes_ascii() {
+        assert_eq!(fold_into_bytes(String::new()), b"");
+        assert_eq!(fold_into_bytes("Hello, WORLD!".into()), b"hello, world!");
+        assert_eq!(fold_into_bytes("abc 123 XYZ".into()), b"abc 123 xyz");
+    }
+
+    #[test]
+    fn simple_fold_returns_string() {
+        // Public `String` wrapper: ASCII, length-preserving, shrinking and
+        // growing folds all yield valid UTF-8.
+        assert_eq!(simple_fold("Hello, WORLD!".to_string()), "hello, world!");
+        assert_eq!(simple_fold("ÜBER Größe".to_string()), "über größe");
+        assert_eq!(simple_fold("\u{212A}elvin".to_string()), "kelvin");
+        assert_eq!(simple_fold("abc\u{023A}".to_string()), "abc\u{2C65}");
+        // Non-folding multibyte content is returned unchanged.
+        assert_eq!(simple_fold("漢字 שלום".to_string()), "漢字 שלום");
+    }
+
+    #[test]
+    fn fold_into_bytes_ascii_then_utf8_handoff() {
+        // ASCII prefix gets lowercased by the tier-1 loop, then control
+        // hands off to the tier-2 copy-on-fold UTF-8 path at the first
+        // multibyte lead.
+        assert_eq!(
+            fold_into_bytes("MIXED Größe TEXT".into()),
+            "mixed größe text".as_bytes(),
+        );
+        // ASCII prefix, then a *shrinking* fold inside the tail.
+        assert_eq!(fold_into_bytes("LORD \u{212A}elvin".into()), b"lord kelvin",);
+        // ASCII prefix, then a *growing* fold.
+        assert_eq!(
+            fold_into_bytes("abc\u{023A}".into()),
+            "abc\u{2C65}".as_bytes(),
+        );
+    }
+
+    #[test]
+    fn fold_into_bytes_length_preserving_bmp() {
+        assert_eq!(fold_into_bytes("ÄÖÜ".into()), "äöü".as_bytes());
+        assert_eq!(fold_into_bytes("ΑΒΓ".into()), "αβγ".as_bytes());
+        assert_eq!(fold_into_bytes("漢字".into()), "漢字".as_bytes());
+    }
+
+    #[test]
+    fn fold_into_bytes_reuses_buffer_for_ascii_input() {
+        // Pure-ASCII inputs are lowercased in place — the returned Vec must
+        // hold the exact same allocation as the input String.
+        let s = "MIXED case AsCiI 12345".to_string();
+        let original_ptr = s.as_ptr();
+        let out = fold_into_bytes(s);
+        assert_eq!(out, b"mixed case ascii 12345");
+        assert_eq!(out.as_ptr(), original_ptr);
+    }
+
+    #[test]
+    fn fold_into_bytes_reuses_buffer_for_nonfolding_nonascii() {
+        // Non-ASCII content that never folds (CJK + Hebrew) plus ASCII upper
+        // case: the ASCII is lowercased in place and, because no multibyte
+        // character folds, the original allocation is handed back with no
+        // second buffer — same pointer as the input String.
+        let s = "HELLO 日本語 שלום WORLD".to_string();
+        let original_ptr = s.as_ptr();
+        let out = fold_into_bytes(s);
+        assert_eq!(out, "hello 日本語 שלום world".as_bytes());
+        assert_eq!(out.as_ptr(), original_ptr);
+    }
+
+    #[test]
+    fn fold_into_bytes_handles_shrinking_fold() {
+        // U+212A KELVIN SIGN (3 bytes) folds to U+006B 'k' (1 byte).
+        assert_eq!(fold_into_bytes("\u{212A}elvin".into()), b"kelvin");
+        // Shrink inside a longer string.
+        let out = fold_into_bytes("LORD \u{212A}elvin RULES".into());
+        assert_eq!(out, b"lord kelvin rules");
+        // U+2126 OHM SIGN (3 bytes) folds to U+03C9 'ω' (2 bytes).
+        assert_eq!(fold_into_bytes("\u{2126}".into()), "\u{03C9}".as_bytes());
+    }
+
+    #[test]
+    fn fold_into_bytes_handles_growing_fold() {
+        // The Unicode 16.0 simple-fold table has exactly two folds that
+        // grow in UTF-8 length (verified by scanning CaseFolding.txt):
+        // U+023A → U+2C65 and U+023E → U+2C66, both 2 B → 3 B.
+
+        // U+023A 'Ⱥ' is 2 bytes, folds to U+2C65 'ⱥ' = 3 bytes.
+        assert_eq!(fold_into_bytes("\u{023A}".into()), "\u{2C65}".as_bytes());
+        // U+023E 'Ⱦ' is 2 bytes, folds to U+2C66 'ⱦ' = 3 bytes.
+        assert_eq!(fold_into_bytes("\u{023E}".into()), "\u{2C66}".as_bytes());
+
+        // Each one mid-string, with ASCII context on both sides, so that the
+        // copy-on-fold path also copies a prefix that tier 1 has already
+        // lowercased in place.
+        let out = fold_into_bytes("ABC\u{023A}xyz".into());
+        assert_eq!(out, "abc\u{2C65}xyz".as_bytes());
+        let out = fold_into_bytes("ABC\u{023E}xyz".into());
+        assert_eq!(out, "abc\u{2C66}xyz".as_bytes());
+
+        // Both growing folds inside the same string: the second one occurs
+        // after we have already switched to the allocating buffer.
+        let out = fold_into_bytes("\u{023A}\u{023E}".into());
+        assert_eq!(out, "\u{2C65}\u{2C66}".as_bytes());
+
+        // Mixed: a length-preserving fold, then a shrinking fold, then both
+        // growing folds — exercises every branch in one input.
+        let out = fold_into_bytes("Ä\u{212A}\u{023A}\u{023E}".into());
+        assert_eq!(out, "ä\u{006B}\u{2C65}\u{2C66}".as_bytes());
+    }
+
+    #[test]
+    fn fold_into_bytes_matches_reference_map() {
+        // Cross-check against the reference fold map on a varied input.
+        let r = reference();
+        let input = "Quick BROWN Fox 🦊 ÜBER Größe ΣΟΦΙΑ \u{0130}\u{023A}漢";
+        assert_eq!(fold_into_bytes(input.to_string()), fold_oracle(&r, input));
+    }
+
+    #[test]
+    fn fold_into_bytes_matches_reference_map_exhaustive() {
+        // Drive every non-ASCII scalar value through the byte-oriented fold path
+        // and cross-check against the reference fold map. This guarantees the
+        // UTF-8 lead-byte reject filter never skips a code point that actually
+        // folds (a false reject would corrupt output here). A leading 'X'
+        // forces the tier-2 UTF-8 tail to run from the very first char.
+        let r = reference();
+        let mut input = String::from("X");
+        for cp in 0x80..0x110000u32 {
+            if (0xD800..0xE000).contains(&cp) {
+                continue; // surrogates aren't valid chars
+            }
+            input.push(char::from_u32(cp).expect("cp is a valid non-surrogate char"));
+        }
+        let expected = fold_oracle(&r, &input);
+        assert_eq!(fold_into_bytes(input), expected);
+    }
+}