diff --git a/crates/casefold/README.md b/crates/casefold/README.md index 211fe24..6487f58 100644 --- a/crates/casefold/README.md +++ b/crates/casefold/README.md @@ -9,6 +9,9 @@ multiple GiB/s — several × faster than a `HashMap` fold table — while using form, as defined by the Unicode [CaseFolding.txt][cf] data file restricted to the **simple** (1-to-1) folds (statuses `C` and `S`). Full multi-character folds (`F`, e.g. `ß` → `ss`) and Turkic locale folds (`T`) are not supported. +The crate also provides [`index_fold`](#single-byte-index-fold), which projects +every character — ASCII or multibyte — onto a single byte, a handy primitive for +case-insensitive n-gram indexing. [cf]: https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt @@ -25,6 +28,56 @@ assert_eq!(simple_fold("Hello, WORLD!".to_string()), "hello, world!"); assert_eq!(simple_fold("ÜBER".to_string()), "über"); ``` +## Single-byte index fold + +`index_fold(s: String) -> Vec<u8>` applies the **same** simple fold as +`simple_fold`, then collapses **every character to exactly one byte**: + +- ASCII characters become their plain lowercased byte (high bit clear). +- Every multibyte character becomes `0x80 | (cp & 0x7F)` — the low 7 bits of its + *folded* code point, with the high bit set. The high bit is set + unconditionally, so even a multibyte character that folds to ASCII (e.g. + U+212A KELVIN SIGN → `k`) yields `0x80 | b'k'`, never a bare ASCII byte. + +```rust +use casefold::index_fold; +assert_eq!(index_fold("Hi!".to_string()), b"hi!"); +assert_eq!(index_fold("Ü".to_string()), &[0xFC]); // ü → 0x80 | (0xFC & 0x7F) +assert_eq!(index_fold("中".to_string()), &[0x80 | 0x2D]); +``` + +The result is fixed-width (one byte per character) and is therefore **not** +valid UTF-8. To fold a single code point, use `index_fold_char(c: char) -> u8`, +which returns the same byte `index_fold` would emit for that character. + +### Why one byte per character? 
+ +This is a building block for **case-insensitive n-gram indexing**. When every +character — ASCII or not — is reduced to a single byte, a fixed *k*-gram is just +*k* contiguous bytes: byte n-grams are trivial to slice, hash, and store, they +are already case-folded so lookups are case-insensitive for free, and a document +of *n* characters yields exactly *n* index bytes. ASCII keeps its natural byte, +and multibyte scripts are projected onto the high half (`0x80–0xFF`) so they +never collide with ASCII. + +The projection is intentionally **lossy** — distinct code points that share the +same low 7 bits map to the same byte (most CJK, for instance, lands in a narrow +band). That is fine for an index: use `index_fold` as a cheap *candidate filter* +that never produces false negatives for a case-insensitive match, then verify +exact hits against the original text afterwards. + +Mechanically it reuses the whole fold table; the only addition is a per-run +7-bit `INDEX_DELTA`. By modular arithmetic the folded low 7 bits are +`((cp & 0x7F) + (delta & 0x7F)) mod 128`, so the fold is a single +`wrapping_add` — no UTF-8 reconstruction, no decode, no encode (the stray carry +bit is overwritten by the unconditional `0x80 |`). Because the output is never +longer than the input, it runs fully in place in the input's own buffer, and +pure-ASCII input is returned untouched. It shares `simple_fold`'s +auto-vectorized ASCII pass (~46 GiB/s) and, since it emits one byte per +character, runs *faster* than `simple_fold` on folding-heavy input (e.g. ~1.9 +vs ~1.3 GiB/s on length-changing folds, ~1.1 vs ~0.9 GiB/s on mixed BMP) and a +little slower on pure-reject CJK/symbols due to character collapsing. + ## Why does this crate exist? Unicode 16.0 defines 1484 simple-fold mappings. Common ways to store them: @@ -68,7 +121,7 @@ query: `wrapping_add`, one 4-byte store — no decode, no encode. Writing fewer/more bytes than were read handles length-changing folds (`K`→`k`, `Ⱥ`→`ⱥ`). 
-### Table layout (1776 B total) +### Table layout (2014 B total) | Component | Bytes | |-------------------------------------------------|------:| @@ -78,8 +131,11 @@ query: | `RUN_END_LOW[238 + 8]: u8` (clean scan key, `end & 0x3F`; +8 SWAR pad) | 246 | | `RUN_START_STRIDE[238]: u8` (`start & 0x3F` \| stride bit) | 238 | | `BYTE_DELTA[238]: u32` (little-endian fold delta per run) | 952 | -| **Total** | **1776** | +| `INDEX_DELTA[238]: u8` (7-bit per-run fold delta, `index_fold` only) | 238 | +| **Total** | **2014** | +The `simple_fold` path uses 1776 B of this; the 238 B `INDEX_DELTA` side table +powers [`index_fold`](#single-byte-index-fold) only. (Splitting runs at byte-delta boundaries raises the run count from 227 to 238.) The data file is parsed at build time by `build.rs`, which emits the packed `static` tables to `OUT_DIR/table.rs`. diff --git a/crates/casefold/benchmarks/conversion.rs b/crates/casefold/benchmarks/conversion.rs index 8a73c76..de7cab4 100644 --- a/crates/casefold/benchmarks/conversion.rs +++ b/crates/casefold/benchmarks/conversion.rs @@ -1,7 +1,8 @@ //! Benchmarks for `casefold::simple_fold`, comparing it against several -//! baselines on representative inputs. Each input is run through six variants: +//! baselines on representative inputs. Each input is run through these variants: //! //! - `casefold::simple_fold` — the implementation under test. +//! - `casefold::index_fold` — the one-byte-per-character index fold. //! - `HashMap::fold_into_bytes` — a HashMap-based case fold over raw UTF-8. //! - `str::to_lowercase` — straightforward Unicode lowercasing baseline. //! - `chars().flat_map(to_lowercase)` — the per-char flat-map variant. @@ -13,7 +14,7 @@ //! cases (e.g. `Σ` final-sigma context, `İ` → `i\u{0307}`). These benchmarks //! are about throughput on equivalent workloads, not output equality. 
-use casefold::{simple_fold, utf8_len}; +use casefold::{index_fold, simple_fold, utf8_len}; use casefold_benchmarks::{hashmap_fold_utf8, reference_map_utf8, FoldHashMap}; use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion, Throughput}; use std::hint::black_box; @@ -156,6 +157,14 @@ fn bench_conversion(c: &mut Criterion, name: &str, input: &str) { }, ); + group.bench_function(BenchmarkId::new("Casefold::index_fold", input.len()), |b| { + b.iter_batched( + || input.to_string(), + |s| index_fold(black_box(s)), + criterion::BatchSize::SmallInput, + ); + }); + let fold_map = reference_map_utf8(); group.bench_function( BenchmarkId::new("HashMap::fold_into_bytes (UTF-8 u32)", input.len()), diff --git a/crates/casefold/build.rs b/crates/casefold/build.rs index 17a62e7..be90b22 100644 --- a/crates/casefold/build.rs +++ b/crates/casefold/build.rs @@ -307,9 +307,21 @@ fn emit_tables(folds: &[Fold], runs: &[Run]) -> String { .max() .unwrap_or(0); + // Parallel 7-bit index deltas, one per run, for `index_fold`. The fold + // collapses each code point to `cp & 0x7F`; by modular arithmetic the folded + // low-7-bit value is `((cp & 0x7F) + (delta & 0x7F)) & 0x7F`, so storing the + // code-point delta reduced mod 128 lets `index_fold` derive the folded index + // byte with one `wrapping_add` + mask — no UTF-8 reconstruction. The high + // bit is added unconditionally at write time, so only 7 bits are stored. + let index_deltas: Vec<u8> = runs.iter().map(|r| (r.delta & 0x7F) as u8).collect(); + // Sanity: size accounting (printed as build warnings for visibility). 
let index_bytes = page_bitmap.len() * 8 + popcnt_samples.len() + page_offset.len(); - let total = index_bytes + run_end_low.len() + run_start_stride.len() + byte_deltas.len() * 4; + let total = index_bytes + + run_end_low.len() + + run_start_stride.len() + + byte_deltas.len() * 4 + + index_deltas.len(); if env::var_os("CASEFOLD_BUILD_INFO").is_some() { println!( "cargo:warning=casefold table: {} fold entries, {} runs, {} populated pages, {} bytes total ({:.2} bits/entry), max |delta| = {}, max |byte_delta| = {}", @@ -338,6 +350,7 @@ fn emit_tables(folds: &[Fold], runs: &[Run]) -> String { emit_u8_array(&mut s, "RUN_END_LOW", &run_end_low); emit_u8_array(&mut s, "RUN_START_STRIDE", &run_start_stride); emit_u32_array(&mut s, "BYTE_DELTA", &byte_deltas); + emit_u8_array(&mut s, "INDEX_DELTA", &index_deltas); s } diff --git a/crates/casefold/src/index_fold.rs b/crates/casefold/src/index_fold.rs new file mode 100644 index 0000000..fddb2be --- /dev/null +++ b/crates/casefold/src/index_fold.rs @@ -0,0 +1,287 @@ +//! Compact one-byte-per-character *index* fold, built on the same paged-bitmap +//! run table as [`simple_fold`](crate::simple_fold). + +use crate::table::*; +use crate::{popcount_up_to, scan_end_low}; + +/// Consumes `s` and returns its simple case-folded form as a compact byte +/// *index*: each character is folded with the same simple (1-to-1) fold as +/// [`simple_fold`](crate::simple_fold), then collapsed to **exactly one byte** +/// per input character. +/// +/// Single-byte (ASCII) characters are emitted as their plain lowercased byte +/// (high bit clear). Every multibyte character is replaced by the single byte +/// `0x80 | (cp & 0x7F)`: the low 7 bits of its *folded* code point with the high +/// bit set. The high bit is set unconditionally, so a multibyte character that +/// folds to ASCII (e.g. U+212A KELVIN SIGN → `k`) still yields a high-bit byte +/// (`0x80 | b'k'`), not the bare ASCII byte. 
+/// +/// The result has one byte per character and is therefore **not** valid UTF-8. +/// It is intended as a cheap, fixed-width key for case-insensitive indexing or +/// hashing where collisions between code points sharing the same low 7 bits are +/// acceptable. +/// +/// Because every character collapses to exactly one byte, the output is never +/// longer than the input; pure-ASCII input is folded in place (the input's heap +/// buffer is returned untouched), and once a multibyte character is hit the +/// remainder is rewritten in place with a write cursor that never overtakes the +/// read cursor, so no second buffer is ever allocated. +/// +/// Like [`simple_fold`](crate::simple_fold), characters are never fully decoded +/// and the fold needs no UTF-8 reconstruction: the page coordinates come from +/// the lead/continuation bytes, and on a fold hit the folded low 7 bits are +/// `(cp & 0x7F)` plus the run's 7-bit `INDEX_DELTA`, masked back to 7 bits. +/// +/// # Example +/// +/// ``` +/// use casefold::index_fold; +/// assert_eq!(index_fold("Hi!".to_string()), b"hi!"); +/// // U+212A KELVIN SIGN folds to ASCII 'k', but the high bit is still set: +/// assert_eq!(index_fold("\u{212A}".to_string()), &[0x80 | b'k']); +/// // 'Ü' (U+00DC) folds to 'ü' (U+00FC); 0x80 | (0xFC & 0x7F) == 0xFC: +/// assert_eq!(index_fold("Ü".to_string()), &[0xFC]); +/// ``` +pub fn index_fold(s: String) -> Vec<u8> { + let mut bytes = s.into_bytes(); + // Tier 1 — vectorizable straight-through pass (identical to `fold_into_bytes`): + // lowercase every ASCII A..Z byte in place and OR all bytes together so a + // single sign-bit test tells us whether any multibyte sequence is present. + let mut high_bit_acc: u8 = 0; + for b in &mut bytes { + high_bit_acc |= *b; + let is_upper = b.wrapping_sub(b'A') < 26; + *b |= u8::from(is_upper) << 5; + } + if high_bit_acc & 0x80 == 0 { + // Pure ASCII: already folded in place, one byte per character. 
+ return bytes; + } + // Tier 2 — collapse each character to one index byte, in place. The ASCII + // prefix (already lowercased above, one byte per char) is left untouched; + // from the first non-ASCII byte we rewrite with a `write` cursor that, since + // every character yields exactly one byte from its >= 1 source bytes, never + // overtakes `read`. + let first_non_ascii = bytes + .iter() + .position(|&b| b & 0x80 != 0) + .expect("a non-ASCII byte exists (the high-bit accumulator was set)"); + let mut write = first_non_ascii; + let mut read = first_non_ascii; + while read < bytes.len() { + let lead = bytes[read]; + // ASCII (already lowercased by tier 1): copy through as a single byte. + if lead & 0x80 == 0 { + bytes[write] = lead; + write += 1; + read += 1; + continue; + } + // Multibyte: recover the `PAGE_BITMAP` coordinates of `cp >> 6` + // directly as `(word_idx, bit_idx)` — the high part `cp >> 12` indexes + // the bitmap word, the next 6 bits `(cp >> 6) & 63` index the bit — + // without ever materializing the combined page number. + let (word_idx, bit_idx, c_len) = if lead < 0xE0 { + (0usize, (lead & 0x1F) as u32, 2usize) + } else if lead < 0xF0 { + ((lead & 0x0F) as usize, (bytes[read + 1] & 0x3F) as u32, 3) + } else { + ( + (((lead & 0x07) as usize) << 6) | (bytes[read + 1] & 0x3F) as usize, + (bytes[read + 2] & 0x3F) as u32, + 4, + ) + }; + let low_v = bytes[read + c_len - 1] & 0x3F; + // The source code point's low 7 bits, `cp & 0x7F`, as `((cp >> 6) & 1) + // << 6 | (cp & 0x3F)`: `bit_idx`'s low bit is `(cp >> 6) & 1`. We don't + // mask `bit_idx` to one bit — its higher bits land in output bit 7+, + // which the unconditional `0x80 |` at write time overwrites anyway. 
+ let mut folded_index = ((bit_idx << 6) as u8) | low_v; + if word_idx < PAGE_BITMAP.len() && (PAGE_BITMAP[word_idx] >> bit_idx) & 1 != 0 { + let dense = popcount_up_to(word_idx, bit_idx) as usize; + let lo = PAGE_OFFSET[dense] as usize; + let n = PAGE_OFFSET[dense + 1] as usize - lo; + let off = scan_end_low(lo, n, low_v); + if off < n { + let ss = RUN_START_STRIDE[lo + off]; + let start_low = ss & 0x3F; + let stride_bit = ss >> 6; + if low_v >= start_low && ((low_v - start_low) & stride_bit) == 0 { + // Folding character: by modular arithmetic the folded low 7 + // bits are `((cp & 0x7F) + (delta & 0x7F)) mod 128`, so adding + // the run's 7-bit `INDEX_DELTA` yields them directly — no UTF-8 + // reconstruction. The add may carry into bit 7, but that bit + // is overwritten by `0x80 |` below, so no `& 0x7F` is needed. + folded_index = folded_index.wrapping_add(INDEX_DELTA[lo + off]); + } + } + } + // `write <= read` here, and the source bytes this character needs were + // all read above, so storing the single index byte never clobbers + // bytes still to be read. The high bit always marks a multibyte origin. + bytes[write] = 0x80 | folded_index; + write += 1; + read += c_len; + } + bytes.truncate(write); + bytes +} + +/// Folds a single `char` to its one-byte [`index_fold`] representation. +/// +/// Equivalent to the per-character output of [`index_fold`]: an ASCII `char` +/// yields its lowercased byte (high bit clear); any other `char` yields +/// `0x80 | (cp & 0x7F)` of its *folded* code point (high bit set), including a +/// multibyte `char` that folds to ASCII (e.g. U+212A KELVIN SIGN → `0x80 | b'k'`). 
+/// +/// # Example +/// +/// ``` +/// use casefold::index_fold_char; +/// assert_eq!(index_fold_char('A'), b'a'); +/// assert_eq!(index_fold_char('Ü'), 0xFC); // ü → 0x80 | (0xFC & 0x7F) +/// assert_eq!(index_fold_char('中'), 0x80 | 0x2D); +/// ``` +pub fn index_fold_char(c: char) -> u8 { + let cp = c as u32; + if cp < 0x80 { + // ASCII: lowercase A..Z (a no-op otherwise), high bit stays clear. + let b = cp as u8; + let is_upper = b.wrapping_sub(b'A') < 26; + return b | (u8::from(is_upper) << 5); + } + // Multibyte: the `PAGE_BITMAP` coordinates of `cp >> 6` are `word_idx = + // cp >> 12` (the bitmap word) and `bit_idx = (cp >> 6) & 63` (the bit). + let word_idx = (cp >> 12) as usize; + let bit_idx = (cp >> 6) & 0x3F; + let low_v = (cp & 0x3F) as u8; + // `cp & 0x7F` is the source low 7 bits; a fold adds the run's 7-bit delta. + let mut folded_index = (cp & 0x7F) as u8; + if word_idx < PAGE_BITMAP.len() && (PAGE_BITMAP[word_idx] >> bit_idx) & 1 != 0 { + let dense = popcount_up_to(word_idx, bit_idx) as usize; + let lo = PAGE_OFFSET[dense] as usize; + let n = PAGE_OFFSET[dense + 1] as usize - lo; + let off = scan_end_low(lo, n, low_v); + if off < n { + let ss = RUN_START_STRIDE[lo + off]; + let start_low = ss & 0x3F; + let stride_bit = ss >> 6; + if low_v >= start_low && ((low_v - start_low) & stride_bit) == 0 { + folded_index = folded_index.wrapping_add(INDEX_DELTA[lo + off]); + } + } + } + 0x80 | folded_index +} + +#[cfg(test)] +mod tests { + use super::*; + use crate::test_support::reference; + use std::collections::HashMap; + + /// Per-character index fold via the reference map: fold each char, then + /// collapse it to one byte the same way [`index_fold`] does. The high bit is + /// set for every multibyte (source `cp >= 0x80`) character, even one that + /// folds to ASCII. 
+ fn index_fold_oracle(r: &HashMap<u32, u32>, s: &str) -> Vec<u8> { + let mut out = Vec::new(); + for c in s.chars() { + let cp = c as u32; + let folded = r.get(&cp).copied().unwrap_or(cp); + if cp < 0x80 { + out.push(folded as u8); + } else { + out.push(0x80 | (folded & 0x7F) as u8); + } + } + out + } + + #[test] + fn index_fold_ascii() { + assert_eq!(index_fold(String::new()), b""); + assert_eq!(index_fold("Hello, WORLD!".into()), b"hello, world!"); + assert_eq!(index_fold("abc 123 XYZ".into()), b"abc 123 xyz"); + } + + #[test] + fn index_fold_reuses_buffer_for_ascii_input() { + // Pure-ASCII input is folded in place; the returned Vec must hold the + // exact same allocation as the input String. + let s = "MIXED case AsCiI 12345".to_string(); + let original_ptr = s.as_ptr(); + let out = index_fold(s); + assert_eq!(out, b"mixed case ascii 12345"); + assert_eq!(out.as_ptr(), original_ptr); + } + + #[test] + fn index_fold_multibyte_to_single_byte() { + // Ü (U+00DC) folds to ü (U+00FC); 0x80 | (0xFC & 0x7F) == 0xFC. + assert_eq!(index_fold("Ü".into()), vec![0xFC]); + // Length-preserving fold of three 2-byte chars to one byte each. + assert_eq!( + index_fold("ÄÖÜ".into()), + vec![0x80 | 0x64, 0x80 | 0x76, 0xFC] + ); + // Fold to ASCII keeps the high bit set: U+212A KELVIN SIGN -> 'k'. + assert_eq!( + index_fold("\u{212A}elvin".into()), + vec![0x80 | b'k', b'e', b'l', b'v', b'i', b'n'], + ); + // Growing fold U+023A -> U+2C65: 0x80 | (0x2C65 & 0x7F) == 0xE5. + assert_eq!(index_fold("\u{023A}".into()), vec![0x80 | 0x65]); + // Non-folding multibyte still collapses to its low 7 bits. 
+ assert_eq!(index_fold("中".into()), vec![0x80 | 0x2D]); + } + + #[test] + fn index_fold_matches_reference_map() { + let r = reference(); + let input = "Quick BROWN Fox 🦊 ÜBER Größe ΣΟΦΙΑ \u{0130}\u{023A}漢"; + assert_eq!(index_fold(input.to_string()), index_fold_oracle(&r, input)); + } + + #[test] + fn index_fold_matches_reference_map_exhaustive() { + // Drive every non-ASCII Unicode scalar value (assigned or not) through + // the byte-oriented index path and cross-check it against the reference + // fold map. + let r = reference(); + let mut input = String::from("X"); + for cp in 0x80..0x110000u32 { + if (0xD800..0xE000).contains(&cp) { + continue; // surrogates aren't valid chars + } + input.push(char::from_u32(cp).expect("cp is a valid non-surrogate char")); + } + let expected = index_fold_oracle(&r, &input); + assert_eq!(index_fold(input), expected); + } + + #[test] + fn index_fold_char_examples() { + assert_eq!(index_fold_char('A'), b'a'); + assert_eq!(index_fold_char('!'), b'!'); + assert_eq!(index_fold_char('Ü'), 0xFC); + assert_eq!(index_fold_char('中'), 0x80 | 0x2D); + // Fold to ASCII keeps the high bit set. + assert_eq!(index_fold_char('\u{212A}'), 0x80 | b'k'); + } + + #[test] + fn index_fold_char_matches_index_fold_exhaustive() { + // Every code point's single-char `index_fold_char` must equal the lone + // byte `index_fold` produces for that character. 
+ for cp in 0u32..0x110000 { + if (0xD800..0xE000).contains(&cp) { + continue; // surrogates aren't valid chars + } + let c = char::from_u32(cp).expect("cp is a valid non-surrogate char"); + let folded = index_fold(c.to_string()); + assert_eq!(folded.len(), 1, "cp {cp:#x} did not yield one byte"); + assert_eq!(index_fold_char(c), folded[0], "cp {cp:#x}"); + } + } +} diff --git a/crates/casefold/src/lib.rs b/crates/casefold/src/lib.rs index f826fe1..31ccf3b 100644 --- a/crates/casefold/src/lib.rs +++ b/crates/casefold/src/lib.rs @@ -70,201 +70,12 @@ mod table { include!(concat!(env!("OUT_DIR"), "/table.rs")); } -use table::*; - -/// Consumes `s` and returns its simple case-folded form as a `String`. The -/// input's heap buffer is reused untouched whenever folding changes no bytes — -/// that covers pure-ASCII / already-lowercase input (folded in place) *and* -/// any input whose multibyte characters never fold (CJK, Hangul, Kana, -/// Arabic, Hebrew, Indic, symbols, …). A fresh buffer is allocated only once -/// an actual case fold is encountered; from there, unmodified spans are -/// bulk-copied and folded characters are re-encoded in between. -/// -/// Folds may shrink (e.g. U+212A KELVIN SIGN is 3 bytes but folds to `k` = -/// 1 byte) or grow (e.g. U+023A `Ⱥ` is 2 bytes but folds to U+2C65 `ⱥ` = -/// 3 bytes), so in-place rewriting isn't possible in general — but inputs that -/// don't fold at all skip the second buffer entirely. -/// -/// Only **simple** (1-to-1) folds are applied; multi-character folds such as -/// `ß` → `ss` and Turkic locale folds are left unchanged. 
-/// -/// # Example -/// -/// ``` -/// use casefold::simple_fold; -/// assert_eq!(simple_fold("Hello, WORLD!".to_string()), "hello, world!"); -/// assert_eq!(simple_fold("ÜBER".to_string()), "über"); -/// // Length-changing fold (U+212A KELVIN SIGN → U+006B, 3 bytes → 1 byte): -/// assert_eq!(simple_fold("\u{212A}elvin".to_string()), "kelvin"); -/// ``` -pub fn simple_fold(s: String) -> String { - // SAFETY: `fold_into_bytes` only lowercases ASCII bytes in place and - // re-encodes whole characters through the fold table, so its output is - // always valid UTF-8 (see the exhaustive round-trip test). - unsafe { String::from_utf8_unchecked(fold_into_bytes(s)) } -} - -/// Byte-level core of [`simple_fold`]. Returns the fold as a `Vec<u8>` that is -/// always valid UTF-8; see [`simple_fold`] for the allocation behavior. -fn fold_into_bytes(s: String) -> Vec<u8> { - let mut bytes = s.into_bytes(); - // Tier 1 — full straight-through pass: lowercase every ASCII A..Z byte - // in place (a no-op on any non-ASCII byte, since `b.wrapping_sub(b'A')` - // is ≥ 26 for every byte outside 0x41..0x5A), and OR all bytes together - // so a single sign-bit test afterwards tells us whether the input - // contained any multibyte UTF-8 sequences. No early `break`, no - // input-dependent control flow — LLVM auto-vectorizes the loop. - let mut high_bit_acc: u8 = 0; - for b in &mut bytes { - high_bit_acc |= *b; - let is_upper = b.wrapping_sub(b'A') < 26; - *b |= u8::from(is_upper) << 5; - } - if high_bit_acc & 0x80 == 0 { - return bytes; - } - // Non-ASCII bytes are present. Locate the first one (SIMD-fast via - // `position`/memchr) and hand off to the UTF-8 path from there — the - // ASCII prefix is already lowercased and folding is idempotent on - // lower-case ASCII, so skipping it is purely an optimization. 
- let first_non_ascii = bytes - .iter() - .position(|&b| b & 0x80 != 0) - .expect("a non-ASCII byte exists (the high-bit accumulator was set)"); - fold_non_ascii_tail(bytes, first_non_ascii) -} +mod index_fold; +mod simple_fold; +pub use index_fold::{index_fold, index_fold_char}; +pub use simple_fold::simple_fold; -/// Tier 2 — copy-on-fold UTF-8 path. Scans the non-ASCII tail of the -/// already-(ASCII-)lowercased `bytes` for the first character that actually -/// folds to *different* bytes. Until one is found nothing is copied, so an -/// input whose multibyte content never folds is returned in its original -/// allocation untouched. Once a folding character is hit, a fresh buffer is -/// allocated and the rest is built by bulk-copying each contiguous unmodified -/// span and re-encoding the folded characters in between. The returned bytes -/// are always valid UTF-8. -/// -/// Characters are never fully decoded: the page index (`cp >> 6`) comes from -/// the first one or two bytes for the `PAGE_BITMAP` reject, and on a page hit -/// the remaining `cp & 0x3F` is read directly from the final byte to drive the -/// within-page run search and byte-delta fold — no code-point reconstruction. -fn fold_non_ascii_tail(bytes: Vec<u8>, start: usize) -> Vec<u8> { - let mut out: Vec<u8> = Vec::new(); - let src = bytes.as_ptr(); - // Raw write cursor into `out`'s buffer. Null until the first real fold - // allocates `out` (its pointer is then non-null), so `dst.is_null()` doubles - // as the "haven't started building the output yet" flag. We bypass the Vec - // push/reserve API: the buffer is reserved once for the worst case, so every - // copy/store below is unchecked. - let mut dst: *mut u8 = core::ptr::null_mut(); - // `flushed` marks the start of the contiguous run of `bytes` that is - // already correct but not yet copied out. - let mut flushed = 0usize; - let mut read = start; - while read < bytes.len() { - // ASCII (already lowercased by pass 1) — unchanged, keep scanning. 
- let lead = bytes[read]; - if lead & 0x80 == 0 { - read += 1; - continue; - } - // Page-precision reject probe (see the module docs). - let (page, c_len) = if lead < 0xE0 { - ((lead & 0x1F) as u32, 2usize) - } else if lead < 0xF0 { - ( - (((lead & 0x0F) as u32) << 6) | (bytes[read + 1] & 0x3F) as u32, - 3, - ) - } else { - ( - (((lead & 0x07) as u32) << 12) - | (((bytes[read + 1] & 0x3F) as u32) << 6) - | (bytes[read + 2] & 0x3F) as u32, - 4, - ) - }; - let word_idx = (page >> 6) as usize; - if word_idx >= PAGE_BITMAP.len() || (PAGE_BITMAP[word_idx] >> (page & 63)) & 1 == 0 { - read += c_len; - continue; - } - let low_v = bytes[read + c_len - 1] & 0x3F; - let dense = popcount_up_to(page) as usize; - let lo = PAGE_OFFSET[dense] as usize; - let n = PAGE_OFFSET[dense + 1] as usize - lo; - let off = scan_end_low(lo, n, low_v); - let idx = if off < n { - // The scan guarantees `low_v <= end_low`; the run covers `low_v` - // iff `low_v >= start_low` (and, for stride 2, the offset is even). - // No code-point reconstruction — `low_v` is compared directly. - let ss = RUN_START_STRIDE[lo + off]; - let start_low = ss & 0x3F; - let stride_bit = ss >> 6; - if low_v < start_low || ((low_v - start_low) & stride_bit) != 0 { - read += c_len; - continue; - } - lo + off - } else { - read += c_len; - continue; - }; - // Load the character's bytes as a little-endian u32, mask off the lanes - // past it, add the run's constant byte delta. Over-reading 4 bytes is - // safe except within ≤3 bytes of the buffer end; the variable-length - // fallback there is far slower (a `memcpy` call per fold), so the fast - // path is worth the branch. 
- let raw = if read + 4 <= bytes.len() { - u32::from_le_bytes(bytes[read..read + 4].try_into().expect("4-byte slice")) - } else { - let mut w = [0u8; 4]; - w[..c_len].copy_from_slice(&bytes[read..read + c_len]); - u32::from_le_bytes(w) - }; - let word = raw & (u32::MAX >> ((4 - c_len) * 8)); - let folded = word.wrapping_add(BYTE_DELTA[idx]); - let dest_len = utf8_len((folded & 0xFF) as u8); - if dst.is_null() { - // Reserve once for the worst case so the writes below never need a - // per-store capacity check. Output is at most 1.5× the input: the - // only folds that grow are U+023A/U+023E (2→3 bytes), so every 2 - // input bytes yield ≤3 output bytes; `+ 4` covers the 4-byte - // over-store of the final character. The non-zero capacity makes - // `out.as_mut_ptr()` non-null, so `dst` is non-null from here on. - out = Vec::with_capacity(bytes.len() + bytes.len() / 2 + 4); - dst = out.as_mut_ptr(); - } - // SAFETY: the buffer is reserved for the worst-case 1.5× output plus 4 - // bytes of over-store headroom, so `dst` (the running output length) - // plus the 4-byte store stays in bounds for every iteration. `src` and - // `dst` are distinct allocations. - unsafe { - let run = read - flushed; - if run != 0 { - core::ptr::copy_nonoverlapping(src.add(flushed), dst, run); - dst = dst.add(run); - } - // Store a full 4-byte word, advance only by the real folded length. - dst.cast::<u32>().write_unaligned(folded.to_le()); - dst = dst.add(dest_len); - } - read += c_len; - flushed = read; - } - if dst.is_null() { - // Nothing folded — return the original buffer with no extra copy. - return bytes; - } - // SAFETY: the trailing unmodified run fits in the reserved buffer; `dst` - // minus the base pointer is the total number of bytes written. 
- unsafe { - let tail = bytes.len() - flushed; - core::ptr::copy_nonoverlapping(src.add(flushed), dst, tail); - dst = dst.add(tail); - out.set_len(dst as usize - out.as_ptr() as usize); - } - out -} +use table::*; /// Number of bytes in the UTF-8 sequence whose lead byte is `lead`. /// @@ -289,6 +100,7 @@ const fn table_size_bytes() -> usize { + RUN_END_LOW.len() + RUN_START_STRIDE.len() + BYTE_DELTA.len() * 4 + + INDEX_DELTA.len() } // ---- Paged bitmap lookup ------------------------------------------------ @@ -310,13 +122,12 @@ const fn table_size_bytes() -> usize { // RUN_START_STRIDE[i] = (start & PAGE_MASK) | ((stride - 1) << 6) // (membership, vs `cp & 0x3F`) -/// Number of populated pages strictly before `page`. +/// Number of populated pages strictly before the page located at +/// `PAGE_BITMAP[word_idx]` bit `bit_idx`. #[inline] -fn popcount_up_to(page: u32) -> u32 { - let word_idx = (page / 64) as usize; - let bit_in_word = page % 64; +fn popcount_up_to(word_idx: usize, bit_idx: u32) -> u32 { let base = POPCNT_SAMPLES[word_idx] as u32; - let partial = PAGE_BITMAP[word_idx] & ((1u64 << bit_in_word).wrapping_sub(1)); + let partial = PAGE_BITMAP[word_idx] & ((1u64 << bit_idx).wrapping_sub(1)); base + partial.count_ones() } @@ -353,12 +164,13 @@ fn scan_end_low(lo: usize, n: usize, low_v: u8) -> usize { } #[cfg(test)] -mod tests { - use super::*; +pub(crate) mod test_support { use std::collections::HashMap; use std::fs; - fn reference() -> HashMap<u32, u32> { + /// Parse `data/CaseFolding.txt` into a simple-fold map (statuses `C` and + /// `S`), shared by the `simple_fold` and `index_fold` cross-check tests. + pub(crate) fn reference() -> HashMap<u32, u32> { let text = fs::read_to_string("data/CaseFolding.txt").expect("CaseFolding.txt"); let mut out = HashMap::new(); for raw in text.lines() { @@ -390,18 +202,11 @@ mod tests { } out } +} - /// Per-character fold via the reference map, used as the oracle for the - /// byte-oriented `fold_into_bytes` cross-checks below. 
- fn fold_oracle(r: &HashMap<u32, u32>, s: &str) -> Vec<u8> { - let mut out = String::new(); - for c in s.chars() { - let cp = c as u32; - let folded = r.get(&cp).copied().unwrap_or(cp); - out.push(char::from_u32(folded).expect("reference fold is a valid char")); - } - out.into_bytes() - } +#[cfg(test)] +mod tests { + use super::*; #[test] fn table_is_compact() { @@ -411,140 +216,4 @@ mod tests { eprintln!("table size: {sz} bytes for {NUM_FOLD_ENTRIES} entries"); assert!(sz < 2400, "table size {sz} exceeds 2400 B budget"); } - - #[test] - fn fold_into_bytes_ascii() { - assert_eq!(fold_into_bytes(String::new()), b""); - assert_eq!(fold_into_bytes("Hello, WORLD!".into()), b"hello, world!"); - assert_eq!(fold_into_bytes("abc 123 XYZ".into()), b"abc 123 xyz"); - } - - #[test] - fn simple_fold_returns_string() { - // Public `String` wrapper: ASCII, length-preserving, shrinking and - // growing folds all yield valid UTF-8. - assert_eq!(simple_fold("Hello, WORLD!".to_string()), "hello, world!"); - assert_eq!(simple_fold("ÜBER Größe".to_string()), "über größe"); - assert_eq!(simple_fold("\u{212A}elvin".to_string()), "kelvin"); - assert_eq!(simple_fold("abc\u{023A}".to_string()), "abc\u{2C65}"); - // Non-folding multibyte content is returned unchanged. - assert_eq!(simple_fold("漢字 שלום".to_string()), "漢字 שלום"); - } - - #[test] - fn fold_into_bytes_ascii_then_utf8_handoff() { - // ASCII prefix gets lowercased by the tier-1 loop, then control - // hands off to the tier-2 reallocating UTF-8 path at the first - // multibyte lead. - assert_eq!( - fold_into_bytes("MIXED Größe TEXT".into()), - "mixed größe text".as_bytes(), - ); - // ASCII prefix, then a *shrinking* fold inside the tail. - assert_eq!(fold_into_bytes("LORD \u{212A}elvin".into()), b"lord kelvin",); - // ASCII prefix, then a *growing* fold. 
- assert_eq!( - fold_into_bytes("abc\u{023A}".into()), - "abc\u{2C65}".as_bytes(), - ); - } - - #[test] - fn fold_into_bytes_length_preserving_bmp() { - assert_eq!(fold_into_bytes("ÄÖÜ".into()), "äöü".as_bytes()); - assert_eq!(fold_into_bytes("ΑΒΓ".into()), "αβγ".as_bytes()); - assert_eq!(fold_into_bytes("漢字".into()), "漢字".as_bytes()); - } - - #[test] - fn fold_into_bytes_reuses_buffer_for_ascii_input() { - // Pure-ASCII inputs are lowercased in place — the returned Vec must - // hold the exact same allocation as the input String. - let s = "MIXED case AsCiI 12345".to_string(); - let original_ptr = s.as_ptr(); - let out = fold_into_bytes(s); - assert_eq!(out, b"mixed case ascii 12345"); - assert_eq!(out.as_ptr(), original_ptr); - } - - #[test] - fn fold_into_bytes_reuses_buffer_for_nonfolding_nonascii() { - // Non-ASCII content that never folds (CJK + Hebrew) plus ASCII upper - // case: the ASCII is lowercased in place and, because no multibyte - // character folds, the original allocation is handed back with no - // second buffer — same pointer as the input String. - let s = "HELLO 日本語 שלום WORLD".to_string(); - let original_ptr = s.as_ptr(); - let out = fold_into_bytes(s); - assert_eq!(out, "hello 日本語 שלום world".as_bytes()); - assert_eq!(out.as_ptr(), original_ptr); - } - - #[test] - fn fold_into_bytes_handles_shrinking_fold() { - // U+212A KELVIN SIGN (3 bytes) folds to U+006B 'k' (1 byte). - assert_eq!(fold_into_bytes("\u{212A}elvin".into()), b"kelvin"); - // Shrink inside a longer string. - let out = fold_into_bytes("LORD \u{212A}elvin RULES".into()); - assert_eq!(out, b"lord kelvin rules"); - // U+2126 OHM SIGN (3 bytes) folds to U+03C9 'ω' (2 bytes). 
- assert_eq!(fold_into_bytes("\u{2126}".into()), "\u{03C9}".as_bytes()); - } - - #[test] - fn fold_into_bytes_handles_growing_fold() { - // The Unicode 16.0 simple-fold table has exactly two folds that - // grow in UTF-8 length (verified by scanning CaseFolding.txt): - // U+023A → U+2C65 and U+023E → U+2C66, both 2 B → 3 B. - - // U+023A 'Ⱥ' is 2 bytes, folds to U+2C65 'ⱥ' = 3 bytes. - assert_eq!(fold_into_bytes("\u{023A}".into()), "\u{2C65}".as_bytes()); - // U+023E 'Ⱦ' is 2 bytes, folds to U+2C66 'ⱦ' = 3 bytes. - assert_eq!(fold_into_bytes("\u{023E}".into()), "\u{2C66}".as_bytes()); - - // Each one mid-string, with mixed length-preserving context on both - // sides so that the bail-out path also copies a prefix that already - // contains a length-preserving rewrite. - let out = fold_into_bytes("ABC\u{023A}xyz".into()); - assert_eq!(out, "abc\u{2C65}xyz".as_bytes()); - let out = fold_into_bytes("ABC\u{023E}xyz".into()); - assert_eq!(out, "abc\u{2C66}xyz".as_bytes()); - - // Both growing folds inside the same string: the second one occurs - // after we have already switched to the allocating buffer. - let out = fold_into_bytes("\u{023A}\u{023E}".into()); - assert_eq!(out, "\u{2C65}\u{2C66}".as_bytes()); - - // Mixed: a length-preserving fold, then a shrinking fold, then both - // growing folds — exercises every branch in one input. - let out = fold_into_bytes("Ä\u{212A}\u{023A}\u{023E}".into()); - assert_eq!(out, "ä\u{006B}\u{2C65}\u{2C66}".as_bytes()); - } - - #[test] - fn fold_into_bytes_matches_reference_map() { - // Cross-check against the reference fold map on a varied input. - let r = reference(); - let input = "Quick BROWN Fox 🦊 ÜBER Größe ΣΟΦΙΑ \u{0130}\u{023A}漢"; - assert_eq!(fold_into_bytes(input.to_string()), fold_oracle(&r, input)); - } - - #[test] - fn fold_into_bytes_matches_reference_map_exhaustive() { - // Drive every assigned code point through the byte-oriented fold path - // and cross-check against the reference fold map. 
This guarantees the - // UTF-8 lead-byte reject filter never skips a code point that actually - // folds (a false reject would corrupt output here). A leading 'X' - // forces the tier-2 UTF-8 tail to run from the very first char. - let r = reference(); - let mut input = String::from("X"); - for cp in 0x80..0x110000u32 { - if (0xD800..0xE000).contains(&cp) { - continue; // surrogates aren't valid chars - } - input.push(char::from_u32(cp).expect("cp is a valid non-surrogate char")); - } - let expected = fold_oracle(&r, &input); - assert_eq!(fold_into_bytes(input), expected); - } } diff --git a/crates/casefold/src/simple_fold.rs b/crates/casefold/src/simple_fold.rs new file mode 100644 index 0000000..81417de --- /dev/null +++ b/crates/casefold/src/simple_fold.rs @@ -0,0 +1,352 @@ +//! Unicode simple case-folding to a `String`, built on the shared paged-bitmap +//! run table. + +use crate::table::*; +use crate::{popcount_up_to, scan_end_low, utf8_len}; + +/// Consumes `s` and returns its simple case-folded form as a `String`. The +/// input's heap buffer is reused untouched whenever folding changes no bytes — +/// that covers pure-ASCII / already-lowercase input (folded in place) *and* +/// any input whose multibyte characters never fold (CJK, Hangul, Kana, +/// Arabic, Hebrew, Indic, symbols, …). A fresh buffer is allocated only once +/// an actual case fold is encountered; from there, unmodified spans are +/// bulk-copied and folded characters are re-encoded in between. +/// +/// Folds may shrink (e.g. U+212A KELVIN SIGN is 3 bytes but folds to `k` = +/// 1 byte) or grow (e.g. U+023A `Ⱥ` is 2 bytes but folds to U+2C65 `ⱥ` = +/// 3 bytes), so in-place rewriting isn't possible in general — but inputs that +/// don't fold at all skip the second buffer entirely. +/// +/// Only **simple** (1-to-1) folds are applied; multi-character folds such as +/// `ß` → `ss` and Turkic locale folds are left unchanged. 
+/// +/// # Example +/// +/// ``` +/// use casefold::simple_fold; +/// assert_eq!(simple_fold("Hello, WORLD!".to_string()), "hello, world!"); +/// assert_eq!(simple_fold("ÜBER".to_string()), "über"); +/// // Length-changing fold (U+212A KELVIN SIGN → U+006B, 3 bytes → 1 byte): +/// assert_eq!(simple_fold("\u{212A}elvin".to_string()), "kelvin"); +/// ``` +pub fn simple_fold(s: String) -> String { + // SAFETY: `fold_into_bytes` only lowercases ASCII bytes in place and + // re-encodes whole characters through the fold table, so its output is + // always valid UTF-8 (see the exhaustive round-trip test). + unsafe { String::from_utf8_unchecked(fold_into_bytes(s)) } +} + +/// Byte-level core of [`simple_fold`]. Returns the fold as a `Vec<u8>` that is +/// always valid UTF-8; see [`simple_fold`] for the allocation behavior. +fn fold_into_bytes(s: String) -> Vec<u8> { + let mut bytes = s.into_bytes(); + // Tier 1 — full straight-through pass: lowercase every ASCII A..=Z byte + // in place (a no-op on any non-ASCII byte, since `b.wrapping_sub(b'A')` + // is ≥ 26 for every byte outside 0x41..=0x5A), and OR all bytes together + // so a single sign-bit test afterwards tells us whether the input + // contained any multibyte UTF-8 sequences. No early `break`, no + // input-dependent control flow — LLVM auto-vectorizes the loop. + let mut high_bit_acc: u8 = 0; + for b in &mut bytes { + high_bit_acc |= *b; + let is_upper = b.wrapping_sub(b'A') < 26; + *b |= u8::from(is_upper) << 5; + } + if high_bit_acc & 0x80 == 0 { + return bytes; + } + // Non-ASCII bytes are present. Locate the first one (SIMD-fast via + // `position`/memchr) and hand off to the UTF-8 path from there — the + // ASCII prefix is already lowercased and folding is idempotent on + // lower-case ASCII, so skipping it is purely an optimization. 
+ let first_non_ascii = bytes + .iter() + .position(|&b| b & 0x80 != 0) + .expect("a non-ASCII byte exists (the high-bit accumulator was set)"); + fold_non_ascii_tail(bytes, first_non_ascii) +} + +/// Tier 2 — copy-on-fold UTF-8 path. Scans the non-ASCII tail of the +/// already-(ASCII-)lowercased `bytes` for the first character that actually +/// folds to *different* bytes. Until one is found nothing is copied, so an +/// input whose multibyte content never folds is returned in its original +/// allocation untouched. Once a folding character is hit, a fresh buffer is +/// allocated and the rest is built by bulk-copying each contiguous unmodified +/// span and re-encoding the folded characters in between. The returned bytes +/// are always valid UTF-8. +/// +/// Characters are never fully decoded: the page index (`cp >> 6`) comes from +/// the first one or two bytes for the `PAGE_BITMAP` reject, and on a page hit +/// the remaining `cp & 0x3F` is read directly from the final byte to drive the +/// within-page run search and byte-delta fold — no code-point reconstruction. +fn fold_non_ascii_tail(bytes: Vec<u8>, start: usize) -> Vec<u8> { + let mut out: Vec<u8> = Vec::new(); + let src = bytes.as_ptr(); + // Raw write cursor into `out`'s buffer. Null until the first real fold + // allocates `out` (its pointer is then non-null), so `dst.is_null()` doubles + // as the "haven't started building the output yet" flag. We bypass the Vec + // push/reserve API: the buffer is reserved once for the worst case, so every + // copy/store below is unchecked. + let mut dst: *mut u8 = core::ptr::null_mut(); + // `flushed` marks the start of the contiguous run of `bytes` that is + // already correct but not yet copied out. + let mut flushed = 0usize; + let mut read = start; + while read < bytes.len() { + // ASCII (already lowercased by tier 1) — unchanged, keep scanning. 
+ let lead = bytes[read]; + if lead & 0x80 == 0 { + read += 1; + continue; + } + // Page-precision reject probe (see the module docs). Recover the + // `PAGE_BITMAP` coordinates of `cp >> 6` directly as `(word_idx, + // bit_idx)` — `cp >> 12` indexes the bitmap word and `(cp >> 6) & 63` + // the bit — without materializing the combined page number. + let (word_idx, bit_idx, c_len) = if lead < 0xE0 { + (0usize, (lead & 0x1F) as u32, 2usize) + } else if lead < 0xF0 { + ((lead & 0x0F) as usize, (bytes[read + 1] & 0x3F) as u32, 3) + } else { + ( + (((lead & 0x07) as usize) << 6) | (bytes[read + 1] & 0x3F) as usize, + (bytes[read + 2] & 0x3F) as u32, + 4, + ) + }; + if word_idx >= PAGE_BITMAP.len() || (PAGE_BITMAP[word_idx] >> bit_idx) & 1 == 0 { + read += c_len; + continue; + } + let low_v = bytes[read + c_len - 1] & 0x3F; + let dense = popcount_up_to(word_idx, bit_idx) as usize; + let lo = PAGE_OFFSET[dense] as usize; + let n = PAGE_OFFSET[dense + 1] as usize - lo; + let off = scan_end_low(lo, n, low_v); + let idx = if off < n { + // The scan guarantees `low_v <= end_low`; the run covers `low_v` + // iff `low_v >= start_low` (and, for stride 2, the offset is even). + // No code-point reconstruction — `low_v` is compared directly. + let ss = RUN_START_STRIDE[lo + off]; + let start_low = ss & 0x3F; + let stride_bit = ss >> 6; + if low_v < start_low || ((low_v - start_low) & stride_bit) != 0 { + read += c_len; + continue; + } + lo + off + } else { + read += c_len; + continue; + }; + // Load the character's bytes as a little-endian u32, mask off the lanes + // past it, add the run's constant byte delta. Over-reading 4 bytes is + // safe except within ≤3 bytes of the buffer end; the variable-length + // fallback there is far slower (a `memcpy` call per fold), so the fast + // path is worth the branch. 
+ let raw = if read + 4 <= bytes.len() { + u32::from_le_bytes(bytes[read..read + 4].try_into().expect("4-byte slice")) + } else { + let mut w = [0u8; 4]; + w[..c_len].copy_from_slice(&bytes[read..read + c_len]); + u32::from_le_bytes(w) + }; + let word = raw & (u32::MAX >> ((4 - c_len) * 8)); + let folded = word.wrapping_add(BYTE_DELTA[idx]); + let dest_len = utf8_len((folded & 0xFF) as u8); + if dst.is_null() { + // Reserve once for the worst case so the writes below never need a + // per-store capacity check. Output is at most 1.5× the input: the + // only folds that grow are U+023A/U+023E (2→3 bytes), so every 2 + // input bytes yield ≤3 output bytes; `+ 4` covers the 4-byte + // over-store of the final character. The non-zero capacity makes + // `out.as_mut_ptr()` non-null, so `dst` is non-null from here on. + out = Vec::with_capacity(bytes.len() + bytes.len() / 2 + 4); + dst = out.as_mut_ptr(); + } + // SAFETY: the buffer is reserved for the worst-case 1.5× output plus 4 + // bytes of over-store headroom, so `dst` (the running output length) + // plus the 4-byte store stays in bounds for every iteration. `src` and + // `dst` are distinct allocations. + unsafe { + let run = read - flushed; + if run != 0 { + core::ptr::copy_nonoverlapping(src.add(flushed), dst, run); + dst = dst.add(run); + } + // Store a full 4-byte word, advance only by the real folded length. + dst.cast::<u32>().write_unaligned(folded.to_le()); + dst = dst.add(dest_len); + } + read += c_len; + flushed = read; + } + if dst.is_null() { + // Nothing folded — return the original buffer with no extra copy. + return bytes; + } + // SAFETY: the trailing unmodified run fits in the reserved buffer; `dst` + // minus the base pointer is the total number of bytes written. 
+ unsafe { + let tail = bytes.len() - flushed; + core::ptr::copy_nonoverlapping(src.add(flushed), dst, tail); + dst = dst.add(tail); + out.set_len(dst as usize - out.as_ptr() as usize); + } + out +} + +#[cfg(test)] +mod tests { + use super::*; + use crate::test_support::reference; + use std::collections::HashMap; + + /// Per-character fold via the reference map, used as the oracle for the + /// byte-oriented `fold_into_bytes` cross-checks below. + fn fold_oracle(r: &HashMap<u32, u32>, s: &str) -> Vec<u8> { + let mut out = String::new(); + for c in s.chars() { + let cp = c as u32; + let folded = r.get(&cp).copied().unwrap_or(cp); + out.push(char::from_u32(folded).expect("reference fold is a valid char")); + } + out.into_bytes() + } + + #[test] + fn fold_into_bytes_ascii() { + assert_eq!(fold_into_bytes(String::new()), b""); + assert_eq!(fold_into_bytes("Hello, WORLD!".into()), b"hello, world!"); + assert_eq!(fold_into_bytes("abc 123 XYZ".into()), b"abc 123 xyz"); + } + + #[test] + fn simple_fold_returns_string() { + // Public `String` wrapper: ASCII, length-preserving, shrinking and + // growing folds all yield valid UTF-8. + assert_eq!(simple_fold("Hello, WORLD!".to_string()), "hello, world!"); + assert_eq!(simple_fold("ÜBER Größe".to_string()), "über größe"); + assert_eq!(simple_fold("\u{212A}elvin".to_string()), "kelvin"); + assert_eq!(simple_fold("abc\u{023A}".to_string()), "abc\u{2C65}"); + // Non-folding multibyte content is returned unchanged. + assert_eq!(simple_fold("漢字 שלום".to_string()), "漢字 שלום"); + } + + #[test] + fn fold_into_bytes_ascii_then_utf8_handoff() { + // ASCII prefix gets lowercased by the tier-1 loop, then control + // hands off to the tier-2 reallocating UTF-8 path at the first + // multibyte lead. + assert_eq!( + fold_into_bytes("MIXED Größe TEXT".into()), + "mixed größe text".as_bytes(), + ); + // ASCII prefix, then a *shrinking* fold inside the tail. 
+ assert_eq!(fold_into_bytes("LORD \u{212A}elvin".into()), b"lord kelvin",); + // ASCII prefix, then a *growing* fold. + assert_eq!( + fold_into_bytes("abc\u{023A}".into()), + "abc\u{2C65}".as_bytes(), + ); + } + + #[test] + fn fold_into_bytes_length_preserving_bmp() { + assert_eq!(fold_into_bytes("ÄÖÜ".into()), "äöü".as_bytes()); + assert_eq!(fold_into_bytes("ΑΒΓ".into()), "αβγ".as_bytes()); + assert_eq!(fold_into_bytes("漢字".into()), "漢字".as_bytes()); + } + + #[test] + fn fold_into_bytes_reuses_buffer_for_ascii_input() { + // Pure-ASCII inputs are lowercased in place — the returned Vec must + // hold the exact same allocation as the input String. + let s = "MIXED case AsCiI 12345".to_string(); + let original_ptr = s.as_ptr(); + let out = fold_into_bytes(s); + assert_eq!(out, b"mixed case ascii 12345"); + assert_eq!(out.as_ptr(), original_ptr); + } + + #[test] + fn fold_into_bytes_reuses_buffer_for_nonfolding_nonascii() { + // Non-ASCII content that never folds (CJK + Hebrew) plus ASCII upper + // case: the ASCII is lowercased in place and, because no multibyte + // character folds, the original allocation is handed back with no + // second buffer — same pointer as the input String. + let s = "HELLO 日本語 שלום WORLD".to_string(); + let original_ptr = s.as_ptr(); + let out = fold_into_bytes(s); + assert_eq!(out, "hello 日本語 שלום world".as_bytes()); + assert_eq!(out.as_ptr(), original_ptr); + } + + #[test] + fn fold_into_bytes_handles_shrinking_fold() { + // U+212A KELVIN SIGN (3 bytes) folds to U+006B 'k' (1 byte). + assert_eq!(fold_into_bytes("\u{212A}elvin".into()), b"kelvin"); + // Shrink inside a longer string. + let out = fold_into_bytes("LORD \u{212A}elvin RULES".into()); + assert_eq!(out, b"lord kelvin rules"); + // U+2126 OHM SIGN (3 bytes) folds to U+03C9 'ω' (2 bytes). 
+ assert_eq!(fold_into_bytes("\u{2126}".into()), "\u{03C9}".as_bytes()); + } + + #[test] + fn fold_into_bytes_handles_growing_fold() { + // The Unicode 16.0 simple-fold table has exactly two folds that + // grow in UTF-8 length (verified by scanning CaseFolding.txt): + // U+023A → U+2C65 and U+023E → U+2C66, both 2 B → 3 B. + + // U+023A 'Ⱥ' is 2 bytes, folds to U+2C65 'ⱥ' = 3 bytes. + assert_eq!(fold_into_bytes("\u{023A}".into()), "\u{2C65}".as_bytes()); + // U+023E 'Ⱦ' is 2 bytes, folds to U+2C66 'ⱦ' = 3 bytes. + assert_eq!(fold_into_bytes("\u{023E}".into()), "\u{2C66}".as_bytes()); + + // Each one mid-string, with mixed length-preserving context on both + // sides so that the bail-out path also copies a prefix that already + // contains a length-preserving rewrite. + let out = fold_into_bytes("ABC\u{023A}xyz".into()); + assert_eq!(out, "abc\u{2C65}xyz".as_bytes()); + let out = fold_into_bytes("ABC\u{023E}xyz".into()); + assert_eq!(out, "abc\u{2C66}xyz".as_bytes()); + + // Both growing folds inside the same string: the second one occurs + // after we have already switched to the allocating buffer. + let out = fold_into_bytes("\u{023A}\u{023E}".into()); + assert_eq!(out, "\u{2C65}\u{2C66}".as_bytes()); + + // Mixed: a length-preserving fold, then a shrinking fold, then both + // growing folds — exercises every branch in one input. + let out = fold_into_bytes("Ä\u{212A}\u{023A}\u{023E}".into()); + assert_eq!(out, "ä\u{006B}\u{2C65}\u{2C66}".as_bytes()); + } + + #[test] + fn fold_into_bytes_matches_reference_map() { + // Cross-check against the reference fold map on a varied input. + let r = reference(); + let input = "Quick BROWN Fox 🦊 ÜBER Größe ΣΟΦΙΑ \u{0130}\u{023A}漢"; + assert_eq!(fold_into_bytes(input.to_string()), fold_oracle(&r, input)); + } + + #[test] + fn fold_into_bytes_matches_reference_map_exhaustive() { + // Drive every assigned code point through the byte-oriented fold path + // and cross-check against the reference fold map. 
This guarantees the + // UTF-8 lead-byte reject filter never skips a code point that actually + // folds (a false reject would corrupt output here). A leading 'X' + // forces the tier-2 UTF-8 tail to run from the very first char. + let r = reference(); + let mut input = String::from("X"); + for cp in 0x80..0x110000u32 { + if (0xD800..0xE000).contains(&cp) { + continue; // surrogates aren't valid chars + } + input.push(char::from_u32(cp).expect("cp is a valid non-surrogate char")); + } + let expected = fold_oracle(&r, &input); + assert_eq!(fold_into_bytes(input), expected); + } +}
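
Two steps described in the comments of `fold_into_bytes` and `fold_non_ascii_tail` can be tried in isolation: the branch-free tier-1 ASCII pass with its high-bit accumulator, and the recovery of the page coordinates `(cp >> 12, (cp >> 6) & 63)` straight from the leading UTF-8 bytes. The sketch below is a standalone illustration, not code from the crate; the helper names `ascii_lower_in_place` and `page_coords` are hypothetical, and it assumes well-formed UTF-8 input.

```rust
/// Hypothetical helper mirroring the tier-1 loop: lowercases ASCII `A..=Z`
/// in place and reports whether any byte had its high bit set.
fn ascii_lower_in_place(bytes: &mut [u8]) -> bool {
    let mut high_bit_acc: u8 = 0;
    for b in bytes.iter_mut() {
        high_bit_acc |= *b;
        // `b - b'A'` wraps to a value >= 26 for every byte outside `A..=Z`,
        // so the OR below only ever sets bit 5 on upper-case ASCII letters.
        let is_upper = b.wrapping_sub(b'A') < 26;
        *b |= u8::from(is_upper) << 5;
    }
    high_bit_acc & 0x80 != 0
}

/// Hypothetical helper mirroring the page probe: returns
/// `(cp >> 12, (cp >> 6) & 63, utf8_len)` for a well-formed multibyte
/// sequence without reconstructing the code point.
fn page_coords(seq: &[u8]) -> (usize, u32, usize) {
    let lead = seq[0];
    if lead < 0xE0 {
        // 2-byte form: cp < 0x800, so `cp >> 12` is 0 and the lead's
        // payload bits are exactly `cp >> 6`.
        (0, (lead & 0x1F) as u32, 2)
    } else if lead < 0xF0 {
        // 3-byte form: lead carries `cp >> 12`, second byte bits 6..12.
        ((lead & 0x0F) as usize, (seq[1] & 0x3F) as u32, 3)
    } else {
        // 4-byte form: lead and second byte together carry `cp >> 12`.
        (
            (((lead & 0x07) as usize) << 6) | (seq[1] & 0x3F) as usize,
            (seq[2] & 0x3F) as u32,
            4,
        )
    }
}

fn main() {
    let mut buf = "Hello, WORLD! Größe".to_string().into_bytes();
    let has_multibyte = ascii_lower_in_place(&mut buf);
    assert!(has_multibyte);
    // Only ASCII is lowercased here; multibyte characters are untouched.
    assert_eq!(buf, "hello, world! größe".as_bytes());

    // Cross-check the byte-level coordinates against the decoded code point.
    for c in ['Ü', '\u{212A}', '漢', '🦊'] {
        let mut tmp = [0u8; 4];
        let seq = c.encode_utf8(&mut tmp).as_bytes();
        let cp = c as u32;
        let (word_idx, bit_idx, len) = page_coords(seq);
        assert_eq!(word_idx as u32, cp >> 12);
        assert_eq!(bit_idx, (cp >> 6) & 63);
        assert_eq!(len, c.len_utf8());
    }
}
```

The loop in `main` checks the byte-level coordinates against a full decode for one character of each multibyte length, which is the same property the exhaustive test above relies on when it guards against false rejects.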