Some edits and formatting cleanup. #125
Pull request overview
This PR relocates the casefold performance/release blog content from crates/casefold/BLOG.md into a docs-targeted Markdown file intended for rendering from crates/casefold/docs/.
Changes:
- Added `crates/casefold/docs/release_blog.md` containing the blog post content.
- Removed the previous `crates/casefold/BLOG.md` version of the post.
| File | Description |
|---|---|
| `crates/casefold/docs/release_blog.md` | New docs-hosted blog Markdown; currently contains several Markdown/emphasis and Rust snippet formatting issues that affect rendering/copy-paste correctness. |
| `crates/casefold/BLOG.md` | Deleted the prior blog post file from the crate root (content moved to docs). |
Copilot's findings
- Files reviewed: 2/2 changed files
- Comments generated: 11
| - **Identifiers and protocols | ||
| ** use case-insensitive comparison of usernames, hostnames, file paths, HTTP headers, and so on. |
| These diverge on real characters — `ß`, | ||
| `İ`, final sigma — and lowercasing as a stand-in silently produces incorrect matches. This crate implements the **simple | ||
| ** (1-to-1) folds — statuses `C` and `S` in [ | ||
| `CaseFolding.txt`](https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt) — and deliberately | ||
| *not* the multi-character "full" folds (`ß` → `ss`) or Turkic locale folds. |
|
|
||
| It looks ideal: do the cheap byte work, and the instant you hit a non-ASCII byte, | ||
| `break` and let the "real" Unicode path take over — "only do the cheap work until you have to." On an Apple M4 this runs at about | ||
| **3 GiB/s**. That sounds fine in isolation, but it is more than **15× short** of "optimal" because of the `if` branchs. |
| let mut high_bit_acc: u8 = 0; | ||
| for b in & mut bytes { | ||
| high_bit_acc |= * b; // detect any non-ASCII byte | ||
| let is_upper = b.wrapping_sub(b'A') < 26; // branchless A..=Z test | ||
| * b |= u8::from(is_upper) < < 5; // set bit 5 → lowercase, else no-op | ||
| } | ||
| if high_bit_acc & 0x80 == 0 { | ||
| return bytes; // pure ASCII: already folded in place, no second buffer | ||
| } |
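For reference, the hunk above with the operator spacing normalized (`&mut`, `*b`, `<<`) can be sketched as a self-contained function. The name `ascii_fold_in_place` and the `bool` return value are assumptions made for illustration; the blog's own function is `simple_fold`, whose full body is not shown in this diff.

```rust
/// Hypothetical helper (not the crate's API): lowercases ASCII `A..=Z` in a
/// single branch-free sweep and reports whether the input was pure ASCII.
fn ascii_fold_in_place(bytes: &mut [u8]) -> bool {
    let mut high_bit_acc: u8 = 0;
    for b in bytes.iter_mut() {
        high_bit_acc |= *b; // OR-accumulate so any byte >= 0x80 leaves bit 7 set
        let is_upper = b.wrapping_sub(b'A') < 26; // branchless A..=Z test
        *b |= u8::from(is_upper) << 5; // set bit 5 for uppercase, no-op otherwise
    }
    // A clear high bit means every byte was ASCII and is already folded in place.
    high_bit_acc & 0x80 == 0
}
```

Bytes at or above `0x80` fail the `A..=Z` test after the wrapping subtraction, so the sweep leaves multi-byte sequences untouched and a caller can hand them to a slower Unicode path.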
| `u64` lanes and checking all their high bits with a single | ||
| `& 0x8080_8080_8080_8080` mask. You can build the ASCII fast path on top of that: chunk-scan to find the ASCII prefix, then run the branchless (vectorizable) convert over it. That keeps the early-exit ability — it still bails on the first non-ASCII block — while letting both halves go fast. The catch is that it reads the data | ||
| **twice** (once to scan, once to convert), landing at about **23 GiB/s | ||
| ** — roughly half of the single-pass branchless sweep, and ~7× the naive break loop. A solid, general-purpose default; just not the absolute ceiling when you control the whole loop and can fold detection and conversion into one branch-free pass. |
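The chunk scan described in this hunk can be sketched as follows. This is a minimal illustration of the `0x8080_8080_8080_8080` mask test only; the name `ascii_prefix_len` and the bytewise tail loop are assumptions, not code taken from the crate or from the standard library.

```rust
const HIGH_BITS: u64 = 0x8080_8080_8080_8080;

/// Hypothetical helper: length of the leading all-ASCII run of `bytes`.
/// Whole 8-byte blocks are tested with one mask; the remainder is scanned bytewise.
fn ascii_prefix_len(bytes: &[u8]) -> usize {
    let mut i = 0;
    while i + 8 <= bytes.len() {
        let mut lane = [0u8; 8];
        lane.copy_from_slice(&bytes[i..i + 8]);
        // Byte order does not matter: the mask tests bit 7 of every byte.
        if u64::from_ne_bytes(lane) & HIGH_BITS != 0 {
            break; // this block holds at least one non-ASCII byte
        }
        i += 8;
    }
    // Finish bytewise to locate the exact boundary.
    while i < bytes.len() && bytes[i] < 0x80 {
        i += 1;
    }
    i
}
```

A caller would then run the branchless convert over `bytes[..ascii_prefix_len(bytes)]`, which is the second read of the data that the hunk mentions.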
| > Branchless is a *pessimization* in scalar code.** Look again at the | ||
| > table: making the body branchless while *keeping* the `break` (2.6 GiB/s) is |
| 40 GiB/s also means doing zero unnecessary allocation. `simple_fold` takes the input `String` *by | ||
| value*, owning the heap buffer it can mutate and return it. If the OR-accumulator's high bit was clear, the input was pure ASCII — already folded in place — we hand the | ||
| **same allocation** straight back, no second buffer and no copy. Otherwise we | ||
| `memchr` to the first non-ASCII byte and scan the tail from there, leaving the output buffer | ||
| *unallocated* (a null write cursor) until we hit a character that folds to **different bytes | ||
| **. Text whose multibyte content never folds — CJK, Hangul, Kana, Arabic, Hebrew, symbols — also returns the original allocation untouched, never copying a byte. | ||
|
|
||
| Why a *second* buffer rather than rewriting in place like the ASCII pass? Because folding can make the string **longer | ||
| **: almost every fold preserves the UTF-8 length or shrinks it, but two outliers grow — U+023A (`Ⱥ`) and U+023E ( | ||
| `Ɀ`) are 2 bytes each yet fold to 3-byte characters (`ⱥ`, | ||
| `ɀ`). Once one appears, the output no longer fits in the input's bytes, and we need somewhere new to write. |
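The length growth claimed for U+023A and U+023E can be checked with a small sketch. It uses the standard library's `char::to_lowercase` as a stand-in for the crate's simple fold, which is an assumption: the two mappings agree for these code points, but they are not the same operation in general. The helper name `utf8_len_before_after` is hypothetical.

```rust
/// Hypothetical helper: UTF-8 length of `c` and of its one-to-one lowercase
/// mapping, used here as a stand-in for the simple case fold.
fn utf8_len_before_after(c: char) -> (usize, usize) {
    // Take only the first mapped char; the code points of interest map 1-to-1.
    let folded = c.to_lowercase().next().unwrap_or(c);
    (c.len_utf8(), folded.len_utf8())
}
```

Calling it with `'\u{023A}'` or `'\u{023E}'` reports a 2-byte input and a 3-byte result (U+2C65 and U+2C66 respectively), which is why an in-place rewrite cannot be guaranteed to fit.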
| The pure-ASCII row is the fairest fight of all: there `str::to_lowercase` | ||
| produces the **exact same bytes | ||
| ** we do — a correct std-library baseline rather than a different operation — and even then the branch-free sweep is ~1.5× faster (40.8 vs 27.7 GiB/s), because | ||
| `to_lowercase` still scans for the first non-ASCII byte and allocates a fresh |
| let (word_idx, bit_idx, c_len) = if lead < 0xE0 { | ||
| (0usize, lead & 0x1F, 2usize) // 2-byte: word 0 | ||
| } else if lead < 0xF0 { | ||
| ((lead & 0x0F) as usize, bytes[read + 1] & 0x3F, 3) // 3-byte: word = nibble | ||
| } else { | ||
| ((((lead & 0x07) as usize) < < 6) | (bytes[read + 1] & 0x3F) as usize, bytes[read + 2] & 0x3F, 4usize,)// 4-byte: merge 2 bytes | ||
| }; | ||
| // reject without decoding: clear bit ⇒ no fold | ||
| if word_idx > = PAGE_BITMAP.len() || (PAGE_BITMAP[word_idx] >> bit_idx) & 1 == 0 { | ||
| read += c_len; | ||
| continue; | ||
| } |
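With the spacing in `<<` and `>=` repaired, the lookup in this hunk can be sketched as two small functions. The names `page_coords` and `may_fold`, the `u64` word width (inferred from bit indices that reach 63), and passing the bitmap as a slice are all assumptions for illustration; the crate's real `PAGE_BITMAP` contents are generated and not shown in the diff.

```rust
/// Hypothetical sketch: map the multi-byte character starting at `bytes[read]`
/// to (bitmap word index, bit index, encoded length) without decoding it.
/// Assumes well-formed, shortest-form UTF-8 and a lead byte >= 0xC2.
fn page_coords(bytes: &[u8], read: usize) -> (usize, u8, usize) {
    let lead = bytes[read];
    if lead < 0xE0 {
        (0, lead & 0x1F, 2) // 2-byte: word 0, bit from the lead byte
    } else if lead < 0xF0 {
        ((lead & 0x0F) as usize, bytes[read + 1] & 0x3F, 3) // 3-byte: word = lead nibble
    } else {
        (
            (((lead & 0x07) as usize) << 6) | (bytes[read + 1] & 0x3F) as usize,
            bytes[read + 2] & 0x3F,
            4, // 4-byte: word merges the first two bytes
        )
    }
}

/// A clear bit, or a word index past the end of the table, means "no fold here".
fn may_fold(page_bitmap: &[u64], word_idx: usize, bit_idx: u8) -> bool {
    word_idx < page_bitmap.len() && (page_bitmap[word_idx] >> bit_idx) & 1 == 1
}
```

When `may_fold` returns `false`, the caller can advance by the returned length and continue, which matches the `read += c_len; continue;` path in the quoted snippet.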
|
|
||
| The crate is [`casefold`](../README.md); the generated table and full design notes live alongside the source. | ||
|
|
||
| [^overlong]: The byte-space arithmetic assumes the input is **well-formed, shortest-form UTF-8 |
We probably need to decide if we want to use footnotes, info boxes or both.
Addresses PR #125 review: move the 'Treat the absolute figures as illustrative' note out of the table intro into a [^bench] footnote defined at the bottom of the file alongside [^overlong].

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
krukow left a comment
Looks good. My main feedback is to explain the concept and the problem a bit before showing the performance table, and to explain each column and row in that table (right now a reader unfamiliar with the subject will find it hard to tell what the table's terms refer to).
|
|
||
| Let's walk through the evolution in detail. | ||
|
|
||
| ## Why case-folding is even important? |
I would pull this section up so that it comes before the data table.
|
|
||
| Criterion medians on an Apple M4 (single core, `target-cpu=native`).[^bench] | ||
|
|
||
| | Workload (input size) | `simple_fold` | `simd_normalizer` | `HashMap` (byte path) | `str::to_lowercase` | `simdutf` round-trip | |
Can we explain how to read the table for a reader who is unfamiliar with its elements?
| > touches the data half as many times. It's the same lesson one more time: in the | ||
| > hot loop, the branch is the enemy. | ||
|
|
||
| It is genuinely faster to |
Can we elevate these lessons to the top of the article as a teaser that invites the reader to read on and learn more?
| | A runtime `HashMap<u32, u32>` | ~17 KB | | ||
| | **This crate (paged bitmap + packed runs)** | **1776 B** | | ||
|
|
||
| ## Takeaways |
Same for this: consider elevating these takeaways to the top of the article as a teaser that invites the reader to read on and learn more.