Some edits and formatting cleanup. by gorzell · Pull Request #125 · github/rust-gems

gorzell · 2026-06-10T12:51:24Z

Copilot

Pull request overview

This PR relocates the casefold performance/release blog content from crates/casefold/BLOG.md into a docs-targeted Markdown file intended for rendering from crates/casefold/docs/.

Changes:

- Added `crates/casefold/docs/release_blog.md` containing the blog post content.
- Removed the previous `crates/casefold/BLOG.md` version of the post.

Show a summary per file

| File | Description |
| --- | --- |
| `crates/casefold/docs/release_blog.md` | New docs-hosted blog Markdown; currently contains several Markdown/emphasis and Rust snippet formatting issues that affect rendering/copy-paste correctness. |
| `crates/casefold/BLOG.md` | Deleted the prior blog post file from the crate root (content moved to docs). |

Copilot's findings

Files reviewed: 2/2 changed files
Comments generated: 11

+- **Identifiers and protocols** use case-insensitive comparison of usernames, hostnames, file paths, HTTP headers, and so on.


+These diverge on real characters (`ß`, `İ`, final sigma), and lowercasing as a stand-in silently produces incorrect matches. This crate implements the **simple** (1-to-1) folds, statuses `C` and `S` in [`CaseFolding.txt`](https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt), and deliberately *not* the multi-character "full" folds (`ß` → `ss`) or Turkic locale folds.


+
+It looks ideal: do the cheap byte work, and the instant you hit a non-ASCII byte, `break` and let the "real" Unicode path take over. The idea is "only do the cheap work until you have to." On an Apple M4 this runs at about **3 GiB/s**. That sounds fine in isolation, but it is more than **15× short** of "optimal" because of the `if` branches.
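For readers of this thread who have not seen the post's listing, the early-exit loop this hunk describes might look roughly like the sketch below. The function name `fold_ascii_prefix_naive` and its return value (the length of the ASCII prefix) are assumptions made for illustration; they are not taken from the crate.

```rust
/// Lowercases the leading ASCII run of `bytes` in place and returns its length.
/// Hypothetical helper: it stops at the first non-ASCII byte so a separate
/// Unicode path could take over from that offset.
pub fn fold_ascii_prefix_naive(bytes: &mut [u8]) -> usize {
    let mut i = 0;
    while i < bytes.len() {
        let b = bytes[i];
        if b >= 0x80 {
            break; // first non-ASCII byte: leave the rest to the Unicode path
        }
        if b.is_ascii_uppercase() {
            bytes[i] = b | 0x20; // set bit 5: 'A'..='Z' becomes 'a'..='z'
        }
        i += 1;
    }
    i
}
```

Both `if`s sit inside the hot loop, which is the data-dependent branching the paragraph blames for the low throughput.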


+let mut high_bit_acc: u8 = 0;
+for b in &mut bytes {
+    high_bit_acc |= *b;                        // detect any non-ASCII byte
+    let is_upper = b.wrapping_sub(b'A') < 26;  // branchless A..=Z test
+    *b |= u8::from(is_upper) << 5;             // set bit 5 → lowercase, else no-op
+}
+if high_bit_acc & 0x80 == 0 {
+    return bytes; // pure ASCII: already folded in place, no second buffer
+}
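To check the "no second buffer" comment in the snippet above, here is a minimal self-contained sketch that wraps the same sweep in a function taking a `String` by value. The name `fold_ascii_in_place` and the `(String, bool)` return shape are assumptions for illustration, and the safe `String::from_utf8` call re-validates the buffer, which a tuned implementation would probably avoid.

```rust
/// Runs the branch-free ASCII sweep over an owned `String`.
/// Returns the (partially) folded string and whether the input was pure ASCII.
/// Hypothetical helper, not the crate's API.
pub fn fold_ascii_in_place(s: String) -> (String, bool) {
    let mut bytes = s.into_bytes(); // takes over the heap buffer, no copy
    let mut high_bit_acc: u8 = 0;
    for b in &mut bytes {
        let v = *b;
        high_bit_acc |= v; // any byte >= 0x80 sets bit 7 of the accumulator
        let is_upper = v.wrapping_sub(b'A') < 26; // true only for A..=Z
        *b = v | (u8::from(is_upper) << 5); // lowercase A..=Z, else no-op
    }
    // Only bytes in A..=Z were rewritten (to a..=z), so the buffer is still
    // valid UTF-8; `from_utf8` wraps the same allocation after validating it.
    let folded = String::from_utf8(bytes).expect("ASCII-only edits keep UTF-8 valid");
    (folded, high_bit_acc & 0x80 == 0)
}
```

Comparing `as_ptr()` before and after the call shows that the returned `String` reuses the input's heap buffer.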


+`u64` lanes and checking all their high bits with a single `& 0x8080_8080_8080_8080` mask. You can build the ASCII fast path on top of that: chunk-scan to find the ASCII prefix, then run the branchless (vectorizable) convert over it. That keeps the early-exit ability (it still bails on the first non-ASCII block) while letting both halves go fast. The catch is that it reads the data **twice** (once to scan, once to convert), landing at about **23 GiB/s**, roughly half of the single-pass branchless sweep and ~7× the naive break loop. A solid, general-purpose default; just not the absolute ceiling when you control the whole loop and can fold detection and conversion into one branch-free pass.
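The scan-then-convert approach in this hunk can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the standard library's or the crate's code: the names `ascii_prefix_len` and `fold_ascii_prefix_two_pass` are hypothetical, and the lanes are built with safe byte copies rather than aligned loads.

```rust
/// One bit per byte lane: set if that byte is >= 0x80.
const HIGH_BITS: u64 = 0x8080_8080_8080_8080;

/// Exact length of the leading ASCII run. Checks 8 bytes per step with the
/// high-bit mask, then finishes byte by byte.
pub fn ascii_prefix_len(bytes: &[u8]) -> usize {
    let mut len = 0;
    for chunk in bytes.chunks_exact(8) {
        let mut lane_bytes = [0u8; 8];
        lane_bytes.copy_from_slice(chunk);
        let lane = u64::from_le_bytes(lane_bytes);
        if lane & HIGH_BITS != 0 {
            break; // this block contains at least one non-ASCII byte
        }
        len += 8;
    }
    // Walk the remaining bytes (the partial tail, or the block that failed).
    while len < bytes.len() && bytes[len] < 0x80 {
        len += 1;
    }
    len
}

/// Pass 1 scans for the ASCII prefix, pass 2 lowercases it branch-free.
pub fn fold_ascii_prefix_two_pass(bytes: &mut [u8]) -> usize {
    let n = ascii_prefix_len(bytes);
    for b in &mut bytes[..n] {
        let v = *b;
        *b = v | (u8::from(v.wrapping_sub(b'A') < 26) << 5);
    }
    n
}
```

The prefix is touched once by the scan and once by the convert loop, which is the "reads the data twice" cost the paragraph mentions.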


+> Branchless is a *pessimization* in scalar code.** Look again at the
+> table: making the body branchless while *keeping* the `break` (2.6 GiB/s) is


+40 GiB/s also means doing zero unnecessary allocation. `simple_fold` takes the input `String` *by value*, so it owns the heap buffer and can mutate and return it. If the OR-accumulator's high bit was clear, the input was pure ASCII and is already folded in place, so we hand the **same allocation** straight back, with no second buffer and no copy. Otherwise we `memchr` to the first non-ASCII byte and scan the tail from there, leaving the output buffer *unallocated* (a null write cursor) until we hit a character that folds to **different bytes**. Text whose multibyte content never folds (CJK, Hangul, Kana, Arabic, Hebrew, symbols) also returns the original allocation untouched, never copying a byte.
+
+Why a *second* buffer rather than rewriting in place like the ASCII pass? Because folding can make the string **longer**: almost every fold preserves the UTF-8 length or shrinks it, but two outliers grow. U+023A (`Ⱥ`) and U+023E (`Ⱦ`) are 2 bytes each yet fold to 3-byte characters (`ⱥ`, `ⱦ`). Once one appears, the output no longer fits in the input's bytes, and we need somewhere new to write.
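The length claim is easy to check with the standard library alone. The sketch below uses `char::to_lowercase` as a stand-in for the simple fold, which is an assumption that holds for the code points tested here (their simple case fold and lowercase mapping coincide) but not for folding in general; the helper name `utf8_len_before_after` is hypothetical.

```rust
/// Returns (UTF-8 length of `c`, UTF-8 length of its lowercase mapping).
/// Stand-in for the simple fold on the code points discussed in this hunk.
pub fn utf8_len_before_after(c: char) -> (usize, usize) {
    let after: usize = c.to_lowercase().map(char::len_utf8).sum();
    (c.len_utf8(), after)
}
```

U+023A and U+023E come out as `(2, 3)`, the growing case, while U+212A (KELVIN SIGN) comes out as `(3, 1)`, an example of a fold that shrinks.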


+The pure-ASCII row is the fairest fight of all: there `str::to_lowercase` produces the **exact same bytes** we do (a correct std-library baseline rather than a different operation), and even then the branch-free sweep is ~1.5× faster (40.8 vs 27.7 GiB/s), because `to_lowercase` still scans for the first non-ASCII byte and allocates a fresh

+let (word_idx, bit_idx, c_len) = if lead < 0xE0 {
+    (0usize, lead & 0x1F, 2usize)                         // 2-byte: word 0
+} else if lead < 0xF0 {
+    ((lead & 0x0F) as usize, bytes[read + 1] & 0x3F, 3)   // 3-byte: word = nibble
+} else {
+    // 4-byte: merge 2 bytes
+    (
+        (((lead & 0x07) as usize) << 6) | (bytes[read + 1] & 0x3F) as usize,
+        bytes[read + 2] & 0x3F,
+        4usize,
+    )
+};
+// reject without decoding: clear bit ⇒ no fold
+if word_idx >= PAGE_BITMAP.len() || (PAGE_BITMAP[word_idx] >> bit_idx) & 1 == 0 {
+    read += c_len;
+    continue;
+}
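One way to convince yourself that the byte-space arithmetic in this hunk is sound is to compare it with the same key computed from a decoded code point. The sketch below assumes, as an inference from the masks rather than something the hunk states, that a "page" is 64 consecutive code points and that each bitmap word holds 64 page bits; both function names are hypothetical, and the input must be well-formed UTF-8 starting at a multibyte lead byte.

```rust
/// Page key straight from UTF-8 bytes, mirroring the hunk above.
/// Returns (word index, bit index, encoded length).
pub fn page_key_from_bytes(bytes: &[u8]) -> (usize, u8, usize) {
    let lead = bytes[0];
    if lead < 0xE0 {
        (0, lead & 0x1F, 2) // 2-byte: payload bits of the lead are the page
    } else if lead < 0xF0 {
        ((lead & 0x0F) as usize, bytes[1] & 0x3F, 3) // 3-byte
    } else {
        (
            (((lead & 0x07) as usize) << 6) | (bytes[1] & 0x3F) as usize,
            bytes[2] & 0x3F,
            4,
        ) // 4-byte
    }
}

/// The same key from a decoded scalar value (assumed layout, see lead-in).
pub fn page_key_from_char(c: char) -> (usize, u8, usize) {
    let page = (c as u32) >> 6; // drop the low 6 bits (the last UTF-8 byte)
    ((page >> 6) as usize, (page & 0x3F) as u8, c.len_utf8())
}
```

The two agree because the last continuation byte carries exactly the low 6 bits of the code point, so everything before it already identifies the page without decoding.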


gorzell · 2026-06-10T13:29:16Z

+
+The crate is [`casefold`](../README.md); the generated table and full design notes live alongside the source.
+
+[^overlong]: The byte-space arithmetic assumes the input is **well-formed, shortest-form UTF-8


We probably need to decide if we want to use footnotes, info boxes or both.

Addresses PR #125 review: move the 'Treat the absolute figures as illustrative' note out of the table intro into a [^bench] footnote defined at the bottom of the file alongside [^overlong]. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

krukow

Looks good. My main feedback is to explain the concept/problem a bit before showing the performance table, and also to explain each of the columns and rows in that table (right now it's hard for a reader unfamiliar with the subject to understand what the terms in the table refer to).

krukow · 2026-06-10T19:44:52Z

+
+Let's walk through the evolution in detail.
+
+## Why case-folding is even important?


I would pull this up before the data table

krukow · 2026-06-10T19:45:37Z

+
+Criterion medians on an Apple M4 (single core, `target-cpu=native`).[^bench]
+
+| Workload (input size)                  | `simple_fold`  | `simd_normalizer` | `HashMap` (byte path) | `str::to_lowercase` | `simdutf` round-trip |


Can we explain how to read the table for a reader who is unfamiliar with its elements?

krukow · 2026-06-10T19:50:52Z

+> touches the data half as many times. It's the same lesson one more time: in the
+> hot loop, the branch is the enemy.
+
+It is genuinely faster to


Can we elevate these lessons to the top of the article as a teaser, with a "read on to learn more"?

krukow · 2026-06-10T19:55:45Z

+| A runtime `HashMap<u32, u32>`                        | ~17 KB     |
+| **This crate (paged bitmap + packed runs)**         | **1776 B** |
+
+## Takeaways


Same for this: consider elevating these takeaways to the top of the article as a teaser, with a "read on to learn more".

gorzell added 2 commits June 10, 2026 14:36

Partial edits.

79ca7ed

Some formatting.

7f8feed

Copilot AI review requested due to automatic review settings June 10, 2026 12:51

gorzell requested a review from a team as a code owner June 10, 2026 12:51

Copilot started reviewing on behalf of gorzell June 10, 2026 12:51 View session

Copilot AI reviewed Jun 10, 2026

gorzell added 2 commits June 10, 2026 14:58

Move how fast towards the top.

8fa6e0c

Shorten the how fast section.

f99da81

gorzell commented Jun 10, 2026

Comment thread crates/casefold/docs/release_blog.md Outdated

gorzell added 2 commits June 10, 2026 15:26

Fix example code formatting.

f7dca5e

Move footnote to end.

49eefa4

gorzell commented Jun 10, 2026

krukow reviewed Jun 10, 2026

gorzell wants to merge 7 commits into aneubeck/foldblog from gorzell/blog-edits


3 participants
