Some edits and formatting cleanup. #125

Open: gorzell wants to merge 7 commits into `aneubeck/foldblog` from `gorzell/blog-edits`
@gorzell gorzell commented Jun 10, 2026

Contributor

Copilot AI review requested due to automatic review settings June 10, 2026 12:51
@gorzell gorzell requested a review from a team as a code owner June 10, 2026 12:51

Copilot AI left a comment

Contributor

Pull request overview

This PR relocates the casefold performance/release blog content from crates/casefold/BLOG.md into a docs-targeted Markdown file intended for rendering from crates/casefold/docs/.

Changes:

  • Added crates/casefold/docs/release_blog.md containing the blog post content.
  • Removed the previous crates/casefold/BLOG.md version of the post.

Summary per file:

| File | Description |
| --- | --- |
| `crates/casefold/docs/release_blog.md` | New docs-hosted blog Markdown; currently contains several Markdown/emphasis and Rust snippet formatting issues that affect rendering/copy-paste correctness. |
| `crates/casefold/BLOG.md` | Deleted the prior blog post file from the crate root (content moved to docs). |

Copilot's findings

  • Files reviewed: 2/2 changed files
  • Comments generated: 11

Comment on lines +20 to +21
- **Identifiers and protocols** use case-insensitive comparison of usernames, hostnames, file paths, HTTP headers, and so on.
Comment on lines +35 to +39
These diverge on real characters (`ß`, `İ`, final sigma), and lowercasing as a stand-in silently produces incorrect matches. This crate implements the **simple** (1-to-1) folds, statuses `C` and `S` in [`CaseFolding.txt`](https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt), and deliberately *not* the multi-character "full" folds (`ß` → `ss`) or Turkic locale folds.
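
The divergence can be reproduced with the standard library alone. The sketch below is illustrative only: `toy_simple_fold` is a hypothetical stand-in that covers ASCII letters and the Greek sigma mappings, not the crate's generated table, and the helper names are assumptions.

```rust
/// Hypothetical toy fold (an assumption for illustration): ASCII letters plus
/// the sigma mappings from `CaseFolding.txt`, where U+03A3 and U+03C2 both
/// fold to U+03C3. Every other character passes through unchanged.
pub fn toy_simple_fold(s: &str) -> String {
    s.chars()
        .map(|c| match c {
            'A'..='Z' => c.to_ascii_lowercase(),
            '\u{03A3}' | '\u{03C2}' => '\u{03C3}',
            _ => c,
        })
        .collect()
}

/// Compare two strings by lowercasing. `str::to_lowercase` is context
/// sensitive for sigma, so spellings that should match can differ.
pub fn lowercase_keys_match(a: &str, b: &str) -> bool {
    a.to_lowercase() == b.to_lowercase()
}

/// Compare two strings by (toy) simple case folding.
pub fn folded_keys_match(a: &str, b: &str) -> bool {
    toy_simple_fold(a) == toy_simple_fold(b)
}
```

With an uppercase double sigma on one side and a lowercase double sigma on the other, the lowercase comparison fails because the trailing capital sigma lowercases to the final form, while the folded comparison succeeds.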
Comment thread crates/casefold/docs/release_blog.md

It looks ideal: do the cheap byte work, and the instant you hit a non-ASCII byte, `break` and let the "real" Unicode path take over: "only do the cheap work until you have to." On an Apple M4 this runs at about **3 GiB/s**. That sounds fine in isolation, but it is more than **15× short** of "optimal" because of the `if` branches.
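
For reference, an early-exit loop of the kind described might look like the following. This is a minimal sketch under assumptions: the function name and the choice to return the prefix length are illustrative, not taken from the post.

```rust
/// Naive early-exit ASCII fold (illustrative sketch): lowercases bytes in
/// place and stops at the first non-ASCII byte. Returns the number of bytes
/// handled, i.e. the length of the ASCII prefix.
pub fn fold_ascii_prefix_naive(bytes: &mut [u8]) -> usize {
    for (i, b) in bytes.iter_mut().enumerate() {
        if *b >= 0x80 {
            return i; // data-dependent exit: the "break"
        }
        if b.is_ascii_uppercase() {
            *b |= 0x20; // data-dependent branch per byte
        }
    }
    bytes.len()
}
```

Both `if`s depend on the data, which is what keeps the compiler from turning the loop into straight-line vector code.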
Comment on lines +80 to +88
let mut high_bit_acc: u8 = 0;
for b in &mut bytes {
    high_bit_acc |= *b; // detect any non-ASCII byte
    let is_upper = b.wrapping_sub(b'A') < 26; // branchless A..=Z test
    *b |= u8::from(is_upper) << 5; // set bit 5 → lowercase, else no-op
}
if high_bit_acc & 0x80 == 0 {
    return bytes; // pure ASCII: already folded in place, no second buffer
}
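
The fragment above can be wrapped in a function so it compiles and can be tested on its own. The wrapper below is a sketch: the name `ascii_sweep` and the `(Vec<u8>, bool)` return shape are assumptions made for illustration, not the crate's API.

```rust
/// Single-pass, branch-free ASCII sweep over an owned buffer (sketch).
/// Returns the same buffer plus a flag saying whether the input was pure
/// ASCII. Bytes >= 0x80 fail the `A..=Z` test and are left untouched, so a
/// valid UTF-8 input stays valid.
pub fn ascii_sweep(mut bytes: Vec<u8>) -> (Vec<u8>, bool) {
    let mut high_bit_acc: u8 = 0;
    for b in &mut bytes {
        high_bit_acc |= *b; // accumulate high bits instead of branching
        let is_upper = b.wrapping_sub(b'A') < 26; // true only for b'A'..=b'Z'
        *b |= u8::from(is_upper) << 5; // OR in 0x20 or 0x00
    }
    let pure_ascii = high_bit_acc & 0x80 == 0;
    (bytes, pure_ascii)
}
```

Because the buffer is taken by value and handed back, the pure-ASCII case involves no new allocation.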
Comment on lines +127 to +130
`u64` lanes and checking all their high bits with a single `& 0x8080_8080_8080_8080` mask. You can build the ASCII fast path on top of that: chunk-scan to find the ASCII prefix, then run the branchless (vectorizable) convert over it. That keeps the early-exit ability (it still bails on the first non-ASCII block) while letting both halves go fast. The catch is that it reads the data **twice** (once to scan, once to convert), landing at about **23 GiB/s**: roughly half of the single-pass branchless sweep, and ~7× the naive break loop. A solid, general-purpose default; just not the absolute ceiling when you control the whole loop and can fold detection and conversion into one branch-free pass.
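
A two-pass version along these lines could be sketched as follows. The function names are hypothetical, and the byte-wise tail after the chunk scan is an addition of this sketch so that the returned prefix length is exact.

```rust
const HIGH_BITS: u64 = 0x8080_8080_8080_8080;

/// Pass 1 (sketch): find the length of the ASCII prefix by testing eight
/// bytes at a time against the high-bit mask, then finishing byte by byte.
pub fn ascii_prefix_len(bytes: &[u8]) -> usize {
    let mut n = 0;
    for chunk in bytes.chunks_exact(8) {
        let mut lane = [0u8; 8];
        lane.copy_from_slice(chunk);
        if u64::from_ne_bytes(lane) & HIGH_BITS != 0 {
            break; // some byte in this block is non-ASCII
        }
        n += 8;
    }
    // Finish inside the failing block (or the short tail) one byte at a time.
    while n < bytes.len() && bytes[n] < 0x80 {
        n += 1;
    }
    n
}

/// Pass 2 (sketch): branch-free lowercase over the prefix found in pass 1.
/// Returns the prefix length so a caller can resume at the first non-ASCII byte.
pub fn fold_ascii_prefix_two_pass(bytes: &mut [u8]) -> usize {
    let n = ascii_prefix_len(bytes);
    for b in &mut bytes[..n] {
        let is_upper = b.wrapping_sub(b'A') < 26;
        *b |= u8::from(is_upper) << 5;
    }
    n
}
```

The prefix is read once by the scan and once by the convert, which is the double pass the paragraph above refers to.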
Comment on lines +111 to +112
> **Branchless is a *pessimization* in scalar code.** Look again at the
> table: making the body branchless while *keeping* the `break` (2.6 GiB/s) is
Comment on lines +153 to +163
40 GiB/s also means doing zero unnecessary allocation. `simple_fold` takes the input `String` *by value*, so it owns the heap buffer and can mutate and return it. If the OR-accumulator's high bit was clear, the input was pure ASCII and is already folded in place, so we hand the **same allocation** straight back: no second buffer and no copy. Otherwise we `memchr` to the first non-ASCII byte and scan the tail from there, leaving the output buffer *unallocated* (a null write cursor) until we hit a character that folds to **different bytes**. Text whose multibyte content never folds (CJK, Hangul, Kana, Arabic, Hebrew, symbols) also returns the original allocation untouched, never copying a byte.
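
The "allocate only when something changes" idea can be sketched with an `Option<String>` in place of the null write cursor. This is a simplified illustration: `toy_fold` is a hypothetical stand-in for the real table lookup, and the sketch works per `char` rather than on raw bytes.

```rust
/// Hypothetical toy fold (assumption): ASCII letters and Greek sigma only.
fn toy_fold(c: char) -> char {
    match c {
        'A'..='Z' => c.to_ascii_lowercase(),
        '\u{03A3}' | '\u{03C2}' => '\u{03C3}',
        _ => c,
    }
}

/// Lazily allocating fold (sketch): the output buffer stays `None` until the
/// first character that actually changes; if none does, the original
/// allocation is returned untouched.
pub fn fold_lazily(input: String) -> String {
    let mut out: Option<String> = None;
    for (i, c) in input.char_indices() {
        let folded = toy_fold(c);
        if out.is_none() && folded != c {
            // First change: allocate and copy the unchanged prefix.
            let mut buf = String::with_capacity(input.len());
            buf.push_str(&input[..i]);
            out = Some(buf);
        }
        if let Some(buf) = out.as_mut() {
            buf.push(folded);
        }
    }
    out.unwrap_or(input)
}
```

Comparing `as_ptr()` before and after the call is a simple way to confirm that unchanged text comes back in the same heap buffer.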

Why a *second* buffer rather than rewriting in place like the ASCII pass? Because folding can make the string **longer**: almost every fold preserves the UTF-8 length or shrinks it, but two outliers grow. U+023A (`Ⱥ`) and U+023E (`Ⱦ`) are 2 bytes each yet fold to 3-byte characters (`ⱥ`, `ⱦ`). Once one appears, the output no longer fits in the input's bytes, and we need somewhere new to write.
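
The length change is easy to check with `char::len_utf8`. In `CaseFolding.txt`, U+023A folds to U+2C65 and U+023E folds to U+2C66; the helper name below is an assumption for illustration.

```rust
/// Change in UTF-8 length, in bytes, when `from` is replaced by `to`.
/// Positive means the output grows, negative means it shrinks.
pub fn utf8_growth(from: char, to: char) -> isize {
    to.len_utf8() as isize - from.len_utf8() as isize
}
```

For comparison, the Kelvin sign U+212A folds to ASCII `k`, shrinking from 3 bytes to 1, which is the common direction.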
Comment thread crates/casefold/docs/release_blog.md Outdated
Comment on lines +406 to +409
The pure-ASCII row is the fairest fight of all: there `str::to_lowercase` produces the **exact same bytes** we do (a correct std-library baseline rather than a different operation), and even then the branch-free sweep is ~1.5× faster (40.8 vs 27.7 GiB/s), because `to_lowercase` still scans for the first non-ASCII byte and allocates a fresh
Comment on lines +220 to +231
let (word_idx, bit_idx, c_len) = if lead < 0xE0 {
    (0usize, lead & 0x1F, 2usize) // 2-byte: word 0
} else if lead < 0xF0 {
    ((lead & 0x0F) as usize, bytes[read + 1] & 0x3F, 3) // 3-byte: word = nibble
} else {
    // 4-byte: merge 2 bytes
    (
        (((lead & 0x07) as usize) << 6) | (bytes[read + 1] & 0x3F) as usize,
        bytes[read + 2] & 0x3F,
        4usize,
    )
};
// reject without decoding: clear bit ⇒ no fold
if word_idx >= PAGE_BITMAP.len() || (PAGE_BITMAP[word_idx] >> bit_idx) & 1 == 0 {
    read += c_len;
    continue;
}
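
One way to read the byte arithmetic above: every index it computes equals what you would get from the decoded code point, with page = `cp >> 6`, word = `page >> 6`, and bit = `page & 63`. The sketch below checks that equivalence for well-formed input. Both function names are hypothetical, and the layout of `PAGE_BITMAP` itself is not reproduced here.

```rust
/// Page coordinates computed directly from the UTF-8 bytes of one multibyte
/// character (lead byte >= 0xC0, well-formed input assumed). Mirrors the
/// arithmetic in the snippet above.
pub fn page_from_bytes(b: &[u8]) -> (usize, u8, usize) {
    let lead = b[0];
    if lead < 0xE0 {
        (0, lead & 0x1F, 2)
    } else if lead < 0xF0 {
        ((lead & 0x0F) as usize, b[1] & 0x3F, 3)
    } else {
        (
            (((lead & 0x07) as usize) << 6) | (b[1] & 0x3F) as usize,
            b[2] & 0x3F,
            4,
        )
    }
}

/// The same coordinates derived from the decoded code point, for comparison.
pub fn page_from_char(c: char) -> (usize, u8, usize) {
    let page = (c as u32) >> 6; // 64 code points per page
    ((page >> 6) as usize, (page & 0x3F) as u8, c.len_utf8())
}
```

The two functions agree because the last continuation byte carries exactly the low 6 bits of the code point, so dropping it is the same as shifting right by 6.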
Comment thread crates/casefold/docs/release_blog.md Outdated

The crate is [`casefold`](../README.md); the generated table and full design notes live alongside the source.

[^overlong]: The byte-space arithmetic assumes the input is **well-formed, shortest-form UTF-8

Contributor Author

We probably need to decide if we want to use footnotes, info boxes or both.

Addresses PR #125 review: move the 'Treat the absolute figures as
illustrative' note out of the table intro into a [^bench] footnote
defined at the bottom of the file alongside [^overlong].

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@krukow krukow left a comment

Looks good. My main feedback is to explain the concept and problem a bit before showing the performance table, and to explain each column and row in that table (right now a reader unfamiliar with the subject will find it hard to tell what the table's terms refer to).


Let's walk through the evolution in detail.

## Why is case folding even important?

I would pull this up before the data table


Criterion medians on an Apple M4 (single core, `target-cpu=native`).[^bench]

| Workload (input size) | `simple_fold` | `simd_normalizer` | `HashMap` (byte path) | `str::to_lowercase` | `simdutf` round-trip |

can we explain how to read the table for a reader who is unfamiliar with the elements in the table?

> touches the data half as many times. It's the same lesson one more time: in the
> hot loop, the branch is the enemy.

It is genuinely faster to

Can we elevate these lessons to the top of the article as a teaser, with a "read on to learn more" hook?

| A runtime `HashMap<u32, u32>` | ~17 KB |
| **This crate (paged bitmap + packed runs)** | **1776 B** |

## Takeaways

Same for this: consider elevating these takeaways to the top of the article as a teaser, with a "read on to learn more" hook.
