Fix CFF CID-keyed font subsetting: missing .notdef glyph and incorrect Charset encoding#116
Open
55728 wants to merge 1 commit into
When subsetting CID-keyed CFF fonts (e.g., NotoSerifCJK.ttc), the encoded CFF data contains structural errors that cause certain glyphs to not render in PDF viewers. Three bugs are fixed:

1. CharstringsIndex#encode_items omits the .notdef charstring when the charmap has no mapping for GID 0. The CFF spec requires .notdef at index 0; its absence shifts all charstring indices by one.
2. FdSelector#encode similarly omits the .notdef entry, causing a mismatch between charstring indices and Font Dict assignments. Glyphs end up referencing the wrong Font Dict's local subroutines, producing corrupt outlines.
3. Charset#encode passes unsorted SIDs to BinUtils.rangify, which assumes sorted input. In CID-keyed fonts SIDs ordered by new GID are not necessarily ascending, so rangify merges unrelated SIDs into incorrect ranges. The fix falls back to array format when SIDs are not sorted.

Tested with NotoSerifCJK.ttc (65,535 glyphs, 18 Font Dicts). Verified correct rendering and CFF structure with fonttools.

Ref: prawnpdf/prawn#1105
This was referenced Mar 24, 2026
Problem
When subsetting CID-keyed CFF fonts (e.g., NotoSerifCJK.ttc), the generated PDF contains corrupted glyph data, causing certain CJK characters to not render in PDF viewers. The issue was originally reported in prawnpdf/prawn#1105.
While the original crash (NoMethodError on the glyf table) was resolved in ttfunk 1.8.0, the generated PDFs still contain structural errors in the CFF data that cause glyph rendering failures.

Root Cause
Three related bugs in the CFF subsetting code:
1. CharstringsIndex#encode_items: missing .notdef charstring

When the charmap does not include a mapping for GID 0 (.notdef), the encoded CharstringsIndex omits the .notdef charstring. The CFF specification requires .notdef to always be present at index 0. This causes all charstring indices to be off by one.

2. FdSelector#encode: missing .notdef FD entry

Similarly, the encoded FD selector omits the entry for GID 0, causing a mismatch between charstring indices and their Font Dict assignments. Glyphs end up referencing the wrong Font Dict's local subroutines, producing corrupted outlines.

3. Charset#encode: incorrect range encoding for unsorted SIDs

In CID-keyed fonts, SIDs (String IDs) in new GID order are not necessarily in ascending order. BinUtils.rangify assumes sorted input and groups values where b - a <= 1 into ranges. When SIDs decrease (e.g., [1549, 1509]), the difference is negative, which satisfies <= 1, causing unrelated SIDs to be incorrectly merged into a single range.

Fix
1. Prepend the .notdef charstring (items[0]) when the charmap does not include GID 0.
2. Prepend the .notdef FD entry ([0, self[0]]) when the charmap does not include GID 0.
3. Check whether the SIDs are sorted before calling rangify. If not, fall back to array format encoding.

Verification
Reproduction
Before this fix, some characters (e.g., こ, 世, テ) are invisible in the output PDF. After this fix, all characters render correctly.
📎 Supplementary material
Supporting reference for "Fix CFF CID-keyed font subsetting: missing .notdef glyph and incorrect Charset encoding." All spec quotations are from Adobe Technical Note #5176, "The Compact Font Format Specification" (version 1.0, 4 December 2003); section names and table numbers are quoted verbatim, and conventional section numbers are given for convenience.

CFF spec references (Adobe TN #5176)
Why .notdef must exist at index 0: CharStrings INDEX

And, from the overview of how the parallel arrays line up (the Charsets, Encodings and Glyphs overview, immediately preceding §12 Encodings):
Takeaway for bug #1: in the CharStrings INDEX, .notdef (GID 0) is not optional: it is required and occupies index 0. The charset/encoding arrays omit it (they start at GID 1), but the CharStrings INDEX does not. Dropping it shifts every subsequent charstring index by one.

FDSelect structure, and the crucial difference from charset
Format 0 (Table 27) — array, one byte per glyph:
Format 3 (Table 28) — ranges:
Range3 (Table 29): Card16 first ("First glyph index in range"), Card8 fd ("FD index for all glyphs in range").

Takeaway for bug #2: the spec wording here is what makes #2 a real bug. The charset deliberately omits .notdef; the FDSelect deliberately includes it ("the .notdef glyph is included in this case", and Format 3's "first range must have a 'first' GID of 0"). Because the CharStrings INDEX has .notdef at GID 0, the FDSelect must carry a matching entry at GID 0, or the per-glyph FD lookup is shifted.

Charset range-encoding formats
Format 0 (Table 17) — array of SIDs:
Format 1 (Table 18) → Range1 (Table 19):
Format 2 (Table 20) → Range2 (Table 21):
And the reason subset CID fonts trip this — §18 CID-keyed Fonts:
Takeaway for bug #3: the range formats encode [first, nLeft] where nLeft is a count of sequential, ascending SIDs ("glyphs left in range"). A non-identity subset charset is not ascending, so BinUtils.rangify, which assumes a sorted sequence, emits ranges that decode to the wrong SIDs. The fix falls back to array format (Format 0) in that case.

Before / after: GID → charstring-index mapping
In all three diagrams the example charmap has no entry for GID 0 (the common case when subsetting: the consumer never asked for .notdef, so it isn't in the charmap). items[n] is the original font's charstring for original GID n.

Bug #1: CharstringsIndex#encode_items

charmap = {0x20 => {old:1,new:1}, 0x21 => {old:2,new:2}}

| CharStrings index | BEFORE | AFTER |
|---|---|---|
| 0 | items[1] | items[0] = .notdef |
| 1 | items[2] | items[1] (old GID 1) |
| 2 | (no entry) | items[2] (old GID 2) |

Without the prepend, index 0 is silently taken by the first real glyph, so the whole INDEX is shifted left by one and the last glyph falls off the end. Every consumer that looks up "new GID 1" gets original glyph 2's outline.
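The shift described above can be reproduced with plain arrays. This is only a sketch: subset_charstrings is a hypothetical helper written for illustration, not the ttfunk method, and the strings stand in for real charstring data.

```ruby
# Stand-in data: strings play the role of the original font's charstrings.
items = %w[notdef glyph_a glyph_b]

# Same charmap as in the example above: no entry maps to GID 0.
charmap = {
  0x20 => { old: 1, new: 1 },
  0x21 => { old: 2, new: 2 }
}

# Hypothetical helper (not ttfunk API): collect charstrings in new-GID order
# and optionally prepend .notdef when the charmap has no GID 0.
def subset_charstrings(items, charmap, prepend_notdef:)
  mappings = charmap.values.sort_by { |m| m[:new] }
  new_items = mappings.map { |m| items[m[:old]] }
  has_gid_zero = mappings.any? { |m| m[:new].zero? }
  new_items.unshift(items[0]) if prepend_notdef && !has_gid_zero
  new_items
end

before = subset_charstrings(items, charmap, prepend_notdef: false)
after  = subset_charstrings(items, charmap, prepend_notdef: true)

p before # => ["glyph_a", "glyph_b"]
p after  # => ["notdef", "glyph_a", "glyph_b"]
```

In the before array, index 1 holds the charstring of original GID 2, which is the off-by-one lookup described above.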
Fix: new_items.unshift(items[0]) when the charmap has no GID 0.

Bug #2: FdSelector#encode (coupled to #1)

FDSelect (array format) is one FD-index byte per glyph, starting at GID 0 (Table 27: "the .notdef glyph is included"). Once #1 puts .notdef at GID 0 in the CharStrings INDEX, FDSelect must have a matching entry at GID 0 or the two arrays drift apart. self[old] = the original glyph's FD index. (fd(n) below = FD index of original GID n.)

| new GID | CharStrings INDEX | FDSelect BEFORE | FDSelect AFTER |
|---|---|---|---|
| 0 | .notdef (items[0]) | fd(old 1) | fd(old 0) |
| 1 | items[1] (old 1) | fd(old 3) | fd(old 1) |
| 2 | items[3] (old 3) | (no entry) | fd(old 3) |

Every FD index sits one slot too early, so each glyph reads its successor's FD index → wrong Font DICT → wrong Private DICT / local subrs → the charstring's callsubr operands resolve into a different subr INDEX → corrupt outline. This is exactly the "some CJK glyphs render as garbage / invisible" symptom.

Note the coupling: fixing #1 without #2 actually makes things worse for CID fonts, because #1 introduces the GID-0 slot in CharStrings that #2's entry is needed to align against. The two must land together.
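A small sketch of the alignment problem, using the same old GIDs (1 and 3) as the table above. The FD index values are made up for illustration, and nothing here calls into ttfunk.

```ruby
# Hypothetical FD indices for the original font, keyed by original GID.
fd_of_old = { 0 => 0, 1 => 4, 3 => 7 }

# Original GIDs kept by the subset, in new-GID order (GID 0 is not in the charmap).
kept_old_gids = [1, 3]

# CharStrings INDEX once .notdef has been prepended: .notdef, then the kept glyphs.
charstring_old_gids = [0] + kept_old_gids

# FDSelect (array format) without and with the GID-0 entry.
fd_select_before = kept_old_gids.map { |old| fd_of_old[old] }
fd_select_after  = [fd_of_old[0]] + fd_select_before

charstring_old_gids.each_with_index do |old, new_gid|
  puts "new GID #{new_gid} (old #{old}): expected FD #{fd_of_old[old]}, " \
       "before=#{fd_select_before[new_gid].inspect}, " \
       "after=#{fd_select_after[new_gid]}"
end
```

With these made-up values, only the after column agrees with the expected FD index for every new GID; the before column is shifted by one slot and has no entry for the last glyph.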
Fix: new_indices.unshift([0, self[0]]) when no GID-0 entry is present.

Bug #3: Charset#encode (range vs array)

For non-CID fonts SIDs sorted by new GID happen to be ascending, so rangify is safe. For a subset CID font they are not (see §18 quote above), and rangify mis-encodes; see the numerical walkthrough below. Before/after at the format level: BEFORE emits [first, nLeft] ranges; AFTER emits array format (Format 0) whenever the SIDs are not sorted.

Fix: detect sids.each_cons(2).all? { |a,b| a <= b }; if not sorted, force array format (total_range_size = ∞, total_array_size = 0).

rangify walkthrough (negative-difference corruption)

BinUtils.rangify is doc-commented "Turns a (sorted) sequence of values …". Its output pairs are [first_SID, nLeft], matching charset Range1/Range2 exactly (Tables 19/21: first plus nLeft, "glyphs left in range, excluding first").

The minimal failure: a single descending pair

sids = [1549, 1509]

1. slice_when compares (1549, 1509): predicate b - a > 1 → 1509 - 1549 = -40, and -40 > 1 is false → no split → both stay in one span.
2. [span.first, span.length - 1] = [1549, 2 - 1] = [1549, 1].
3. first = 1549, nLeft = 1 → this range covers SID 1549 and SID 1550.
The negative difference satisfies <= 1 (that is, it fails the b - a > 1 split test), so two unrelated SIDs are fused into one "sequential" range and the second glyph's CID is silently rewritten.

Full example matching the regression test
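sids = [1, 10, 5, 8, 3] (the SIDs in new-GID order from charset_spec.rb)

rangify step by step (slice_when starts a new slice between a and b when the predicate is true):

| pair (a, b) | b - a | b - a > 1? | action | span closed |
|---|---|---|---|---|
| (1, 10) | 9 | true | split | [1] |
| (10, 5) | -5 | false | keep 10, 5 together | |
| (5, 8) | 3 | true | split | [10, 5] |
| (8, 3) | -5 | false | keep 8, 3 together | |

Spans: [1], [10, 5], [8, 3] → ranges [[1, 0], [10, 1], [8, 1]]

Decoded back as charset ranges (first, nLeft):

| range | new GIDs | decoded SIDs | intended SIDs |
|---|---|---|---|
| [1, 0] | 1 | 1 | 1 |
| [10, 1] | 2, 3 | 10, 11 | 10, 5 |
| [8, 1] | 4, 5 | 8, 9 | 8, 3 |

GID 3 and GID 5 are corrupted.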
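The walkthrough above can be checked with a few lines of Ruby. rangify_sketch is a stand-in written from the rule described in this comment (split where b - a > 1, emit [first, length - 1]); it is assumed to behave like BinUtils.rangify but is not copied from ttfunk. expand_ranges is a hypothetical decoder for [first, nLeft] pairs.

```ruby
# Stand-in for the grouping rule described above (assumed, not ttfunk source).
def rangify_sketch(values)
  values
    .slice_when { |a, b| b - a > 1 }
    .map { |span| [span.first, span.length - 1] }
end

# Expand [first, nLeft] pairs the way a charset range decoder would:
# first, first + 1, ..., first + nLeft.
def expand_ranges(ranges)
  ranges.flat_map { |first, n_left| (first..first + n_left).to_a }
end

sids = [1, 10, 5, 8, 3]
ranges = rangify_sketch(sids)

p ranges                       # => [[1, 0], [10, 1], [8, 1]]
p expand_ranges(ranges)        # => [1, 10, 11, 8, 9]
p rangify_sketch([1549, 1509]) # => [[1549, 1]]
```

The decoded list differs from the input at positions 3 and 5 (11 instead of 5, 9 instead of 3), which are the two corrupted GIDs. For sorted input such as [3, 4, 5, 9] the same sketch round-trips cleanly.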
Cost check that makes BEFORE actually pick the broken encoding:

- range_max = max(0, 1, 1) = 1 → range_bytes = (log2(1)/8).floor + 1 = 1 → range_format_8.
- total_range_size = 2*3 + 1*3 = 9; total_array_size = 5 * 2 = 10.
- total_array_size (10) <= total_range_size (9) → false → range format is chosen → corruption ships.
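The size comparison, and the effect of forcing array format for unsorted SIDs, can be sketched as follows. choose_charset_format is a hypothetical helper that mirrors the arithmetic quoted above; the 2-byte array element width and the Float::INFINITY branch for a zero range_max are assumptions of this sketch, not confirmed ttfunk behavior.

```ruby
# Hypothetical helper mirroring the size comparison described above.
# ranges are [first, nLeft] pairs; array format is assumed to cost 2 bytes per SID.
def choose_charset_format(sids, ranges, force_array_when_unsorted:)
  sorted = sids.each_cons(2).all? { |a, b| a <= b }
  return :array_format if force_array_when_unsorted && !sorted

  range_max = ranges.map(&:last).max
  # Assumption: with no multi-glyph range, range encoding is never preferred.
  range_bytes =
    range_max.positive? ? (Math.log2(range_max) / 8).floor + 1 : Float::INFINITY

  total_range_size = (2 * ranges.size) + (range_bytes * ranges.size)
  total_array_size = sids.size * 2
  return :array_format if total_array_size <= total_range_size

  range_bytes == 1 ? :range_format_8 : :range_format_16
end

sids = [1, 10, 5, 8, 3]
ranges = [[1, 0], [10, 1], [8, 1]]

before = choose_charset_format(sids, ranges, force_array_when_unsorted: false)
after  = choose_charset_format(sids, ranges, force_array_when_unsorted: true)
p before # => :range_format_8
p after  # => :array_format

# Array format: one format byte (0) followed by big-endian 16-bit SIDs.
encoded = ([0] + sids).pack('Cn*')
p encoded.unpack1('H*')  # => "000001000a000500080003"
p encoded.unpack('Cn*')  # => [0, 1, 10, 5, 8, 3]
```

The early return is the simplest way to express "force array format" in a sketch; the PR describes the same outcome in terms of setting total_range_size to ∞ and total_array_size to 0.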
AFTER the fix: sids_sorted is false → total_range_size = ∞, total_array_size = 0 → array format → encoded bytes 00 | 0001 000A 0005 0008 0003 → decoded SIDs [1, 10, 5, 8, 3] exactly. Matches the new spec: encoded.bytes[0] == 0 (array format) and the round-tripped SIDs equal the input.

Section names and all table numbers / quotations above are transcribed directly from TN #5176. The conventional section numbers used (12 Encodings, 13 Charsets, 14 CharStrings INDEX, 18 CID-keyed Fonts, 19 FDSelect) are the standard numbering of the 2003 spec; if the published comment needs to be airtight, cite by table number (17–22 for charset, 27–29 for FDSelect), which is unambiguous.