Fix CFF CID-keyed font subsetting: missing .notdef glyph and incorrect Charset encoding by 55728 · Pull Request #116 · prawnpdf/ttfunk

55728 · 2026-03-23T12:03:02Z

Problem

When subsetting CID-keyed CFF fonts (e.g., NotoSerifCJK.ttc), the generated PDF contains corrupted glyph data, causing certain CJK characters to not render in PDF viewers. The issue was originally reported in prawnpdf/prawn#1105.

While the original crash (NoMethodError on glyf table) was resolved in ttfunk 1.8.0, the generated PDFs still contain structural errors in the CFF data that cause glyph rendering failures.

Root Cause

Three related bugs in the CFF subsetting code:

1. `CharstringsIndex#encode_items` — missing .notdef charstring

When the charmap does not include a mapping for GID 0 (.notdef), the encoded CharstringsIndex omits the .notdef charstring. The CFF specification requires .notdef to always be present at index 0. This causes all charstring indices to be off by one.

2. `FdSelector#encode` — missing .notdef FD entry

Similarly, the encoded FD selector omits the entry for GID 0, causing a mismatch between charstring indices and their Font Dict assignments. Glyphs end up referencing the wrong Font Dict's local subroutines, producing corrupted outlines.

3. `Charset#encode` — incorrect range encoding for unsorted SIDs

In CID-keyed fonts, SIDs (String IDs) in new GID order are not necessarily in ascending order. BinUtils.rangify assumes sorted input and groups values where b - a <= 1 into ranges. When SIDs decrease (e.g., [1549, 1509]), the difference is negative, which satisfies <= 1, causing unrelated SIDs to be incorrectly merged into a single range.

Fix

CharstringsIndex: Prepend the .notdef charstring (items[0]) when the charmap does not include GID 0.
FdSelector: Prepend the .notdef FD entry ([0, self[0]]) when the charmap does not include GID 0.
Charset: Check if SIDs are in ascending order before calling rangify. If not, fall back to array format encoding.

Verification

Tested with NotoSerifCJK.ttc (CID-keyed CFF font with 65,535 glyphs and 18 Font Dicts)
All CJK characters render correctly in Chrome after the fix
CFF structure validated with Python fonttools: all charstrings decompile and draw without errors
FdSelector GID→FD mappings verified correct for all glyphs
Charset CID names verified correct for all glyphs

Reproduction

require 'prawn'

Prawn::Document.new {
  font_families.update('NotoSerifCJK' => {
    normal: { file: '/path/to/NotoSerifCJK.ttc', font: 10 }
  })
  font 'NotoSerifCJK', size: 20
  text 'こんにちは世界テスト'
}.render_file 'output.pdf'

Before this fix, some characters (e.g., こ, 世, テ) are invisible in the output PDF. After this fix, all characters render correctly.

📎 Supplementary material

Supporting reference for "Fix CFF CID-keyed font subsetting: missing .notdef glyph and incorrect Charset encoding." All spec quotations are from Adobe Technical Note #5176, "The Compact Font Format Specification" (version 1.0, 4 December 2003); section names and table numbers are quoted verbatim, and conventional section numbers are given for convenience.

CFF spec references (Adobe TN #5176)

Why `.notdef` must exist at index 0 — CharStrings INDEX

§14 CharStrings INDEX "This contains the charstrings of all the glyphs in a font stored in an INDEX structure. Charstring objects contained within this INDEX are accessed by GID. The first charstring (GID 0) must be the .notdef glyph. The number of glyphs available in a font may be determined from the count field in the INDEX."

And, from the overview of how the parallel arrays line up (the Charsets, Encodings and Glyphs overview, immediately preceding §12 Encodings):

"By definition the first glyph (GID 0) is '.notdef' and must be present in all fonts. Since this is always the case, it is not necessary to represent either the encoding (unencoded) or name (.notdef) for GID 0. Consequently, taking advantage of this optimization, the encoding and charset arrays always begin with GID 1."

Takeaway for bug #1: the CharStrings INDEX is the one place .notdef (GID 0) is not optional — it is required and occupies index 0. The charset/encoding arrays omit it (they start at GID 1), but the CharStrings INDEX does not. Dropping it shifts every subsequent charstring index by one.

FDSelect structure — and the crucial difference from charset

§19 FDSelect "The FDSelect associates an FD (Font DICT) with a glyph by specifying an FD index for that glyph. The FD index is used to access one of the Font DICTs stored in the Font DICT INDEX."

Format 0 (Table 27) — array, one byte per glyph:

Type	Name	Description
Card8	format	=0
Card8	fds[nGlyphs]	FD selector array

"Each element of the fd array (fds) represents the FD index of the corresponding glyph. … The number of glyphs (nGlyphs) is the value of the count field in the CharStrings INDEX. (This format is identical to charset format 0 except that the .notdef glyph is included in this case.)"

Format 3 (Table 28) — ranges:

Type	Name	Description
Card8	format	=3
Card16	nRanges	Number of ranges
struct	Range3[nRanges]	Range3 array (T.29)
Card16	sentinel	Sentinel GID

Range3 (Table 29): Card16 first ("First glyph index in range"), Card8 fd ("FD index for all glyphs in range").

"Each Range3 describes a group of sequential GIDs that have the same FD index. … The first range must have a 'first' GID of 0. A sentinel GID follows the last range element … (The sentinel GID is set equal to the number of glyphs in the font. That is, its value is 1 greater than the last GID in the font.)"

Takeaway for bug #2: this is the exact spec sentence that makes #2 a real bug. The charset deliberately omits .notdef; the FDSelect deliberately includes it ("the .notdef glyph is included in this case", and Format 3's "first range must have a 'first' GID of 0"). Because the CharStrings INDEX has .notdef at GID 0, the FDSelect must carry a matching entry at GID 0, or the per-glyph FD lookup is shifted.
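The Format 3 layout quoted above can be sketched in a few lines of Ruby. This is an illustration only: `fdselect_format3` is a hypothetical helper, not part of ttfunk, and it assumes the input is an array of FD indices indexed by GID with GID 0 (.notdef) included.

```ruby
# Hypothetical helper (not ttfunk API): build FDSelect Format 3 bytes
# from a per-glyph FD array. fds[gid] is the FD index of that GID,
# and fds[0] is the entry for .notdef.
def fdselect_format3(fds)
  ranges = []
  fds.each_with_index do |fd, gid|
    # Open a new Range3 whenever the FD index changes.
    ranges << [gid, fd] if ranges.empty? || ranges.last[1] != fd
  end

  [3, ranges.length].pack('Cn') +                         # format, nRanges
    ranges.map { |first, fd| [first, fd].pack('nC') }.join + # Range3[nRanges]
    [fds.length].pack('n')                                # sentinel = nGlyphs
end

bytes = fdselect_format3([0, 0, 5, 5, 5, 2])
fields = bytes.unpack('CnnCnCnCn')
# fields => [3, 3, 0, 0, 2, 5, 5, 2, 6]
```

The first range starts at GID 0 and the sentinel equals the glyph count, which is only possible when the .notdef entry is present in the input.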

Charset range-encoding formats

§13 Charsets "Charset data is located via the offset operand to the charset operator in the Top DICT. Each charset is described by a format-type identifier byte followed by format-specific data. Three formats are currently defined as shown in Tables 17, 18, and 20."

Format 0 (Table 17) — array of SIDs:

Type	Name	Description
Card8	format	=0
SID	glyph[nGlyphs–1]	Glyph name array

"… The number of glyphs (nGlyphs) is the value of the count field in the CharStrings INDEX. (There is one less element in the glyph name array than nGlyphs because the .notdef glyph name is omitted.)"

Format 1 (Table 18) → Range1 (Table 19):

Type	Name	Description
SID	first	First glyph in range
Card8	nLeft	Glyphs left in range (excluding first)

Format 2 (Table 20) → Range2 (Table 21):

Type	Name	Description
SID	first	First glyph in range
Card16	nLeft	Glyphs left in range (excluding first)

"Each Range1 describes a group of sequential SIDs. … This format is particularly suited to charsets that are well ordered." "Format 2 differs from format 1 only in the size of the nLeft field … This format is most suitable for fonts with a large well-ordered charset — for example, for Asian CIDFonts."

And the reason subset CID fonts trip this — §18 CID-keyed Fonts:

"The charset data, although in the same format as non-CIDFonts, will represent CIDs rather than SIDs … In a complete CIDFont the charset table will specify an identity mapping … Subset CIDFonts will generally need to use a more complex charset table representing a non-identity mapping (where CID doesn't equal GID)."

Takeaway for bug #3: the range formats encode [first, nLeft] where nLeft is a count of sequential, ascending SIDs ("glyphs left in range"). A non-identity subset charset is not ascending, so range encoding (and BinUtils.rangify, which assumes a sorted sequence) cannot represent it without corruption — array format (Format 0) must be used.

Before / after: GID → charstring-index mapping

In all three diagrams the example charmap has no entry for GID 0 (the common case when subsetting — the consumer never asked for .notdef, so it isn't in the charmap). items[n] is the original font's charstring for original GID n.

Bug #1 — `CharstringsIndex#encode_items`

charmap = {0x20 => {old:1,new:1}, 0x21 => {old:2,new:2}}

new GID	BEFORE (corrupt): encoded charstring		AFTER (fixed): encoded charstring	
0	`items[1]`	❌	`items[0]` = .notdef	✅
1	`items[2]`	❌	`items[1]` (old GID 1)	✅
2	(dropped)	❌	`items[2]` (old GID 2)	✅

Without the prepend, index 0 is silently taken by the first real glyph, so the whole INDEX is shifted left by one and the last glyph falls off the end. Every consumer that looks up "new GID 1" gets original glyph 2's outline.

Fix: new_items.unshift(items[0]) when the charmap has no GID 0.
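A minimal sketch of the prepend, assuming a charmap shaped like the example above and a plain array standing in for the original charstrings. `subset_charstrings` is a hypothetical name used for illustration and is not the actual ttfunk method.

```ruby
# Hypothetical stand-in for the subsetting step; not ttfunk's real API.
# items:   original charstrings indexed by old GID
# charmap: code point => { old: old_gid, new: new_gid }
def subset_charstrings(items, charmap)
  mappings = charmap.values.sort_by { |m| m[:new] }
  new_items = mappings.map { |m| items[m[:old]] }
  # .notdef must sit at index 0, so prepend it when the charmap
  # has no mapping for GID 0 (assumed here to mean new GID 0).
  new_items.unshift(items[0]) unless mappings.any? { |m| m[:new].zero? }
  new_items
end

items = ['.notdef', 'glyph1', 'glyph2']
charmap = { 0x20 => { old: 1, new: 1 }, 0x21 => { old: 2, new: 2 } }
result = subset_charstrings(items, charmap)
# result => [".notdef", "glyph1", "glyph2"]
```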

Bug #2 — `FdSelector#encode` (coupled to #1)

FDSelect (array format) is one FD-index byte per glyph, starting at GID 0 (Table 27: "the .notdef glyph is included"). Once #1 puts .notdef at GID 0 in the CharStrings INDEX, FDSelect must have a matching entry at GID 0 or the two arrays drift apart. self[old] = the original glyph's FD index. (fd(n) below = FD index of original GID n.)

new GID	CharStrings INDEX charstring (post #1)	FDSelect BEFORE (fd byte)		FDSelect AFTER (fd byte)
0	.notdef (`items[0]`)	`fd(old 1)`	❌	`fd(old 0)`	✅
1	`items[1]` (old 1)	`fd(old 3)`	❌	`fd(old 1)`	✅
3	`items[3]` (old 3)	— missing —	❌	`fd(old 3)`	✅

Each FD index sits one slot too early, so every glyph reads the entry meant for the glyph after it → wrong Font DICT → wrong Private DICT / local subrs → the charstring's callsubr operands resolve into a different subr INDEX → corrupt outline. This is exactly the "some CJK glyphs render as garbage / invisible" symptom.

Note the coupling: fixing #1 without #2 actually makes things worse for CID fonts, because #1 introduces the GID-0 slot in CharStrings that #2's entry is needed to align against. The two must land together.

Fix: new_indices.unshift([0, self[0]]) when no GID-0 entry is present.
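The same idea for the FD selector, as a rough sketch. `subset_fd_entries` and the `fd_of_old_gid` array are hypothetical stand-ins (the array plays the role of `self[old]`), so treat the names as assumptions rather than ttfunk's API.

```ruby
# Hypothetical sketch; fd_of_old_gid[old_gid] stands in for FdSelector#[].
def subset_fd_entries(fd_of_old_gid, charmap)
  entries = charmap.values
                   .sort_by { |m| m[:new] }
                   .map { |m| [m[:new], fd_of_old_gid[m[:old]]] }
  # FDSelect covers GID 0, so add the .notdef entry when it is absent.
  entries.unshift([0, fd_of_old_gid[0]]) unless entries.any? { |gid, _fd| gid.zero? }
  entries
end

fd_of_old_gid = [0, 5, 7] # old GID 0 uses FD 0, old GID 1 uses FD 5, old GID 2 uses FD 7
charmap = { 0x20 => { old: 1, new: 1 }, 0x21 => { old: 2, new: 2 } }
entries = subset_fd_entries(fd_of_old_gid, charmap)
# entries => [[0, 0], [1, 5], [2, 7]]
```

With the GID 0 entry in place, each `[new GID, FD]` pair lines up with the charstring at the same index.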

Bug #3 — `Charset#encode` (range vs array)

For non-CID fonts SIDs sorted by new GID happen to be ascending, so rangify is safe. For a subset CID font they are not (see §18 quote above), and rangify mis-encodes — see the numerical walkthrough below. Before/after at the format level:

	BEFORE ❌	AFTER ✅
Decision	SIDs not ascending → rangify anyway → Range format chosen	SIDs not ascending → forced Array format 0
Encoding	wrong `[first, nLeft]` ranges	SID-per-GID array
Result	❌ wrong CIDs for some GIDs	✅ every CID preserved

Fix: detect sids.each_cons(2).all? { |a,b| a <= b }; if not sorted, force array format (total_range_size = ∞, total_array_size = 0).
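A small sketch of the ordering check and the resulting format decision. `ascending?` and `choose_charset_format` are hypothetical helpers; the real change adjusts the two size totals instead, and this sketch short-circuits to the same outcome.

```ruby
# True when every SID is >= its predecessor (empty and single-element
# lists count as sorted because each_cons(2) yields nothing).
def ascending?(sids)
  sids.each_cons(2).all? { |a, b| a <= b }
end

# Hypothetical decision helper: unsorted input always gets array format,
# otherwise the smaller encoding wins (ties go to array format).
def choose_charset_format(sids, total_range_size, total_array_size)
  return :array unless ascending?(sids)

  total_array_size <= total_range_size ? :array : :range
end

sorted_choice = choose_charset_format([1, 2, 3, 7], 6, 8)
unsorted_choice = choose_charset_format([1, 10, 5, 8, 3], 9, 10)
# sorted_choice => :range, unsorted_choice => :array
```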

rangify walkthrough (negative-difference corruption)

BinUtils.rangify (doc-commented "Turns a (sorted) sequence of values …"):

def rangify(values)
  values
    .slice_when { |a, b| b - a > 1 }      # start a new range when the gap > 1
    .map { |span| [span.first, span.length - 1] }   # => [first, nLeft]
end

The output pairs are [first_SID, nLeft], matching charset Range1/Range2 exactly (Tables 19/21: first + nLeft = "glyphs left in range, excluding first").

The minimal failure: a single descending pair

sids = [1549, 1509]

slice_when compares (1549, 1509): predicate b - a > 1 → 1509 - 1549 = -40,
and -40 > 1 is false → no split → both stay in one span.
map: [span.first, span.length - 1] = [1549, 2 - 1] = [1549, 1].
A decoder reads first = 1549, nLeft = 1 → this range covers SID 1549 and
SID 1550.

new GID	intended SID	decoded SID
k	1549	1549	✅
k+1	1509	1550	❌ 1509 lost; reader sees 1550

The negative difference satisfies <= 1, so two unrelated SIDs are fused into one "sequential" range and the second glyph's CID is silently rewritten.
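The descending-pair case can be reproduced with a local copy of the `rangify` logic quoted above plus a tiny decoder. `expand_ranges` is a hypothetical helper that mimics how a reader expands `[first, nLeft]` pairs; it is not part of ttfunk.

```ruby
# Local copy of the rangify logic quoted above.
def rangify(values)
  values
    .slice_when { |a, b| b - a > 1 }
    .map { |span| [span.first, span.length - 1] }
end

# Hypothetical decoder: expand each [first, nLeft] pair the way a
# charset reader would, as first, first + 1, ..., first + nLeft.
def expand_ranges(ranges)
  ranges.flat_map { |first, n_left| (first..first + n_left).to_a }
end

ranges = rangify([1549, 1509])
decoded = expand_ranges(ranges)
# ranges  => [[1549, 1]]
# decoded => [1549, 1550]
```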

Full example matching the regression test

sids = [1, 10, 5, 8, 3] (the SIDs in new-GID order from charset_spec.rb)

rangify step by step (slice_when splits between a and b whenever the predicate is true for the pair (a, b)):

pair	b − a	> 1 ?	split here?
(1, 10)	9	yes	yes → close `[1]`
(10, 5)	−5	no	no → keep `10,5` together
(5, 8)	3	yes	yes → close `[10, 5]`
(8, 3)	−5	no	no → keep `8,3` together

Spans: [1], [10, 5], [8, 3] → ranges [[1, 0], [10, 1], [8, 1]]

Decoded back as charset ranges (first, nLeft):

range	covers (first … first+nLeft)	assigned to GIDs	result
`[1, 0]`	SID 1	GID 1	✅
`[10, 1]`	SID 10, SID 11	GID 2, GID 3	❌ GID 3 → 11, want 5
`[8, 1]`	SID 8, SID 9	GID 4, GID 5	❌ GID 5 → 9, want 3

new GID	1	2	3	4	5
want	1	10	5	8	3
rangify	1	10	❌ 11	8	❌ 9

GID 3 and GID 5 are corrupted.
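The table above can be checked mechanically. The sketch below round-trips the example SIDs through a local copy of the quoted `rangify` logic and lists the new GIDs whose SID changed; the variable names are illustrative only.

```ruby
# Local copy of the rangify logic quoted earlier.
def rangify(values)
  values
    .slice_when { |a, b| b - a > 1 }
    .map { |span| [span.first, span.length - 1] }
end

want = [1, 10, 5, 8, 3] # SIDs in new-GID order, for GIDs 1..5

# Expand [first, nLeft] pairs the way a charset reader would.
decoded = rangify(want).flat_map { |first, n_left| (first..first + n_left).to_a }

# GIDs start at 1 here because the charset omits .notdef (GID 0).
corrupted_gids = want.each_index.reject { |i| want[i] == decoded[i] }.map { |i| i + 1 }
# decoded        => [1, 10, 11, 8, 9]
# corrupted_gids => [3, 5]
```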

Cost check that makes BEFORE actually pick the broken encoding:

range_max = max(0, 1, 1) = 1 → range_bytes = (log2(1)/8).floor + 1 = 1 →
range_format8.
total_range_size = 2*3 + 1*3 = 9; total_array_size = 5 * 2 = 10.
total_array_size (10) <= total_range_size (9) → false → range format is
chosen → corruption ships.
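The arithmetic above, written out as a sketch. The variable names follow the text; the formulas are transcribed from this walkthrough rather than taken from the ttfunk source, and the sketch assumes `range_max` is at least 1.

```ruby
ranges = [[1, 0], [10, 1], [8, 1]] # output of rangify for [1, 10, 5, 8, 3]
sid_count = 5

range_max = ranges.map(&:last).max                  # largest nLeft
range_bytes = (Math.log2(range_max) / 8).floor + 1  # bytes needed per nLeft

# Each range stores a 2-byte first SID plus range_bytes for nLeft.
total_range_size = (2 * ranges.length) + (range_bytes * ranges.length)
# Array format stores one 2-byte SID per glyph.
total_array_size = sid_count * 2

array_chosen = total_array_size <= total_range_size
# total_range_size => 9, total_array_size => 10, array_chosen => false
```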

AFTER the fix: sids_sorted is false → total_range_size = ∞, total_array_size = 0 → array format → encoded bytes 00 | 0001 000A 0005 0008 0003 → decoded SIDs [1, 10, 5, 8, 3] exactly. Matches the new spec: encoded.bytes[0] == 0 (array format) and the round-tripped SIDs equal the input.
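The array-format bytes can be reproduced with `Array#pack`, as a sketch of the layout only (one format byte followed by big-endian 16-bit SIDs); it does not call into ttfunk.

```ruby
sids = [1, 10, 5, 8, 3]

# Format byte 0, then each SID as a big-endian unsigned 16-bit value.
encoded = [0].pack('C') + sids.pack('n*')

hex = encoded.unpack1('H*')
format = encoded.getbyte(0)
round_tripped = encoded.byteslice(1, encoded.bytesize - 1).unpack('n*')
# hex           => "000001000a000500080003"
# format        => 0
# round_tripped => [1, 10, 5, 8, 3]
```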

Section names and all table numbers / quotations above are transcribed directly from TN #5176. The conventional section numbers used (12 Encodings, 13 Charsets, 14 CharStrings INDEX, 18 CID-keyed Fonts, 19 FDSelect) are the standard numbering of the 2003 spec; if the published comment needs to be airtight, cite by table number (17–22 for charset, 27–29 for FDSelect), which is unambiguous.

When subsetting CID-keyed CFF fonts (e.g., NotoSerifCJK.ttc), the encoded CFF data contains structural errors that cause certain glyphs to not render in PDF viewers.

Three bugs are fixed:

1. CharstringsIndex#encode_items omits the .notdef charstring when the charmap has no mapping for GID 0. The CFF spec requires .notdef at index 0; its absence shifts all charstring indices by one.
2. FdSelector#encode similarly omits the .notdef entry, causing a mismatch between charstring indices and Font Dict assignments. Glyphs end up referencing the wrong Font Dict's local subroutines, producing corrupt outlines.
3. Charset#encode passes unsorted SIDs to BinUtils.rangify, which assumes sorted input. In CID-keyed fonts SIDs ordered by new GID are not necessarily ascending, so rangify merges unrelated SIDs into incorrect ranges. The fix falls back to array format when SIDs are not sorted.

Tested with NotoSerifCJK.ttc (65,535 glyphs, 18 Font Dicts). Verified correct rendering and CFF structure with fonttools.

Ref: prawnpdf/prawn#1105

This was referenced Mar 24, 2026

Fix font subset tag generation according to PDF spec #107

Open

Fix subset font rendering in both Acrobat and Illustrator #117

Open

55728 mentioned this pull request Apr 4, 2026

Support ActualText in marked content for text extraction yob/pdf-reader#587

Merged

55728 force-pushed the fix-cff-fd-selector-off-by-one branch from 95d7379 to 024855e Compare June 5, 2026 14:30

55728 wants to merge 1 commit into prawnpdf:master from 55728:fix-cff-fd-selector-off-by-one
