Fix CFF CID-keyed font subsetting: missing .notdef glyph and incorrect Charset encoding #116

Open
55728 wants to merge 1 commit into prawnpdf:master from 55728:fix-cff-fd-selector-off-by-one

@55728 55728 commented Mar 23, 2026

Problem

When subsetting CID-keyed CFF fonts (e.g., NotoSerifCJK.ttc), the generated PDF contains corrupted glyph data, causing certain CJK characters to not render in PDF viewers. The issue was originally reported in prawnpdf/prawn#1105.

While the original crash (NoMethodError on glyf table) was resolved in ttfunk 1.8.0, the generated PDFs still contain structural errors in the CFF data that cause glyph rendering failures.

Root Cause

Three related bugs in the CFF subsetting code:

1. CharstringsIndex#encode_items — missing .notdef charstring

When the charmap does not include a mapping for GID 0 (.notdef), the encoded CharstringsIndex omits the .notdef charstring. The CFF specification requires .notdef to always be present at index 0. This causes all charstring indices to be off by one.

2. FdSelector#encode — missing .notdef FD entry

Similarly, the encoded FD selector omits the entry for GID 0, causing a mismatch between charstring indices and their Font Dict assignments. Glyphs end up referencing the wrong Font Dict's local subroutines, producing corrupted outlines.

3. Charset#encode — incorrect range encoding for unsorted SIDs

In CID-keyed fonts, SIDs (String IDs) in new GID order are not necessarily in ascending order. BinUtils.rangify assumes sorted input and groups values where b - a <= 1 into ranges. When SIDs decrease (e.g., [1549, 1509]), the difference is negative, which satisfies <= 1, causing unrelated SIDs to be incorrectly merged into a single range.

Fix

  • CharstringsIndex: Prepend the .notdef charstring (items[0]) when the charmap does not include GID 0.
  • FdSelector: Prepend the .notdef FD entry ([0, self[0]]) when the charmap does not include GID 0.
  • Charset: Check if SIDs are in ascending order before calling rangify. If not, fall back to array format encoding.

Verification

  • Tested with NotoSerifCJK.ttc (CID-keyed CFF font with 65,535 glyphs and 18 Font Dicts)
  • All CJK characters render correctly in Chrome after the fix
  • CFF structure validated with Python fonttools: all charstrings decompile and draw without errors
  • FdSelector GID→FD mappings verified correct for all glyphs
  • Charset CID names verified correct for all glyphs

Reproduction

require 'prawn'

Prawn::Document.new {
  font_families.update('NotoSerifCJK' => {
    normal: { file: '/path/to/NotoSerifCJK.ttc', font: 10 }
  })
  font 'NotoSerifCJK', size: 20
  text 'こんにちは世界テスト'
}.render_file 'output.pdf'

Before this fix, some characters (e.g., こ, 世, テ) are invisible in the output PDF. After this fix, all characters render correctly.


📎 Supplementary material

Supporting reference for "Fix CFF CID-keyed font subsetting: missing .notdef glyph and incorrect Charset encoding." All spec quotations are from Adobe Technical Note #5176, "The Compact Font Format Specification" (version 1.0, 4 December 2003); section names and table numbers are quoted verbatim, and conventional section numbers are given for convenience.

CFF spec references (Adobe TN #5176)

Why .notdef must exist at index 0 — CharStrings INDEX

§14 CharStrings INDEX "This contains the charstrings of all the glyphs in a font stored in an INDEX structure. Charstring objects contained within this INDEX are accessed by GID. The first charstring (GID 0) must be the .notdef glyph. The number of glyphs available in a font may be determined from the count field in the INDEX."

And, from the overview of how the parallel arrays line up (the Charsets, Encodings and Glyphs overview, immediately preceding §12 Encodings):

"By definition the first glyph (GID 0) is '.notdef' and must be present in all fonts. Since this is always the case, it is not necessary to represent either the encoding (unencoded) or name (.notdef) for GID 0. Consequently, taking advantage of this optimization, the encoding and charset arrays always begin with GID 1."

Takeaway for bug #1: the CharStrings INDEX is the one place .notdef (GID 0) is not optional — it is required and occupies index 0. The charset/encoding arrays omit it (they start at GID 1), but the CharStrings INDEX does not. Dropping it shifts every subsequent charstring index by one.

FDSelect structure — and the crucial difference from charset

§19 FDSelect "The FDSelect associates an FD (Font DICT) with a glyph by specifying an FD index for that glyph. The FD index is used to access one of the Font DICTs stored in the Font DICT INDEX."

Format 0 (Table 27) — array, one byte per glyph:

| Type | Name | Description |
|---|---|---|
| Card8 | format | =0 |
| Card8 | fds[nGlyphs] | FD selector array |

"Each element of the fd array (fds) represents the FD index of the corresponding glyph. … The number of glyphs (nGlyphs) is the value of the count field in the CharStrings INDEX. (This format is identical to charset format 0 except that the .notdef glyph is included in this case.)"

Format 3 (Table 28) — ranges:

| Type | Name | Description |
|---|---|---|
| Card8 | format | =3 |
| Card16 | nRanges | Number of ranges |
| struct | Range3[nRanges] | Range3 array (T.29) |
| Card16 | sentinel | Sentinel GID |

Range3 (Table 29): Card16 first ("First glyph index in range"), Card8 fd ("FD index for all glyphs in range").

"Each Range3 describes a group of sequential GIDs that have the same FD index. … The first range must have a 'first' GID of 0. A sentinel GID follows the last range element … (The sentinel GID is set equal to the number of glyphs in the font. That is, its value is 1 greater than the last GID in the font.)"

Takeaway for bug #2: this is the exact spec sentence that makes #2 a real bug. The charset deliberately omits .notdef; the FDSelect deliberately includes it ("the .notdef glyph is included in this case", and Format 3's "first range must have a 'first' GID of 0"). Because the CharStrings INDEX has .notdef at GID 0, the FDSelect must carry a matching entry at GID 0, or the per-glyph FD lookup is shifted.
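The Format 3 layout quoted above can be sketched in plain Ruby. This is an illustration only, not TTFunk code: `encode_fd_select_format3` is a hypothetical helper, and the FD values are made up. It takes one FD index per glyph, GID 0 included, and emits the format byte, nRanges, the Range3 entries, and the sentinel.

```ruby
# Hypothetical helper (not part of TTFunk): build Format 3 FDSelect bytes
# from an array holding one FD index per glyph, GID 0 (.notdef) included.
def encode_fd_select_format3(fds)
  # Collapse runs of equal FD indices into [first_gid, fd] pairs.
  ranges = []
  fds.each_with_index do |fd, gid|
    ranges << [gid, fd] if ranges.empty? || ranges.last[1] != fd
  end

  bytes = [3, ranges.size].pack('Cn')                          # format, nRanges
  ranges.each { |first, fd| bytes << [first, fd].pack('nC') }  # Range3 entries
  bytes << [fds.size].pack('n')                                # sentinel = nGlyphs
  bytes
end

# Made-up FD assignments: GID 0 (.notdef) uses FD 0, GIDs 1-2 use FD 2, GID 3 uses FD 5.
puts encode_fd_select_format3([0, 2, 2, 5]).unpack1('H*')
```

Because the input array starts at GID 0, the first Range3 always has a `first` of 0, which is the property the spec requires and the one the subsetter loses when the .notdef entry is dropped.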

Charset range-encoding formats

§13 Charsets "Charset data is located via the offset operand to the charset operator in the Top DICT. Each charset is described by a format-type identifier byte followed by format-specific data. Three formats are currently defined as shown in Tables 17, 18, and 20."

Format 0 (Table 17) — array of SIDs:

| Type | Name | Description |
|---|---|---|
| Card8 | format | =0 |
| SID | glyph[nGlyphs–1] | Glyph name array |

"… The number of glyphs (nGlyphs) is the value of the count field in the CharStrings INDEX. (There is one less element in the glyph name array than nGlyphs because the .notdef glyph name is omitted.)"

Format 1 (Table 18) → Range1 (Table 19):

| Type | Name | Description |
|---|---|---|
| SID | first | First glyph in range |
| Card8 | nLeft | Glyphs left in range (excluding first) |

Format 2 (Table 20) → Range2 (Table 21):

| Type | Name | Description |
|---|---|---|
| SID | first | First glyph in range |
| Card16 | nLeft | Glyphs left in range (excluding first) |

"Each Range1 describes a group of sequential SIDs. … This format is particularly suited to charsets that are well ordered." "Format 2 differs from format 1 only in the size of the nLeft field … This format is most suitable for fonts with a large well-ordered charset — for example, for Asian CIDFonts."

And the reason subset CID fonts trip this — §18 CID-keyed Fonts:

"The charset data, although in the same format as non-CIDFonts, will represent CIDs rather than SIDs … In a complete CIDFont the charset table will specify an identity mapping … Subset CIDFonts will generally need to use a more complex charset table representing a non-identity mapping (where CID doesn't equal GID)."

Takeaway for bug #3: the range formats encode [first, nLeft] where nLeft is a count of sequential, ascending SIDs ("glyphs left in range"). A non-identity subset charset is not necessarily ascending, and BinUtils.rangify, which assumes a sorted sequence, then produces ranges that do not decode back to the input. The fix avoids this by falling back to array format (Format 0) whenever the SIDs are not ascending.

Before / after: GID → charstring-index mapping

In all three diagrams the example charmap has no entry for GID 0 (the common case when subsetting — the consumer never asked for .notdef, so it isn't in the charmap). items[n] is the original font's charstring for original GID n.

Bug #1: CharstringsIndex#encode_items

charmap = {0x20 => {old:1,new:1}, 0x21 => {old:2,new:2}}

| new GID | BEFORE (corrupt): encoded charstring | AFTER (fixed): encoded charstring |
|---|---|---|
| 0 | items[1] | items[0] = .notdef |
| 1 | items[2] | items[1] (old GID 1) |
| 2 | (dropped) | items[2] (old GID 2) |

Without the prepend, index 0 is silently taken by the first real glyph, so the whole INDEX is shifted left by one and the last glyph falls off the end. Every consumer that looks up "new GID 1" gets original glyph 2's outline.

Fix: new_items.unshift(items[0]) when the charmap has no GID 0.
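A minimal sketch of the prepend, using plain arrays rather than TTFunk objects. `subset_charstrings` is a hypothetical stand-in for CharstringsIndex#encode_items, and the way it detects "no GID 0" (checking the `:new` values) is an assumption made for illustration.

```ruby
# Hypothetical stand-in for CharstringsIndex#encode_items.
# items:   the original font's charstrings, indexed by old GID
# charmap: code point => { old: old_gid, new: new_gid }
def subset_charstrings(items, charmap)
  mappings = charmap.values.uniq { |m| m[:new] }.sort_by { |m| m[:new] }
  new_items = mappings.map { |m| items[m[:old]] }

  # Assumption: "charmap has no GID 0" means no mapping has new GID 0.
  has_notdef = mappings.any? { |m| m[:new].zero? }
  new_items.unshift(items[0]) unless has_notdef
  new_items
end

items = %w[notdef_cs space_cs exclam_cs] # placeholder charstrings
charmap = { 0x20 => { old: 1, new: 1 }, 0x21 => { old: 2, new: 2 } }
p subset_charstrings(items, charmap)
```

With the `unshift` in place, the array index of each charstring equals its new GID; without it the result would hold only two entries, with index 0 occupied by old GID 1.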

Bug #2: FdSelector#encode (coupled to #1)

FDSelect (array format) is one FD-index byte per glyph, starting at GID 0 (Table 27: "the .notdef glyph is included"). Once #1 puts .notdef at GID 0 in the CharStrings INDEX, FDSelect must have a matching entry at GID 0 or the two arrays drift apart. self[old] = the original glyph's FD index. (fd(n) below = FD index of original GID n.)

| new GID | CharStrings INDEX charstring (post #1) | FDSelect BEFORE (fd byte) | FDSelect AFTER (fd byte) |
|---|---|---|---|
| 0 | .notdef (items[0]) | fd(old 1) | fd(old 0) |
| 1 | items[1] (old 1) | fd(old 3) | fd(old 1) |
| 3 | items[3] (old 3) | (missing) | fd(old 3) |

Each glyph reads the FD index sitting one slot too early → wrong Font DICT → wrong Private DICT / local subrs → the charstring's callsubr operands resolve into a different subr INDEX → corrupt outline. This is exactly the "some CJK glyphs render as garbage / invisible" symptom.

Note the coupling: fixing #1 without #2 actually makes things worse for CID fonts, because #1 introduces the GID-0 slot in CharStrings that #2's entry is needed to align against. The two must land together.

Fix: new_indices.unshift([0, self[0]]) when no GID-0 entry is present.
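The same idea for the FD selector, again as a hedged sketch with plain arrays. `subset_fd_indices` and the `fd_for` lookup are hypothetical names standing in for FdSelector#encode and `self[old]`; the FD values are invented.

```ruby
# Hypothetical stand-in for the index collection inside FdSelector#encode.
# fd_for:  FD index per old GID (plays the role of self[old] in the PR)
# charmap: code point => { old: old_gid, new: new_gid }
def subset_fd_indices(fd_for, charmap)
  new_indices = charmap.values
                       .uniq { |m| m[:new] }
                       .sort_by { |m| m[:new] }
                       .map { |m| [m[:new], fd_for[m[:old]]] }

  # Prepend the .notdef entry so the array lines up with the CharStrings INDEX.
  has_notdef = new_indices.any? { |gid, _fd| gid.zero? }
  new_indices.unshift([0, fd_for[0]]) unless has_notdef
  new_indices
end

fd_for = [0, 4, 7] # invented FD indices for old GIDs 0, 1, 2
charmap = { 0x20 => { old: 1, new: 1 }, 0x21 => { old: 2, new: 2 } }
p subset_fd_indices(fd_for, charmap)
```

Each pair is [new GID, FD index], so taking the second element of every pair gives the per-glyph FD array with one entry per charstring, GID 0 included.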

Bug #3: Charset#encode (range vs array)

For non-CID fonts SIDs sorted by new GID happen to be ascending, so rangify is safe. For a subset CID font they are not (see §18 quote above), and rangify mis-encodes — see the numerical walkthrough below. Before/after at the format level:

| | BEFORE ❌ | AFTER ✅ |
|---|---|---|
| Decision | SIDs not ascending → rangify anyway → Range format chosen | SIDs not ascending → forced Array format 0 |
| Encoding | wrong [first, nLeft] ranges | SID-per-GID array |
| Result | ❌ wrong CIDs for some GIDs | ✅ every CID preserved |

Fix: detect sids.each_cons(2).all? { |a,b| a <= b }; if not sorted, force array format (total_range_size = ∞, total_array_size = 0).

rangify walkthrough (negative-difference corruption)

BinUtils.rangify (doc-commented "Turns a (sorted) sequence of values …"):

def rangify(values)
  values
    .slice_when { |a, b| b - a > 1 }      # start a new range when the gap > 1
    .map { |span| [span.first, span.length - 1] }   # => [first, nLeft]
end

The output pairs are [first_SID, nLeft], matching charset Range1/Range2 exactly (Tables 19/21: first + nLeft = "glyphs left in range, excluding first").

The minimal failure: a single descending pair

sids = [1549, 1509]

  • slice_when compares (1549, 1509): predicate b - a > 1 → 1509 - 1549 = -40, and -40 > 1 is false → no split → both stay in one span.
  • map: [span.first, span.length - 1] = [1549, 2 - 1] = [1549, 1].
  • A decoder reads first = 1549, nLeft = 1 → this range covers SID 1549 and SID 1550.

| new GID | intended SID | decoded SID | result |
|---|---|---|---|
| k | 1549 | 1549 | |
| k+1 | 1509 | 1550 | ❌ 1509 lost; reader sees 1550 |

The negative difference satisfies <= 1, so two unrelated SIDs are fused into one "sequential" range and the second glyph's CID is silently rewritten.

Full example matching the regression test

sids = [1, 10, 5, 8, 3] (the SIDs in new-GID order from charset_spec.rb)

rangify step by step (slice_when starts a new slice after the pair where the predicate is true):

| pair | b − a | > 1 ? | split here? |
|---|---|---|---|
| (1, 10) | 9 | yes | yes → close [1] |
| (10, 5) | −5 | no | no → keep 10,5 together |
| (5, 8) | 3 | yes | yes → close [10, 5] |
| (8, 3) | −5 | no | no → keep 8,3 together |

Spans: [1], [10, 5], [8, 3] → ranges [[1, 0], [10, 1], [8, 1]]

Decoded back as charset ranges (first, nLeft):

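| range | covers (first … first+nLeft) | assigned to GIDs | result |
|---|---|---|---|
| [1, 0] | SID 1 | GID 1 | |
| [10, 1] | SID 10, SID 11 | GID 2, GID 3 | ❌ GID 3 → 11, want 5 |
| [8, 1] | SID 8, SID 9 | GID 4, GID 5 | ❌ GID 5 → 9, want 3 |

| new GID | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| want | 1 | 10 | 5 | 8 | 3 |
| rangify | 1 | 10 | ❌ 11 | 8 | ❌ 9 |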

GID 3 and GID 5 are corrupted.
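The walkthrough can be replayed in a few lines of Ruby. `rangify` is copied from the listing above; `expand_ranges` and `corrupted_gids` are hypothetical helpers added here to mimic a charset reader and to compare the result against the input.

```ruby
# Copied from the BinUtils.rangify listing above.
def rangify(values)
  values
    .slice_when { |a, b| b - a > 1 }
    .map { |span| [span.first, span.length - 1] }
end

# Hypothetical decoder: expand [first, nLeft] pairs the way a charset reader would.
def expand_ranges(ranges)
  ranges.flat_map { |first, n_left| (first..first + n_left).to_a }
end

# Hypothetical check: new GIDs (starting at 1) whose SID changed in the round trip.
def corrupted_gids(sids)
  decoded = expand_ranges(rangify(sids))
  sids.each_index.reject { |i| sids[i] == decoded[i] }.map { |i| i + 1 }
end

sids = [1, 10, 5, 8, 3]
p rangify(sids)                # => [[1, 0], [10, 1], [8, 1]]
p expand_ranges(rangify(sids)) # => [1, 10, 11, 8, 9]
p corrupted_gids(sids)         # => [3, 5]
```

For an ascending input such as [1, 2, 3, 7] the same round trip returns the input unchanged, which is why the bug only shows up when SIDs in new-GID order decrease.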

Cost check that makes BEFORE actually pick the broken encoding:

  • range_max = max(0, 1, 1) = 1 → range_bytes = (log2(1)/8).floor + 1 = 1 → range_format8.
  • total_range_size = 2*3 + 1*3 = 9; total_array_size = 5 * 2 = 10.
  • total_array_size (10) <= total_range_size (9) is false → range format is chosen → corruption ships.
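The arithmetic in the cost check can be reproduced with a small helper. `charset_sizes` is a hypothetical function that mirrors only the formulas stated above (2 bytes per range start plus range_bytes per nLeft, 2 bytes per SID in array format); it is not the library's code and assumes range_max is at least 1.

```ruby
# Hypothetical helper mirroring the size comparison described above.
def charset_sizes(sids, ranges)
  range_max = ranges.map(&:last).max
  # Assumes range_max >= 1; log2(0) would not give a usable byte count.
  range_bytes = (Math.log2(range_max) / 8).floor + 1
  {
    range: (2 * ranges.size) + (range_bytes * ranges.size),
    array: sids.size * 2
  }
end

sizes = charset_sizes([1, 10, 5, 8, 3], [[1, 0], [10, 1], [8, 1]])
puts "range=#{sizes[:range]} array=#{sizes[:array]}" # prints "range=9 array=10"
puts(sizes[:array] <= sizes[:range] ? 'array format' : 'range format')
```

The comparison only measures size, so a set of ranges that is one byte smaller wins even though it no longer decodes to the original SIDs.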

AFTER the fix: sids_sorted is false → total_range_size = ∞, total_array_size = 0 → array format → encoded bytes 00 | 0001 000A 0005 0008 0003 → decoded SIDs [1, 10, 5, 8, 3] exactly. Matches the new spec: encoded.bytes[0] == 0 (array format) and the round-tripped SIDs equal the input.
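A sketch of the fixed path under the same assumptions: `ascending?`, `encode_array_format`, and `decode_array_format` are hypothetical names, and the byte layout follows Table 17 (one format byte, then one big-endian 16-bit SID per glyph after .notdef).

```ruby
# The sortedness test quoted in the fix description.
def ascending?(sids)
  sids.each_cons(2).all? { |a, b| a <= b }
end

# Format 0: format byte 0, then one big-endian 16-bit SID per glyph.
def encode_array_format(sids)
  ([0] + sids).pack('Cn*')
end

# Hypothetical reader used only to check the round trip.
def decode_array_format(bytes)
  format, *sids = bytes.unpack('Cn*')
  raise ArgumentError, 'not array format' unless format.zero?

  sids
end

sids = [1, 10, 5, 8, 3]
encoded = encode_array_format(sids)
puts ascending?(sids)             # prints "false", so array format is forced
puts encoded.unpack1('H*')        # prints "000001000a000500080003"
p decode_array_format(encoded)    # => [1, 10, 5, 8, 3]
```

Array format costs one byte more than the broken range encoding in this example, which is the price of preserving every CID.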

Section names and all table numbers / quotations above are transcribed directly from TN #5176. The conventional section numbers used (12 Encodings, 13 Charsets, 14 CharStrings INDEX, 18 CID-keyed Fonts, 19 FDSelect) are the standard numbering of the 2003 spec; if the published comment needs to be airtight, cite by table number (17–22 for charset, 27–29 for FDSelect), which is unambiguous.

When subsetting CID-keyed CFF fonts (e.g., NotoSerifCJK.ttc), the
encoded CFF data contains structural errors that cause certain glyphs
to not render in PDF viewers.

Three bugs are fixed:

1. CharstringsIndex#encode_items omits the .notdef charstring when
   the charmap has no mapping for GID 0. The CFF spec requires
   .notdef at index 0; its absence shifts all charstring indices by
   one.

2. FdSelector#encode similarly omits the .notdef entry, causing a
   mismatch between charstring indices and Font Dict assignments.
   Glyphs end up referencing the wrong Font Dict's local
   subroutines, producing corrupt outlines.

3. Charset#encode passes unsorted SIDs to BinUtils.rangify, which
   assumes sorted input. In CID-keyed fonts SIDs ordered by new GID
   are not necessarily ascending, so rangify merges unrelated SIDs
   into incorrect ranges. The fix falls back to array format when
   SIDs are not sorted.

Tested with NotoSerifCJK.ttc (65,535 glyphs, 18 Font Dicts).
Verified correct rendering and CFF structure with fonttools.

Ref: prawnpdf/prawn#1105