Skip to content

Add public SentencePieceTokenizer factory methods for Unigram from vocab list and tokenizer.json#7625

Open
Copilot wants to merge 15 commits into
mainfrom
copilot/add-sentencepiece-tokenizer-api
Open

Add public SentencePieceTokenizer factory methods for Unigram from vocab list and tokenizer.json#7625
Copilot wants to merge 15 commits into
mainfrom
copilot/add-sentencepiece-tokenizer-api

Conversation

Copilot AI commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

SentencePieceTokenizer only exposed Create(Stream) requiring a SentencePiece protobuf (.model), making it impossible to load Hugging Face JSON-only Unigram tokenizers that have no .model file.

New public APIs

From in-memory vocab:

SentencePieceTokenizer.Create(
    IEnumerable<(string Piece, float Score)> vocab,
    int unkId,
    bool addBeginningOfSentence = true,
    bool addEndOfSentence = false,
    ReadOnlySpan<byte> precompiledCharsMap = default,
    bool addDummyPrefix = true,
    bool escapeWhiteSpaces = true,
    bool treatWhitespaceAsSuffix = false,
    IReadOnlyDictionary<string, int>? specialTokens = null)

From tokenizer.json:

SentencePieceTokenizer.CreateFromTokenizerJson(
    Stream tokenizerJsonStream,
    bool addBeginningOfSentence = true,
    bool addEndOfSentence = false,
    IReadOnlyDictionary<string, int>? specialTokens = null)

CreateFromTokenizerJson reads model.vocab, model.unk_id, extracts precompiled_charsmap from a Precompiled or Sequence normalizer, and reads Metaspace pre-tokenizer settings (add_prefix_space, replacement, prepend_scheme). It validates model.type == "Unigram".

Internal changes

  • SentencePieceBaseModel: new constructor taking individual config parameters instead of ModelProto
  • SentencePieceUnigramModel: new constructors building vocab from IReadOnlyList<(string, float)>; BOS/EOS/PAD IDs auto-detected by piece name (<s>, </s>, <pad>) with SentencePiece-conventional positional fallbacks

Note on token IDs

HF tokenizer.json typically uses a different special-token ordering than the SentencePiece protobuf (e.g. <s>=0, <pad>=1, </s>=2, <unk>=3 vs. <unk>=0, <s>=1, </s>=2). Piece strings produced are identical; numeric IDs will differ by the vocab offset introduced by the extra special tokens.

…erJson APIs

Co-authored-by: ericstj <8918108+ericstj@users.noreply.github.com>
Copilot AI changed the title [WIP] Add public way to construct SentencePieceTokenizer from tokenizer.json Add public SentencePieceTokenizer factory methods for Unigram from vocab list and tokenizer.json Jun 10, 2026
Copilot AI requested a review from ericstj June 10, 2026 23:42

@ericstj ericstj left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this — the in-memory Create(vocab, ...) overload is clean and the ID-preservation (JSON vocab index = token id) is the right call. I implemented this same JSON-only Unigram capability recently against real Hugging Face models, and hit two correctness issues that this PR's single test model happens to mask. Details inline; summary here.

Bugs (untested by the current suite)

  1. BOS/EOS positional fallback corrupts real pieces. FindSpecialTokenId(pieces, "<s>", 1) / ("</s>", 2) fall back to positions 1/2 when the vocab has no piece literally named <s>/</s>. Many HF Unigram tokenizers don't use those names — e.g. minishlab/potion-multilingual-128M (bge-m3 family) has unk_id=1, vocab [0]="[PAD]", [1]="[UNK]", [2]=",". There eosId→2 marks "," as Control and drops it from the Viterbi trie (it can never be emitted), and bosId→1 collides with unkId and clobbers the unknown entry. This is structural (independent of addBos/addEos).

  2. Normalizer steps beyond Precompiled are silently dropped. ExtractPrecompiledCharsMap extracts only the charsmap and discards sibling normalizers. Real Unigram models often have a richer chain (potion/bge-m3: Sequence[Precompiled, Replace(punctuation spacing), Replace("\\s+"->" "), Strip]), which SentencePieceNormalizer cannot reproduce — so CreateFromTokenizerJson silently yields different tokens than HF. Since the charsmap must run before those Replace steps, they can't just be reordered into SP; at minimum this should throw on unrecognized normalizer types rather than silently ignore them.

Why the test stays green

Paraphrase-multilingual-MiniLM-L12-v2 names its specials (<s>=0, </s>=2) so the fallback never fires, and its normalizer is a single Precompiled, so the dropped-sibling path is never hit. Recommend adding fixtures that (a) place specials at non-conventional positions / omit <s>/</s>, and (b) use a Sequence normalizer with Replace/Strip, asserting against HF reference ids.

Minor

  • removeExtraWhitespaces is hard-coded true in both factories rather than derived from the JSON.
  • added_tokens from the JSON aren't auto-wired; correctness depends on the caller passing specialTokens. Worth documenting or reading them.

(Posting as comments only — not an approval or change request.)

Comment thread src/Microsoft.ML.Tokenizers/Model/SentencePieceUnigramModel.cs Outdated
Comment thread src/Microsoft.ML.Tokenizers/Model/SentencePieceTokenizer.cs
Comment thread src/Microsoft.ML.Tokenizers/Model/SentencePieceTokenizer.cs Outdated
@ericstj

ericstj commented Jun 11, 2026

Copy link
Copy Markdown
Member

@copilot please address feedback

…end_scheme handling

Co-authored-by: ericstj <8918108+ericstj@users.noreply.github.com>
Copilot AI requested a review from ericstj June 11, 2026 00:50
Comment thread src/Microsoft.ML.Tokenizers/Model/SentencePieceTokenizer.cs Outdated
Comment thread src/Microsoft.ML.Tokenizers/Model/SentencePieceTokenizer.cs Outdated
Comment thread src/Microsoft.ML.Tokenizers/Model/SentencePieceUnigramModel.cs
Comment thread test/Microsoft.ML.Tokenizers.Tests/UnigramTests.cs
@ericstj

ericstj commented Jun 11, 2026

Copy link
Copy Markdown
Member

@copilot address feedback

…tion, and add tests

Co-authored-by: ericstj <8918108+ericstj@users.noreply.github.com>
Copilot AI requested a review from ericstj June 11, 2026 01:23
@codecov

codecov Bot commented Jun 11, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 86.69898% with 247 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.74%. Comparing base (548d4d0) to head (e1ff3c5).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
...soft.ML.Tokenizers/Model/SentencePieceTokenizer.cs 70.82% 81 Missing and 43 partials ⚠️
...izers/Normalizer/SentencePieceNormalizationStep.cs 73.91% 39 Missing and 27 partials ⚠️
...t.ML.Tokenizers/Model/SentencePieceUnigramModel.cs 77.55% 37 Missing and 7 partials ⚠️
...L.Tokenizers/Normalizer/SentencePieceNormalizer.cs 71.11% 8 Missing and 5 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7625      +/-   ##
==========================================
+ Coverage   69.59%   69.74%   +0.14%     
==========================================
  Files        1484     1485       +1     
  Lines      273606   275443    +1837     
  Branches    27949    28161     +212     
==========================================
+ Hits       190410   192096    +1686     
- Misses      75832    75916      +84     
- Partials     7364     7431      +67     
Flag Coverage Δ
Debug 69.74% <86.69%> (+0.14%) ⬆️
production 63.93% <74.21%> (+0.08%) ⬆️
test 89.78% <100.00%> (+0.15%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...soft.ML.Tokenizers/Model/SentencePieceBaseModel.cs 82.77% <100.00%> (+2.33%) ⬆️
test/Microsoft.ML.Tokenizers.Tests/UnigramTests.cs 97.63% <100.00%> (+4.66%) ⬆️
...L.Tokenizers/Normalizer/SentencePieceNormalizer.cs 75.13% <71.11%> (+4.80%) ⬆️
...t.ML.Tokenizers/Model/SentencePieceUnigramModel.cs 67.23% <77.55%> (+6.84%) ⬆️
...izers/Normalizer/SentencePieceNormalizationStep.cs 73.91% <73.91%> (ø)
...soft.ML.Tokenizers/Model/SentencePieceTokenizer.cs 74.45% <70.82%> (-18.05%) ⬇️

... and 7 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ericstj ericstj marked this pull request as ready for review June 11, 2026 03:27
Copilot AI review requested due to automatic review settings June 11, 2026 03:27
@ericstj ericstj requested a review from tarekgh June 11, 2026 03:27

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR extends SentencePieceTokenizer to support Hugging Face JSON-only Unigram tokenizers by adding new public factory APIs that can construct a Unigram tokenizer from either an in-memory vocab list or a tokenizer.json stream, avoiding the current requirement for a SentencePiece .model protobuf.

Changes:

  • Add SentencePieceTokenizer.Create(IEnumerable<(string Piece, float Score)> vocab, ...) for constructing a Unigram tokenizer directly from a vocab list.
  • Add SentencePieceTokenizer.CreateFromTokenizerJson(Stream tokenizerJsonStream, ...) for parsing HF tokenizer.json (Unigram) including vocab, unk_id, precompiled charsmap, and Metaspace settings.
  • Add internal constructors/refactoring to build a SentencePieceUnigramModel from vocab pieces and config values, plus new tests covering these creation paths.
Show a summary per file
File Description
test/Microsoft.ML.Tokenizers.Tests/UnigramTests.cs Adds unit tests for vocab-based and tokenizer.json-based Unigram construction and behavior parity checks.
src/Microsoft.ML.Tokenizers/Model/SentencePieceUnigramModel.cs Adds Unigram model constructors that build vocab/trie from (piece, score) inputs and detect special tokens by name.
src/Microsoft.ML.Tokenizers/Model/SentencePieceTokenizer.cs Adds public factories for vocab and tokenizer.json, plus JSON parsing helpers for normalizer/pre-tokenizer extraction.
src/Microsoft.ML.Tokenizers/Model/SentencePieceBaseModel.cs Adds a new base-model constructor taking explicit config/token IDs instead of ModelProto.

Copilot's findings

  • Files reviewed: 4/4 changed files
  • Comments generated: 4

Comment thread src/Microsoft.ML.Tokenizers/Model/SentencePieceBaseModel.cs
Comment thread src/Microsoft.ML.Tokenizers/Model/SentencePieceTokenizer.cs Outdated
Comment thread src/Microsoft.ML.Tokenizers/Model/SentencePieceTokenizer.cs
Comment thread src/Microsoft.ML.Tokenizers/Model/SentencePieceTokenizer.cs Outdated
@tarekgh

tarekgh commented Jun 11, 2026

Copy link
Copy Markdown
Member

CreateFromTokenizerJson ignores model.byte_fallback

The new JSON parser never reads the byte_fallback flag from the model section. In the protobuf path, SentencePieceBaseModel sets ByteFallback = modelProto.TrainerSpec.ByteFallback (SentencePieceBaseModel.cs:36), but the new Unigram constructor passes byteFallback: false as a hardcoded literal:

: base(addBos, addEos,
       ...,
       addDummyPrefix, escapeWhiteSpaces, treatWhitespaceAsSuffix, byteFallback: false,
       precompiledCharsmap, removeExtraWhitespaces, specialTokens)

Hugging Face Unigram tokenizers serialize this as a top-level model.byte_fallback boolean, for example:

"model": {
  "type": "Unigram",
  "unk_id": 0,
  "byte_fallback": true,
  "vocab": [ ... ]
}

For any model that enables byte fallback, hardcoding false means unknown characters will be emitted as the unknown token instead of being decomposed into <0x00>..<0xFF> byte pieces, which changes both the produced token IDs and the decoded output.

Suggested fix: read the flag in CreateFromTokenizerJson and thread it through to the model constructor:

bool byteFallback = modelElement.TryGetProperty("byte_fallback", out JsonElement bf) && bf.GetBoolean();

defaulting to false when the property is absent, then pass byteFallback instead of the literal.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot's findings

  • Files reviewed: 6/6 changed files
  • Comments generated: 7

Comment thread src/Microsoft.ML.Tokenizers/Model/SentencePieceTokenizer.cs
Comment thread src/Microsoft.ML.Tokenizers/Model/SentencePieceTokenizer.cs
Comment thread src/Microsoft.ML.Tokenizers/Model/SentencePieceTokenizer.cs
Comment thread src/Microsoft.ML.Tokenizers/Model/SentencePieceTokenizer.cs
Comment thread src/Microsoft.ML.Tokenizers/Model/SentencePieceTokenizer.cs
Comment thread src/Microsoft.ML.Tokenizers/Model/SentencePieceTokenizer.cs

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot's findings

  • Files reviewed: 6/6 changed files
  • Comments generated: 3

Comment thread src/Microsoft.ML.Tokenizers/Model/SentencePieceTokenizer.cs
Comment thread src/Microsoft.ML.Tokenizers/Model/SentencePieceTokenizer.cs
Comment thread src/Microsoft.ML.Tokenizers/Model/SentencePieceTokenizer.cs

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot's findings

  • Files reviewed: 6/6 changed files
  • Comments generated: 4

Comment thread src/Microsoft.ML.Tokenizers/Model/SentencePieceTokenizer.cs
Comment thread src/Microsoft.ML.Tokenizers/Model/SentencePieceTokenizer.cs
Comment thread src/Microsoft.ML.Tokenizers/Model/SentencePieceTokenizer.cs
…loader

Harden the tokenizer.json parser so malformed inputs fail with a diagnostic
InvalidDataException instead of an InvalidOperationException/ArgumentException
escaping from CreateFromTokenizerJson:

- added_tokens: require string content and numeric id.
- post_processor TemplateProcessing: require a string SpecialToken.id and a numeric
  special_tokens ids[0].
- RobertaProcessing/BertProcessing cls/sep: check the [token, id] value kinds before
  reading.
- Metaspace pre_tokenizer: require a boolean add_prefix_space and a string replacement.
- Replace normalizer: require an object pattern with a string String/Regex, and wrap
  invalid Regex construction as InvalidDataException.
- Read all normalizer/pre_tokenizer/post_processor 'type' fields through a shared
  GetStringOrNull helper so a non-string type is handled in a controlled way rather
  than throwing.

Adds tests covering the new validation paths.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot's findings

  • Files reviewed: 6/6 changed files
  • Comments generated: 1

Comment thread src/Microsoft.ML.Tokenizers/Model/SentencePieceTokenizer.cs
…nsistency

ResolveTemplateTokenId now verifies that a post_processor special_tokens id maps
back to the referenced token (via added tokens or the vocabulary), mirroring the
RobertaProcessing/BertProcessing affix validation, so an inconsistent tokenizer.json
fails with InvalidDataException instead of emitting an id whose decoded token differs
from the template's token name.

Adds an inconsistent-template-special-token test.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot's findings

  • Files reviewed: 6/6 changed files
  • Comments generated: 3

Comment thread src/Microsoft.ML.Tokenizers/Model/SentencePieceUnigramModel.cs
Comment thread test/Microsoft.ML.Tokenizers.Tests/UnigramTests.cs Outdated
Comment thread test/Microsoft.ML.Tokenizers.Tests/UnigramTests.cs Outdated
CreateFromTokenizerJson returns SentencePieceTokenizer, so the explicit casts in the
new tests were unnecessary.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot's findings

  • Files reviewed: 6/6 changed files
  • Comments generated: 3

- Require a normalizer entry to be a JSON object at the start of Build, so a non-object
  entry in a Sequence array fails with InvalidDataException instead of InvalidOperationException.
- Validate the Precompiled 'precompiled_charsmap' is a string before reading it.
- Validate the Prepend 'prepend' is a string before reading it.

Adds tests for the non-object Sequence entry, non-string Precompiled charsmap, and
non-string Prepend cases.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot's findings

  • Files reviewed: 6/6 changed files
  • Comments generated: 0 new

@ericstj ericstj marked this pull request as ready for review June 18, 2026 17:03
@tarekgh

tarekgh commented Jun 18, 2026

Copy link
Copy Markdown
Member

Overall the changes look good.

A few gaps, since a full tokenizer.json is broader than the Unigram model block:

  1. unk_id is required — HF allows "unk_id": null (spec-legal), which we reject.
  2. Non-Metaspace splitting pre-tokenizers are silently ignored (Punctuation, Split, Digits, ByteLevel, BertPreTokenizer). Riskiest one since it fails silently; consider throwing NotSupportedException on unrecognized pre-tokenizer types, like the normalizer chain already does.
  3. The decoder section is not read — fine for standard configs, ignores non-standard decoder chains.

It would be good to address 1 and 2 in this PR. None block merging for the standard HF Unigram/Metaspace scenario the tests cover. Suggest opening a tracking issue to support the remaining parts (e.g. decoder) when a real model needs them.

@tarekgh tarekgh left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left minor comment, LGTM otherwise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[API] Public way to construct SentencePieceTokenizer (Unigram) from tokenizer.json / in-memory pieces+scores

4 participants