Honor model eos_token_id and bound local generation (v0.3.20) #317
Merged
Two fixes for runaway local generation with models whose ChatML end token differs from their tokenizer EOS (e.g. dhara-250m: chat ends at <|im_end|>=49154 but tokenizer eos is <|end_of_text|>=1). optillm forced eos to the tokenizer's id and defaulted max_new_tokens to 4096, so such a model never stopped and generated 4096 tokens (~800s at ~5 tok/s) on every call that omitted max_tokens.

1. Resolve EOS from the model's generation_config.eos_token_id (merging the tokenizer eos as a fallback) instead of hardcoding tokenizer.eos_token_id. Applied to both PyTorch generate paths.
2. Make the default max_new_tokens env-configurable via OPTILLM_MAX_TOKENS (default 4096), covering the config builders and the InferenceClient.create() request paths. An explicit request max_tokens still wins.

Set OPTILLM_MAX_TOKENS=128 in the CI jobs that run the small test model. Adds unit tests (no model load) for both helpers.

Verified end to end: an unbounded request with OPTILLM_MAX_TOKENS=64 stops at 64 tokens instead of 4096.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
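For reviewers who want the EOS merge from the first fix spelled out, here is a minimal sketch. It is not the code from `inference.py`: the function name `resolve_eos_token_ids_sketch` and its signature are hypothetical, and it assumes only that the generation-config EOS may be `None`, a single id, or a list of ids, while the tokenizer EOS is a single id or `None`.

```python
from typing import List, Optional, Union


def resolve_eos_token_ids_sketch(
    generation_eos: Union[int, List[int], None],
    tokenizer_eos: Optional[int],
) -> List[int]:
    """Merge EOS ids, generation-config ids first, tokenizer id as fallback.

    Hypothetical helper for illustration; the real helper's signature
    is not shown in this PR description.
    """
    # Normalize the generation-config value to a list.
    if generation_eos is None:
        candidates: List[int] = []
    elif isinstance(generation_eos, int):
        candidates = [generation_eos]
    else:
        candidates = list(generation_eos)

    # Append the tokenizer EOS so generation still stops on it.
    if tokenizer_eos is not None:
        candidates.append(tokenizer_eos)

    # De-duplicate while preserving order.
    merged: List[int] = []
    for token_id in candidates:
        if token_id not in merged:
            merged.append(token_id)
    return merged


# Illustrative values taken from the dhara-250m example in this PR.
print(resolve_eos_token_ids_sketch(49154, 1))
print(resolve_eos_token_ids_sketch(None, 1))
```

With a model whose generation config names `<|im_end|>` (49154) and whose tokenizer EOS is 1, the merged list contains both ids, so generation can stop on either one. When the generation config has no EOS, the result falls back to the tokenizer id alone.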
Problem

Running a model whose ChatML end token differs from its tokenizer EOS through optillm's local inference caused runaway generation. Concretely, for dhara-250m the chat template ends at `<|im_end|>` (49154), but the tokenizer's `eos_token` is `<|end_of_text|>` (1).

optillm (`inference.py`) hardcoded `eos_token_id = self.tokenizer.eos_token_id`, so it waited for token 1, which chat responses never emit, and defaulted `max_new_tokens` to 4096. Result: the model never stopped early and generated the full 4096 tokens (~800s at ~5 tok/s) on every call that didn't send `max_tokens`, including approach-internal calls. Well-behaved models like Qwen2.5 were unaffected because their tokenizer `eos_token` is `<|im_end|>`.

This only affects the built-in local inference engine (`OPTILLM_API_KEY=optillm`); proxied providers handle stopping themselves.

Fixes
1. Honor the model's `generation_config.eos_token_id`. New `_resolve_eos_token_ids()` prefers the model's own generation-config EOS (which chat models set to their turn-end token) and merges the tokenizer EOS as a fallback. Applied to both PyTorch generate paths.

2. Env-configurable generation bound. `OPTILLM_MAX_TOKENS` overrides the default `max_new_tokens` (default stays 4096), so a single generation can be bounded even when the request sends no `max_tokens`. An explicit request `max_tokens` still takes precedence. Set to `128` in the CI jobs that run the small test model. This is the reliable stop for models like dhara that don't reliably emit their EOS.

Testing
- Unit tests for `_resolve_eos_token_ids` and `_default_max_new_tokens` in `test_batching.py` (unit-tests job).
- End to end: an unbounded request (no `max_tokens`) to a dhara server with `OPTILLM_MAX_TOKENS=64` now returns 64 tokens, finish_reason=stop, ~30s (was 800s+).
- `README` documents `OPTILLM_MAX_TOKENS`.
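
The precedence for the generation bound (explicit request `max_tokens`, then `OPTILLM_MAX_TOKENS`, then 4096) can be sketched as below. This is an illustration under stated assumptions, not the merged code: both function names are hypothetical, and falling back to 4096 on a non-numeric or non-positive environment value is an assumption of this sketch, since the PR does not say how invalid values are handled.

```python
import os
from typing import Mapping, Optional

# Built-in default named in the PR description.
FALLBACK_MAX_NEW_TOKENS = 4096


def default_max_new_tokens_sketch(env: Mapping[str, str] = os.environ) -> int:
    """Read the default bound from OPTILLM_MAX_TOKENS, else use 4096.

    Hypothetical helper; invalid values fall back to 4096 (assumption).
    """
    raw = env.get("OPTILLM_MAX_TOKENS")
    if raw is None:
        return FALLBACK_MAX_NEW_TOKENS
    try:
        value = int(raw)
    except ValueError:
        return FALLBACK_MAX_NEW_TOKENS
    return value if value > 0 else FALLBACK_MAX_NEW_TOKENS


def effective_max_new_tokens_sketch(
    request_max_tokens: Optional[int],
    env: Mapping[str, str] = os.environ,
) -> int:
    """An explicit request value wins; otherwise use the env-derived default."""
    if request_max_tokens is not None:
        return request_max_tokens
    return default_max_new_tokens_sketch(env)


# A plain dict stands in for the process environment.
ci_env = {"OPTILLM_MAX_TOKENS": "64"}
print(effective_max_new_tokens_sketch(None, ci_env))  # request omitted max_tokens
print(effective_max_new_tokens_sketch(256, ci_env))   # explicit request value
```

Passing the environment as a mapping keeps the helper testable without loading a model or mutating `os.environ`, in the same spirit as the no-model-load unit tests listed above.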