Honor model eos_token_id and bound local generation (v0.3.20) by codelion · Pull Request #317 · algorithmicsuperintelligence/optillm

codelion · 2026-07-05T03:50:14Z

Problem

Running a model whose ChatML end token differs from its tokenizer EOS through optillm's local inference caused runaway generation. Concretely, for dhara-250m, the chat template ends at <|im_end|> (49154) but the tokenizer's eos_token is <|end_of_text|> (1).

optillm (inference.py) hardcoded eos_token_id = self.tokenizer.eos_token_id (so it waited for token 1, which chat responses never emit) and defaulted max_new_tokens to 4096. Result: the model never stopped early and generated the full 4096 tokens (~800s at ~5 tok/s) on every call that didn't send max_tokens — including approach-internal calls. Well-behaved models like Qwen2.5 were unaffected because their tokenizer eos_token is <|im_end|>.
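The failure mode can be reproduced without any model. The following is a toy sketch, not optillm code: fake_token_stream and generate are hypothetical stand-ins for a model and a decode loop, and only the two token ids and the 4096 default come from the description above.

```python
# Toy illustration only: a fake "model" that emits a few content tokens,
# then the ChatML turn-end token, then keeps going if nobody stops it.
IM_END = 49154       # <|im_end|>, where the chat template ends
END_OF_TEXT = 1      # <|end_of_text|>, the tokenizer's eos_token


def fake_token_stream():
    """Yield three content tokens, then <|im_end|>, then filler forever."""
    yield from (101, 102, 103)
    yield IM_END
    while True:
        yield 104


def generate(stop_ids, max_new_tokens):
    """Collect tokens until one is in stop_ids or the budget runs out."""
    out = []
    for tok in fake_token_stream():
        if len(out) >= max_new_tokens:
            break
        out.append(tok)
        if tok in stop_ids:
            break
    return out


# Old behaviour: only the tokenizer EOS (1) counts as a stop token.
old = generate(stop_ids={END_OF_TEXT}, max_new_tokens=4096)
# Fixed behaviour: the turn-end token is also a stop token.
new = generate(stop_ids={IM_END, END_OF_TEXT}, max_new_tokens=4096)

print(len(old), len(new))  # 4096 4
print(4096 / 5)            # 819.2 seconds at ~5 tok/s
```

With only token 1 as a stop id the loop runs to the full budget, which at roughly 5 tok/s matches the ~800s per call reported above.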

This only affects the built-in local inference engine (OPTILLM_API_KEY=optillm); proxied providers handle stopping themselves.

Fixes

1. Honor the model's generation_config.eos_token_id. New _resolve_eos_token_ids() prefers the model's own generation-config EOS (which chat models set to their turn-end token) and merges the tokenizer EOS as a fallback. Applied to both PyTorch generate paths.

2. Env-configurable generation bound. OPTILLM_MAX_TOKENS overrides the default max_new_tokens (the default stays 4096), so a single generation can be bounded even when the request sends no max_tokens. An explicit request max_tokens still takes precedence. It is set to 128 in the CI jobs that run the small test model. This bound is the dependable stop for models like dhara that do not reliably emit their EOS.
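The EOS resolution in fix 1 could look roughly like the sketch below. This is an assumption about the shape of the logic, not the code merged in inference.py: resolve_eos_ids_sketch is a hypothetical name, the SimpleNamespace objects are stand-ins for a real model and tokenizer, and the id 7 is purely illustrative. It relies only on the fact that a generation config's eos_token_id may be an int, a list of ints, or None.

```python
from types import SimpleNamespace


def resolve_eos_ids_sketch(model, tokenizer):
    """Return de-duplicated stop ids, generation-config EOS first.

    Hypothetical helper: prefers model.generation_config.eos_token_id
    and appends the tokenizer EOS as a fallback.
    """
    ids = []
    gen_cfg = getattr(model, "generation_config", None)
    gen_eos = getattr(gen_cfg, "eos_token_id", None)
    if isinstance(gen_eos, int):
        gen_eos = [gen_eos]  # normalise a single id to a list
    candidates = list(gen_eos or []) + [getattr(tokenizer, "eos_token_id", None)]
    for tok in candidates:
        if tok is not None and tok not in ids:
            ids.append(tok)
    return ids


# Stand-in objects; a real model and tokenizer would be passed instead.
chat_model = SimpleNamespace(generation_config=SimpleNamespace(eos_token_id=49154))
chat_tokenizer = SimpleNamespace(eos_token_id=1)
print(resolve_eos_ids_sketch(chat_model, chat_tokenizer))  # [49154, 1]

# When both sources agree, the id appears once.
same_model = SimpleNamespace(generation_config=SimpleNamespace(eos_token_id=7))
same_tokenizer = SimpleNamespace(eos_token_id=7)
print(resolve_eos_ids_sketch(same_model, same_tokenizer))  # [7]

# No generation config at all: fall back to the tokenizer EOS.
bare_model = SimpleNamespace()
print(resolve_eos_ids_sketch(bare_model, chat_tokenizer))  # [1]
```

Returning a list keeps the tokenizer EOS as a valid stop token, so models whose two ids already agree behave as before.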

Testing

New unit tests (no model load) for _resolve_eos_token_ids and _default_max_new_tokens in test_batching.py (unit-tests job).
End-to-end: an unbounded request (no max_tokens) to a dhara server with OPTILLM_MAX_TOKENS=64 now returns 64 tokens, finish_reason=stop, ~30s (was 800s+).
README documents OPTILLM_MAX_TOKENS.
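The precedence described under Fixes (explicit request max_tokens, then OPTILLM_MAX_TOKENS, then 4096) can be checked without loading a model. The sketch below is illustrative only: both function names are hypothetical, and falling back to 4096 on a missing, non-numeric, or non-positive value is an assumption, not confirmed behaviour of optillm.

```python
import os

DEFAULT_MAX_NEW_TOKENS = 4096  # default stated in this PR


def default_max_new_tokens_sketch(env=None):
    """Read OPTILLM_MAX_TOKENS, falling back to 4096.

    Assumption: invalid or non-positive values are ignored.
    """
    env = os.environ if env is None else env
    raw = env.get("OPTILLM_MAX_TOKENS")
    if raw is None:
        return DEFAULT_MAX_NEW_TOKENS
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_MAX_NEW_TOKENS
    return value if value > 0 else DEFAULT_MAX_NEW_TOKENS


def effective_max_new_tokens_sketch(request_max_tokens=None, env=None):
    """An explicit request max_tokens wins over the env default."""
    if request_max_tokens is not None:
        return request_max_tokens
    return default_max_new_tokens_sketch(env)


# A plain dict stands in for os.environ so the checks need no real env.
print(effective_max_new_tokens_sketch(env={}))                                  # 4096
print(effective_max_new_tokens_sketch(env={"OPTILLM_MAX_TOKENS": "64"}))        # 64
print(effective_max_new_tokens_sketch(16, env={"OPTILLM_MAX_TOKENS": "64"}))    # 16
```

Passing the environment as a parameter is a design choice for the sketch: it lets the checks run without mutating the process environment.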

Two fixes for runaway local generation with models whose ChatML end token differs from their tokenizer EOS (e.g. dhara-250m: chat ends at <|im_end|>=49154 but tokenizer eos is <|end_of_text|>=1). optillm forced eos to the tokenizer's id and defaulted max_new_tokens to 4096, so such a model never stopped and generated 4096 tokens (~800s at ~5 tok/s) on every call that omitted max_tokens.

1. Resolve EOS from the model's generation_config.eos_token_id (merging the tokenizer eos as a fallback) instead of hardcoding tokenizer.eos_token_id. Applied to both PyTorch generate paths.

2. Make the default max_new_tokens env-configurable via OPTILLM_MAX_TOKENS (default 4096), covering the config builders and the InferenceClient.create() request paths. An explicit request max_tokens still wins. Set OPTILLM_MAX_TOKENS=128 in the CI jobs that run the small test model.

Adds unit tests (no model load) for both helpers. Verified end to end: an unbounded request with OPTILLM_MAX_TOKENS=64 stops at 64 tokens instead of 4096.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

codelion force-pushed the fix/eos-generation-config branch from 93a43d1 to 38034ba July 5, 2026 03:52

codelion merged commit 205a037 into main Jul 5, 2026
6 checks passed

codelion deleted the fix/eos-generation-config branch July 5, 2026 04:02
