perf(guided-decoding): optimize with async D2H copy and xgrammar v0.2.1 by windreamer · Pull Request #4605 · InternLM/lmdeploy

windreamer · 2026-05-21T03:58:56Z

Motivation

Improve guided decoding performance in TurboMind by reducing synchronization overhead and upgrading to xgrammar v0.2.1 for API compatibility.

Modification

This PR includes the following optimizations and fixes:

Performance Optimizations

Async D2H Copy Overlap (main optimization):
- Split GuidedDecoding::Update() into ScheduleUpdate() + FinishUpdate() to enable D2H copy overlap with GPU work
- Added secondary CUDA stream (d2h_stream_) for asynchronous output_ids D2H transfer
- Added CUDA events (sampling_done_, d2h_done_) for stream synchronization
- D2H copy now overlaps with AppendTokenIds and stop_criteria GPU kernels on the main stream
- Eliminates a blocking cudaStreamSynchronize in the decode step hot path
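
The two-phase split above can be pictured with a CPU-only analogy. This sketch is not the TurboMind code and uses no CUDA: a single-worker thread pool stands in for d2h_stream_, and the returned future stands in for the d2h_done_ event. The names TwoPhaseUpdater and decode_step are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


class TwoPhaseUpdater:
    """CPU-only analogy of ScheduleUpdate()/FinishUpdate(); not the TurboMind code."""

    def __init__(self):
        # Stands in for the secondary stream (d2h_stream_): one background worker.
        self._copier = ThreadPoolExecutor(max_workers=1)
        self._pending = None  # stands in for the d2h_done_ event
        self.accepted = []

    def schedule_update(self, device_output_ids):
        # Start the "copy" and return immediately so the caller can keep working.
        self._pending = self._copier.submit(list, device_output_ids)

    def finish_update(self):
        # Block only on the copy itself, then do the CPU-side token acceptance.
        host_ids = self._pending.result()
        self.accepted.extend(host_ids)
        self._pending = None
        return host_ids

    def close(self):
        self._copier.shutdown()


def decode_step(updater, sampled_ids):
    updater.schedule_update(sampled_ids)
    # Work that does not need the host copy (AppendTokenIds / stop_criteria
    # in the PR) would run here, between the two phases.
    other_work = sum(sampled_ids)
    host_ids = updater.finish_update()
    return host_ids, other_work
```

The point of the split is that the wait in finish_update() covers only the copy, so whatever runs between the two calls no longer sits behind a full synchronization.
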
Grammar Compiler Caching:
- Lazy-initialized GrammarCompiler shared across all requests in TurboMindModel
- Avoids recreating compiler + tokenizer info on every request
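
A minimal sketch of the lazy-initialization idea, assuming a hypothetical LazyCompilerHolder and a caller-supplied factory; it does not reproduce the actual code in turbomind.py. The lock is an assumption added here for illustration, since the PR description does not say whether the real implementation guards against concurrent first use.

```python
import threading


class LazyCompilerHolder:
    """Builds an expensive object once, on first use, and then reuses it."""

    def __init__(self, factory):
        self._factory = factory  # hypothetical callable that builds the compiler
        self._compiler = None
        self._lock = threading.Lock()  # assumption: guard concurrent first use

    def get(self):
        # Double-checked locking so concurrent requests build at most once.
        if self._compiler is None:
            with self._lock:
                if self._compiler is None:
                    self._compiler = self._factory()
        return self._compiler
```

Every request calls get(); only the first call pays the construction cost, and later calls return the same object.
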

Correctness Fixes

Limit Iteration to Active Generation Slots:
- Limited iteration in FillMask, ScheduleUpdate, and FinishUpdate to generation_size (= logits.shape(0))
- Previous code iterated over all matchers.size() entries, including idle/prefill slots with stale output_ids
- Prevents processing stale data beyond active generation slots
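
The effect of the bound can be shown with a small sketch. StubMatcher and accept_sampled_tokens are hypothetical stand-ins, not xgrammar or TurboMind types; the only thing taken from the PR is the rule that slots at index generation_size and beyond must not be advanced.

```python
class StubMatcher:
    """Hypothetical stand-in for a grammar matcher; records accepted tokens."""

    def __init__(self):
        self.terminated = False
        self.tokens = []

    def accept(self, token_id):
        self.tokens.append(token_id)


def accept_sampled_tokens(matchers, output_ids, generation_size):
    """Advance only the first `generation_size` matchers; later slots are skipped."""
    accepted = []
    for i in range(min(generation_size, len(matchers))):
        matcher = matchers[i]
        if matcher is None or matcher.terminated:
            continue
        matcher.accept(output_ids[i])
        accepted.append(i)
    return accepted


matchers = [StubMatcher(), None, StubMatcher(), StubMatcher()]
output_ids = [11, 22, 33, 99]  # slot 3 was not sampled this step; 99 is stale
touched = accept_sampled_tokens(matchers, output_ids, generation_size=3)
```

With generation_size=3, the matcher in slot 3 never sees the stale value 99, which is the failure mode the previous matchers.size() loop allowed.
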

Dependency Updates

xgrammar v0.2.1 Upgrade:
- Updated from v0.1.27 to v0.2.1 in CMakeLists.txt
- Added required debug_print=False argument to accept_token() API
- Ensured token conversion to plain Python int before passing to accept_token()

Build System

CMake Dependency Cleanup:
- Changed xgrammar and core from PRIVATE to PUBLIC linkage in guided_decoding library
- Ensures proper transitive dependency resolution for downstream targets

BC-breaking (Optional)

No backward compatibility issues. The API changes are internal to the guided decoding implementations of the TurboMind and PyTorch engines.

Use cases (Optional)

This optimization benefits all users of structured generation features:

JSON schema constrained generation
Regex pattern constrained generation
JSON object generation

Example usage:

from lmdeploy import pipeline, GenerationConfig

pipe = pipeline('Qwen/Qwen2.5-7B-Instruct')
response = pipe(
    'Generate a user profile',
    gen_config=GenerationConfig(
        response_format={'type': 'json_schema', 'json_schema': {...}}
    )
)

Checklist

✅ Pre-commit or other linting tools are used to fix the potential lint issues.
✅ The modification is covered by complete unit tests. All tests pass:
- Qwen/Qwen3-0.6B: all tests pass
- Qwen/Qwen3-VL-2B-Instruct: all tests pass (JSON schema, regex schema, json_object)
✅ No dependency on downstream projects with version changes.
✅ Documentation not required for internal optimization changes.

Test Results

All guided decoding tests pass with the new implementation:

JSON schema constrained generation: ✅
Regex pattern constrained generation: ✅
JSON object generation: ✅

Copilot

Pull request overview

This PR optimizes guided decoding performance in TurboMind by upgrading xgrammar and refactoring guided-decoding paths to use batched matcher APIs plus CUDA stream/event orchestration to overlap host-device transfers with GPU work. It also reduces Python-side overhead by reusing a lazily constructed GrammarCompiler and fixes a PyTorch guided-decoding type mismatch introduced by the xgrammar upgrade.

Changes:

Upgrade xgrammar to v0.2.1 and switch C++ guided decoding to batched matcher APIs (BatchFillNextTokenBitmask / BatchAcceptToken).
Overlap output_ids D2H copies with GPU kernels via a secondary CUDA stream and split guided decoding update into ScheduleUpdate + FinishUpdate.
Cache GrammarCompiler per TurboMind instance (lazy init) and fix PyTorch accept_token to pass a Python int via .item().

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

File	Description
src/turbomind/generation/guided_decoding.h	Adds batched matcher + CUDA stream/event members; splits `Update` into two phases.
src/turbomind/generation/guided_decoding.cc	Implements batched xgrammar calls and async D2H overlap using events/streams; adds `needs_apply` gating.
src/turbomind/generation/generation.cc	Integrates `ScheduleUpdate`/`FinishUpdate` around `AppendTokenIds` and `stop_criteria` to enable overlap.
src/turbomind/generation/CMakeLists.txt	Exposes xgrammar/core linkage publicly for guided_decoding consumers.
lmdeploy/turbomind/turbomind.py	Introduces lazy-shared `GrammarCompiler` and removes per-request instantiation.
lmdeploy/pytorch/engine/logits_process.py	Passes token id as Python int (`.item()`) to guided decoding manager.
CMakeLists.txt	Bumps FetchContent xgrammar tag to v0.2.1.

Comments suppressed due to low confidence (1)

src/turbomind/generation/guided_decoding.cc:135

Similarly, FinishUpdate() allocates active_matchers and active_token_ids every step without reserving. Reserving (or persisting these vectors in the phase Data) would reduce per-token allocation overhead, especially for large batch sizes.

            // Collect active matchers and their token IDs for batch AcceptToken
            std::vector<xgrammar::GrammarMatcher> active_matchers;
            std::vector<int32_t>                  active_token_ids;

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (2)

src/turbomind/generation/guided_decoding.cc:129

FinishUpdate() calls d2h_done_.Sync() on all TP ranks even though only rank 0 performs BatchAcceptToken. On non-zero ranks this host-side wait is pure overhead (and can also introduce unnecessary CPU/GPU synchronization points). Consider moving the sync + matcher update under the tp_group_->rank() == 0 branch, or early-returning for non-zero ranks.

    if (auto& d = *data_.at(phase); d.active) {
        // Wait only for the D2H copy to complete — the main stream's
        // AppendTokenIds + stop_criteria may still be executing on GPU.
        d2h_done_.Sync();

        if (tp_group_->rank() == 0) {

lmdeploy/pytorch/engine/logits_process.py:484

The guided-decoding accept_token call site changed to pass a Python int, but there is no unit test in tests/pytorch/engine/test_logits_process.py covering the guided_decoding_manager integration path (e.g., that accept_token is invoked with the expected token values/types). Adding a small test with a stub GuidedDecodingManager would help prevent regressions when sampling runs on CUDA tensors.

        if self.guided_decoding_manager and self.guided_processors:
            for i, processor in self.guided_processors.items():
                self.guided_decoding_manager.accept_token(processor, result[i].item())
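
A sketch of the kind of stub-based test the comment suggests, written without PyTorch so it runs anywhere. StubGuidedDecodingManager, FakeScalar, and update_guided are hypothetical names introduced for illustration; FakeScalar only mimics the .item() method of a tensor element, and the loop mirrors the snippet quoted above rather than the real logits_process.py.

```python
class StubGuidedDecodingManager:
    """Hypothetical stub that records what accept_token receives."""

    def __init__(self):
        self.calls = []

    def accept_token(self, processor, token):
        # Reject anything that is not a plain Python int.
        if type(token) is not int:
            raise TypeError(f"expected plain int, got {type(token).__name__}")
        self.calls.append((processor, token))


class FakeScalar:
    """Mimics a tensor element: not an int itself, but exposes .item()."""

    def __init__(self, value):
        self._value = value

    def item(self):
        return int(self._value)


def update_guided(manager, processors, result):
    # Same shape as the quoted loop: convert with .item() before the call.
    for i, processor in processors.items():
        manager.accept_token(processor, result[i].item())


manager = StubGuidedDecodingManager()
update_guided(manager, {0: "p0", 2: "p2"},
              [FakeScalar(7), FakeScalar(8), FakeScalar(9)])
```

A real test would swap FakeScalar for an actual tensor and assert on manager.calls in the same way.
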

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

src/turbomind/generation/guided_decoding.cc:139

FinishUpdate() iterates over all d.matchers and calls batch AcceptToken using output_ids_buf_[i] for every non-terminated matcher. However, only the first generation_size sequences receive a newly sampled token each step; for the remaining slots output_ids_buf_ may contain stale values (since sampling runs with batch_size = logits.shape(0)). This can advance grammar state incorrectly for sequences that were not generating this step. Limit the loop to generation_size (saved from ScheduleUpdate()), or gate on the per-request generating mask.

    if (auto& d = *data_.at(phase); d.active && tp_group_->rank() == 0) {
        // Wait only for the D2H copy to complete — the main stream's
        // AppendTokenIds + stop_criteria may still be executing on GPU.
        d2h_done_.Sync();

        // Collect active matchers and their token IDs for batch AcceptToken
        std::vector<xgrammar::GrammarMatcher> active_matchers;
        std::vector<int32_t>                  active_token_ids;
        active_matchers.reserve(d.matchers.size());
        active_token_ids.reserve(d.matchers.size());

        for (size_t i = 0; i < d.matchers.size(); ++i) {
            if (const auto& m = d.matchers[i]; m && !m->IsTerminated()) {
                active_matchers.emplace_back(*m);
                active_token_ids.emplace_back(output_ids_buf_[i]);
            }

FillMask, ScheduleUpdate, and FinishUpdate previously iterated over d.matchers.size() entries, but only the first generation_size (= logits.shape(0)) slots are actively generating. Entries beyond that index contain stale output_ids and unused bitmasks.
- FillMask: limit matcher iteration and reserve to gs = logits.shape(0)
- ScheduleUpdate: copy only gs output_ids entries for D2H transfer
- FinishUpdate: add TensorMap& env param, iterate only over gs slots

Fixes review comments on PR InternLM#4605 (3280137130, 3280137198).

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.

…update

… CUDA stream

Split GuidedDecoding::Update() into ScheduleUpdate() + FinishUpdate() to enable D2H copy of output_ids on a secondary CUDA stream, overlapping with AppendTokenIds and stop_criteria GPU kernels on the main stream.
- ScheduleUpdate(): records sampling_done event on main stream, launches async D2H copy on d2h_stream_ (waits for sampling_done first)
- FinishUpdate(): syncs on d2h_done event, then runs BatchAcceptToken on CPU
- Adds d2h_stream_, sampling_done_, d2h_done_ members (created once in ctor)
- Eliminates the blocking cudaStreamSynchronize that previously stalled the CPU between sampling and AcceptToken

This is optimization 5 (Plan I): independent CUDA stream for D2H copy parallelism, removing a sync point in the decode step hot path.

…lace_back

…nternal CMake dep

….2.1 API

The xgrammar v0.2.1 API changed accept_token signature to require a debug_print boolean argument. This commit:
- Adds debug_print=False to processor.accept_token() call in guided_process.py
- Ensures token is converted to plain Python int before passing to accept_token in logits_process.py to avoid type mismatches

Fixes unit test TypeError in test_mix_guided_matrix for PyTorch engine.

The batch API (BatchFillNextTokenBitmask, BatchAcceptToken) requires std::vector<GrammarMatcher>*. Creating temporary vectors with emplace_back(*m) makes shallow copies that share the same Impl object via shared_ptr. This leads to:
1. Race conditions if BatchGrammarMatcher uses multi-threading (max_threads > 1)
2. State inconsistency between FillMask and FinishUpdate using different temporary vectors

Solution: Revert to direct per-matcher calls:
- m->FillNextTokenBitmask(&dlbitmask, i) in FillMask
- m->AcceptToken(output_ids_buf_[i]) in FinishUpdate

Keep other optimizations: async D2H copy, generation_size limiting, etc.

Test results:
- Qwen/Qwen3-0.6B: all tests pass
- Qwen/Qwen3-VL-2B-Instruct: all tests pass (JSON/regex/json_object)
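
The shared-state hazard described in this commit message can be illustrated with a Python analogy. MatcherImpl and MatcherHandle are hypothetical; the handle holding a reference to its impl plays the role of a C++ object that wraps a shared_ptr<Impl>, and copy.copy plays the role of emplace_back(*m). This is not xgrammar code.

```python
import copy


class MatcherImpl:
    """Hypothetical state object, analogous to the shared Impl."""

    def __init__(self):
        self.accepted = []


class MatcherHandle:
    """Hypothetical handle that refers to shared state, like shared_ptr<Impl>."""

    def __init__(self):
        self.impl = MatcherImpl()

    def accept(self, token_id):
        self.impl.accepted.append(token_id)


original = MatcherHandle()
temp = copy.copy(original)  # shallow copy: a new handle, but the same impl
temp.accept(42)             # mutates state that `original` also sees
```

Because the temporary and the original point at one state object, two temporaries built at different points in the step can mutate the same matcher, which is the inconsistency the revert to per-matcher calls avoids.
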

windreamer force-pushed the feat/guided-decoding-optimization branch from 9a942b6 to 3fdff4a May 21, 2026 04:23

windreamer changed the title ~~perf: speed up guided decoding with xgrammar new version and batched update~~ perf: optimize guided decoding with xgrammar upgrade, batched API, and async D2H overlap May 21, 2026

windreamer marked this pull request as ready for review May 21, 2026 08:15

Copilot AI review requested due to automatic review settings May 21, 2026 08:15

Copilot started reviewing on behalf of windreamer May 21, 2026 08:15