perf(guided-decoding): optimize with async D2H copy and xgrammar v0.2.1 #4605

Open
windreamer wants to merge 8 commits into InternLM:main from windreamer:feat/guided-decoding-optimization

@windreamer windreamer commented May 21, 2026

Collaborator

Motivation

Improve guided decoding performance in TurboMind by reducing synchronization overhead and upgrading to xgrammar v0.2.1 for API compatibility.

Modification

This PR includes the following optimizations and fixes:

Performance Optimizations

  1. Async D2H Copy Overlap (main optimization):

    • Split GuidedDecoding::Update() into ScheduleUpdate() + FinishUpdate() to enable D2H copy overlap with GPU work
    • Added secondary CUDA stream (d2h_stream_) for asynchronous output_ids D2H transfer
    • Added CUDA events (sampling_done_, d2h_done_) for stream synchronization
    • D2H copy now overlaps with AppendTokenIds and stop_criteria GPU kernels on the main stream
    • Eliminates a blocking cudaStreamSynchronize in the decode step hot path
  2. Grammar Compiler Caching:

    • Lazy-initialized GrammarCompiler shared across all requests in TurboMindModel
    • Avoids recreating compiler + tokenizer info on every request
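
The ScheduleUpdate() + FinishUpdate() split in item 1 can be pictured with a CPU-only analogy. The sketch below is not the TurboMind code: a single-worker thread pool stands in for d2h_stream_, the returned future stands in for the d2h_done_ event, and every class and function name is hypothetical. It only shows the shape of the idea, namely starting the copy, doing other work, and waiting as late as possible.

```python
from concurrent.futures import ThreadPoolExecutor


class TwoPhaseUpdate:
    """CPU-only analogy: a worker thread plays the role of the copy stream."""

    def __init__(self):
        # One worker keeps copies ordered, like work queued on a single stream.
        self._copy_worker = ThreadPoolExecutor(max_workers=1)
        self._pending = None

    def schedule_update(self, device_ids):
        # Start the "device to host" copy and return without waiting for it.
        self._pending = self._copy_worker.submit(list, device_ids)

    def finish_update(self, accept_token):
        # Wait only for the copy, then consume the tokens on the host.
        host_ids = self._pending.result()
        self._pending = None
        for slot, token in enumerate(host_ids):
            accept_token(slot, token)
        return host_ids

    def close(self):
        self._copy_worker.shutdown()


def run_step(updater, sampled_ids, other_work, accept_token):
    updater.schedule_update(sampled_ids)  # copy starts in the background
    other_work()                          # overlaps with the copy
    return updater.finish_update(accept_token)
```

In this analogy, other_work corresponds to the AppendTokenIds and stop_criteria kernels that keep running on the main stream while the copy is in flight.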

Correctness Fixes

  1. Limit Iteration to Active Generation Slots:
    • Limited iteration in FillMask, ScheduleUpdate, and FinishUpdate to generation_size (= logits.shape(0))
    • Previous code iterated over all matchers.size() entries, including idle/prefill slots with stale output_ids
    • Prevents grammar state from being advanced with stale token IDs from slots that did not sample a token this step
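
A minimal sketch of the slot-limiting fix, assuming plain Python lists in place of the real matcher objects and the output_ids buffer. The function name and the use of list.append as a stand-in for AcceptToken are illustrative only.

```python
def accept_sampled_tokens(matchers, output_ids, generation_size):
    """Advance only the matchers whose slots sampled a token this step."""
    accepted = []
    for i in range(min(generation_size, len(matchers))):
        matcher = matchers[i]
        if matcher is None:            # slot without guided decoding
            continue
        matcher.append(output_ids[i])  # hypothetical stand-in for AcceptToken
        accepted.append(i)
    return accepted


# Three slots exist, but only the first two sampled a token this step.
matchers = [[], None, []]
output_ids = [11, 22, 999]             # 999 is a stale value left in slot 2
accepted = accept_sampled_tokens(matchers, output_ids, generation_size=2)
```

Iterating over len(matchers) instead would feed the stale 999 to the third matcher, which is the failure mode this fix targets.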

Dependency Updates

  1. xgrammar v0.2.1 Upgrade:
    • Updated xgrammar from v0.1.27 to v0.2.1 in CMakeLists.txt
    • Added the debug_print=False argument, now required by the v0.2.1 API, to the accept_token() call
    • Converted the sampled token to a plain Python int before passing it to accept_token()

Build System

  1. CMake Dependency Cleanup:
    • Changed xgrammar and core from PRIVATE to PUBLIC linkage in guided_decoding library
    • Ensures proper transitive dependency resolution for downstream targets

BC-breaking (Optional)

No backward compatibility issues. The API changes are internal to the guided decoding implementations of the TurboMind and PyTorch engines.

Use cases (Optional)

This optimization benefits all users of structured generation features:

  • JSON schema constrained generation
  • Regex pattern constrained generation
  • JSON object generation

Example usage:

from lmdeploy import pipeline, GenerationConfig

pipe = pipeline('Qwen/Qwen2.5-7B-Instruct')
response = pipe(
    'Generate a user profile',
    gen_config=GenerationConfig(
        response_format={'type': 'json_schema', 'json_schema': {...}}
    )
)

Checklist

  1. ✅ Pre-commit or other linting tools are used to fix the potential lint issues.
  2. ✅ The modification is covered by complete unit tests. All tests pass:
    • Qwen/Qwen3-0.6B: all tests pass
    • Qwen/Qwen3-VL-2B-Instruct: all tests pass (JSON schema, regex schema, json_object)
  3. ✅ No dependency on downstream projects with version changes.
  4. ✅ Documentation not required for internal optimization changes.

Test Results

All guided decoding tests pass with the new implementation:

  • JSON schema constrained generation: ✅
  • Regex pattern constrained generation: ✅
  • JSON object generation: ✅

@windreamer windreamer force-pushed the feat/guided-decoding-optimization branch from 9a942b6 to 3fdff4a Compare May 21, 2026 04:23
@windreamer windreamer changed the title from "perf: speed up guided decoding with xgrammar new version and batched update" to "perf: optimize guided decoding with xgrammar upgrade, batched API, and async D2H overlap" May 21, 2026
@windreamer windreamer marked this pull request as ready for review May 21, 2026 08:15
Copilot AI review requested due to automatic review settings May 21, 2026 08:15

Copilot AI left a comment

Contributor

Pull request overview

This PR optimizes guided decoding performance in TurboMind by upgrading xgrammar and refactoring guided-decoding paths to use batched matcher APIs plus CUDA stream/event orchestration to overlap host-device transfers with GPU work. It also reduces Python-side overhead by reusing a lazily constructed GrammarCompiler and fixes a PyTorch guided-decoding type mismatch introduced by the xgrammar upgrade.

Changes:

  • Upgrade xgrammar to v0.2.1 and switch C++ guided decoding to batched matcher APIs (BatchFillNextTokenBitmask / BatchAcceptToken).
  • Overlap output_ids D2H copies with GPU kernels via a secondary CUDA stream and split guided decoding update into ScheduleUpdate + FinishUpdate.
  • Cache GrammarCompiler per TurboMind instance (lazy init) and fix PyTorch accept_token to pass a Python int via .item().

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Summary per file:

  • src/turbomind/generation/guided_decoding.h: Adds batched matcher + CUDA stream/event members; splits Update into two phases.
  • src/turbomind/generation/guided_decoding.cc: Implements batched xgrammar calls and async D2H overlap using events/streams; adds needs_apply gating.
  • src/turbomind/generation/generation.cc: Integrates ScheduleUpdate/FinishUpdate around AppendTokenIds and stop_criteria to enable overlap.
  • src/turbomind/generation/CMakeLists.txt: Exposes xgrammar/core linkage publicly for guided_decoding consumers.
  • lmdeploy/turbomind/turbomind.py: Introduces lazy-shared GrammarCompiler and removes per-request instantiation.
  • lmdeploy/pytorch/engine/logits_process.py: Passes token id as Python int (.item()) to guided decoding manager.
  • CMakeLists.txt: Bumps FetchContent xgrammar tag to v0.2.1.

Comments suppressed due to low confidence (1)

src/turbomind/generation/guided_decoding.cc:135

  • Similarly, FinishUpdate() allocates active_matchers and active_token_ids every step without reserving. Reserving (or persisting these vectors in the phase Data) would reduce per-token allocation overhead, especially for large batch sizes.
            // Collect active matchers and their token IDs for batch AcceptToken
            std::vector<xgrammar::GrammarMatcher> active_matchers;
            std::vector<int32_t>                  active_token_ids;


Comment thread src/turbomind/generation/guided_decoding.cc Outdated
Comment thread src/turbomind/generation/guided_decoding.cc Outdated

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (2)

src/turbomind/generation/guided_decoding.cc:129

  • FinishUpdate() calls d2h_done_.Sync() on all TP ranks even though only rank 0 performs BatchAcceptToken. On non-zero ranks this host-side wait is pure overhead (and can also introduce unnecessary CPU/GPU synchronization points). Consider moving the sync + matcher update under the tp_group_->rank() == 0 branch, or early-returning for non-zero ranks.
    if (auto& d = *data_.at(phase); d.active) {
        // Wait only for the D2H copy to complete — the main stream's
        // AppendTokenIds + stop_criteria may still be executing on GPU.
        d2h_done_.Sync();

        if (tp_group_->rank() == 0) {

lmdeploy/pytorch/engine/logits_process.py:484

  • The guided-decoding accept_token call site changed to pass a Python int, but there is no unit test in tests/pytorch/engine/test_logits_process.py covering the guided_decoding_manager integration path (e.g., that accept_token is invoked with the expected token values/types). Adding a small test with a stub GuidedDecodingManager would help prevent regressions when sampling runs on CUDA tensors.
        if self.guided_decoding_manager and self.guided_processors:
            for i, processor in self.guided_processors.items():
                self.guided_decoding_manager.accept_token(processor, result[i].item())
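
A torch-free sketch of the kind of stub-based test this comment suggests. Everything here is hypothetical: FakeScalar stands in for a 0-d tensor and models only .item(), StubGuidedDecodingManager merely records calls, and accept_sampled mirrors the quoted call site rather than importing the real logits processor.

```python
class FakeScalar:
    """Hypothetical stand-in for a 0-d tensor; only .item() is modeled."""

    def __init__(self, value):
        self._value = value

    def item(self):
        return int(self._value)


class StubGuidedDecodingManager:
    """Records every accept_token call so a test can inspect it."""

    def __init__(self):
        self.calls = []

    def accept_token(self, processor, token):
        self.calls.append((processor, token))


def accept_sampled(manager, guided_processors, result):
    # Mirrors the call site quoted above: one accept_token per guided slot.
    for i, processor in guided_processors.items():
        manager.accept_token(processor, result[i].item())


manager = StubGuidedDecodingManager()
result = [FakeScalar(3), FakeScalar(8), FakeScalar(5)]
accept_sampled(manager, {0: "proc-a", 2: "proc-b"}, result)
```

A real test would replace FakeScalar with actual sampled tensors and assert on both the recorded token values and their type.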

Comment thread src/turbomind/generation/guided_decoding.cc
Comment thread lmdeploy/pytorch/engine/logits_process.py Outdated
@windreamer windreamer force-pushed the feat/guided-decoding-optimization branch from b5a3678 to 49c617f Compare May 21, 2026 09:15
@windreamer windreamer requested a review from Copilot May 21, 2026 09:29

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

src/turbomind/generation/guided_decoding.cc:139

  • FinishUpdate() iterates over all d.matchers and calls batch AcceptToken using output_ids_buf_[i] for every non-terminated matcher. However, only the first generation_size sequences receive a newly sampled token each step; for the remaining slots output_ids_buf_ may contain stale values (since sampling runs with batch_size = logits.shape(0)). This can advance grammar state incorrectly for sequences that were not generating this step. Limit the loop to generation_size (saved from ScheduleUpdate()), or gate on the per-request generating mask.
    if (auto& d = *data_.at(phase); d.active && tp_group_->rank() == 0) {
        // Wait only for the D2H copy to complete — the main stream's
        // AppendTokenIds + stop_criteria may still be executing on GPU.
        d2h_done_.Sync();

        // Collect active matchers and their token IDs for batch AcceptToken
        std::vector<xgrammar::GrammarMatcher> active_matchers;
        std::vector<int32_t>                  active_token_ids;
        active_matchers.reserve(d.matchers.size());
        active_token_ids.reserve(d.matchers.size());

        for (size_t i = 0; i < d.matchers.size(); ++i) {
            if (const auto& m = d.matchers[i]; m && !m->IsTerminated()) {
                active_matchers.emplace_back(*m);
                active_token_ids.emplace_back(output_ids_buf_[i]);
            }

Comment thread src/turbomind/generation/guided_decoding.cc
Comment thread src/turbomind/generation/guided_decoding.cc Outdated
windreamer added a commit to windreamer/lmdeploy that referenced this pull request May 21, 2026
FillMask, ScheduleUpdate, and FinishUpdate previously iterated over
d.matchers.size() entries, but only the first generation_size
(= logits.shape(0)) slots are actively generating. Entries beyond
that index contain stale output_ids and unused bitmasks.

- FillMask: limit matcher iteration and reserve to gs = logits.shape(0)
- ScheduleUpdate: copy only gs output_ids entries for D2H transfer
- FinishUpdate: add TensorMap& env param, iterate only over gs slots

Fixes review comments on PR InternLM#4605 (3280137130, 3280137198).
@windreamer windreamer requested a review from Copilot May 21, 2026 10:14

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Comment thread src/turbomind/generation/guided_decoding.cc Outdated
Comment thread src/turbomind/generation/CMakeLists.txt Outdated

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.

windreamer added a commit to windreamer/lmdeploy that referenced this pull request May 22, 2026
@windreamer windreamer force-pushed the feat/guided-decoding-optimization branch from ea5c50b to e24c147 Compare May 22, 2026 01:39
windreamer added a commit to windreamer/lmdeploy that referenced this pull request Jun 5, 2026
@windreamer windreamer force-pushed the feat/guided-decoding-optimization branch from e24c147 to 33563fe Compare June 5, 2026 02:39
windreamer added a commit to windreamer/lmdeploy that referenced this pull request Jun 5, 2026
@windreamer windreamer force-pushed the feat/guided-decoding-optimization branch from 33563fe to ba4efab Compare June 5, 2026 02:45
@windreamer windreamer marked this pull request as draft June 9, 2026 01:12
windreamer added a commit to windreamer/lmdeploy that referenced this pull request Jun 9, 2026
@windreamer windreamer force-pushed the feat/guided-decoding-optimization branch from ba4efab to 84a90a2 Compare June 9, 2026 02:41
@windreamer windreamer marked this pull request as ready for review June 9, 2026 04:00
@lvhan028 lvhan028 requested review from irexyc and lzhangzz June 25, 2026 03:01
windreamer added a commit to windreamer/lmdeploy that referenced this pull request Jul 1, 2026
@windreamer windreamer force-pushed the feat/guided-decoding-optimization branch from 84a90a2 to e17e281 Compare July 1, 2026 07:18
@windreamer windreamer marked this pull request as draft July 2, 2026 06:57
… CUDA stream

Split GuidedDecoding::Update() into ScheduleUpdate() + FinishUpdate()
to enable D2H copy of output_ids on a secondary CUDA stream, overlapping
with AppendTokenIds and stop_criteria GPU kernels on the main stream.

- ScheduleUpdate(): records sampling_done event on main stream, launches
  async D2H copy on d2h_stream_ (waits for sampling_done first)
- FinishUpdate(): syncs on d2h_done event, then runs BatchAcceptToken on CPU
- Adds d2h_stream_, sampling_done_, d2h_done_ members (created once in ctor)
- Eliminates the blocking cudaStreamSynchronize that previously stalled the
  CPU between sampling and AcceptToken

This is optimization 5 (Plan I): independent CUDA stream for D2H copy
parallelism, removing a sync point in the decode step hot path.

FillMask, ScheduleUpdate, and FinishUpdate previously iterated over
d.matchers.size() entries, but only the first generation_size
(= logits.shape(0)) slots are actively generating. Entries beyond
that index contain stale output_ids and unused bitmasks.

- FillMask: limit matcher iteration and reserve to gs = logits.shape(0)
- ScheduleUpdate: copy only gs output_ids entries for D2H transfer
- FinishUpdate: add TensorMap& env param, iterate only over gs slots

Fixes review comments on PR InternLM#4605 (3280137130, 3280137198).

….2.1 API

The xgrammar v0.2.1 API changed accept_token signature to require a
debug_print boolean argument. This commit:
- Adds debug_print=False to processor.accept_token() call in guided_process.py
- Ensures token is converted to plain Python int before passing to accept_token
  in logits_process.py to avoid type mismatches

Fixes unit test TypeError in test_mix_guided_matrix for PyTorch engine.
@windreamer windreamer force-pushed the feat/guided-decoding-optimization branch from a0f11aa to 3636004 Compare July 3, 2026 07:03
@windreamer windreamer marked this pull request as ready for review July 3, 2026 07:03
The batch API (BatchFillNextTokenBitmask, BatchAcceptToken) requires std::vector<GrammarMatcher>*.
Creating temporary vectors with emplace_back(*m) makes shallow copies that share the
same Impl object via shared_ptr. This leads to:

1. Race conditions if BatchGrammarMatcher uses multi-threading (max_threads > 1)
2. State inconsistency between FillMask and FinishUpdate using different temporary vectors

Solution: Revert to direct per-matcher calls:
- m->FillNextTokenBitmask(&dlbitmask, i) in FillMask
- m->AcceptToken(output_ids_buf_[i]) in FinishUpdate

Keep other optimizations: async D2H copy, generation_size limiting, etc.
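
The shared-state hazard described in this commit message can be shown with a small Python analogy. This is not xgrammar code: MatcherHandle and MatcherImpl are hypothetical classes, and copy.copy plays the role of copying a handle whose implementation is held through a shared pointer.

```python
import copy


class MatcherImpl:
    """Holds the mutable grammar state."""

    def __init__(self):
        self.accepted = []


class MatcherHandle:
    """Thin handle; copies of it point at the same MatcherImpl."""

    def __init__(self, impl=None):
        self.impl = impl if impl is not None else MatcherImpl()

    def accept_token(self, token):
        self.impl.accepted.append(token)


original = MatcherHandle()
temporary = copy.copy(original)  # shallow copy: a new handle, the same impl
temporary.accept_token(42)       # the update is visible through `original`
```

Because both handles mutate one implementation object, any concurrent use of the temporary copies would touch the same state, which is why the commit falls back to calling each matcher directly.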

Test results:
- Qwen/Qwen3-0.6B: all tests pass
- Qwen/Qwen3-VL-2B-Instruct: all tests pass (JSON/regex/json_object)
@windreamer windreamer changed the title from "perf: optimize guided decoding with xgrammar upgrade, batched API, and async D2H overlap" to "fix: revert batch API in guided decoding to avoid matcher copy issues" Jul 3, 2026
@windreamer windreamer changed the title from "fix: revert batch API in guided decoding to avoid matcher copy issues" to "fix(guided-decoding): revert batch API to avoid matcher copy issues, keep async optimizations" Jul 3, 2026
@windreamer windreamer changed the title from "fix(guided-decoding): revert batch API to avoid matcher copy issues, keep async optimizations" to "perf(guided-decoding): optimize with async D2H copy and xgrammar v0.2.1" Jul 3, 2026
3 participants