perf(guided-decoding): optimize with async D2H copy and xgrammar v0.2.1 #4605
windreamer wants to merge 8 commits
Pull request overview
This PR optimizes guided decoding performance in TurboMind by upgrading xgrammar and refactoring guided-decoding paths to use batched matcher APIs plus CUDA stream/event orchestration to overlap host-device transfers with GPU work. It also reduces Python-side overhead by reusing a lazily constructed GrammarCompiler and fixes a PyTorch guided-decoding type mismatch introduced by the xgrammar upgrade.
Changes:
- Upgrade xgrammar to v0.2.1 and switch C++ guided decoding to batched matcher APIs (BatchFillNextTokenBitmask / BatchAcceptToken).
- Overlap output_ids D2H copies with GPU kernels via a secondary CUDA stream and split guided decoding update into ScheduleUpdate + FinishUpdate.
- Cache GrammarCompiler per TurboMind instance (lazy init) and fix PyTorch accept_token to pass a Python int via .item().
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/turbomind/generation/guided_decoding.h | Adds batched matcher + CUDA stream/event members; splits Update into two phases. |
| src/turbomind/generation/guided_decoding.cc | Implements batched xgrammar calls and async D2H overlap using events/streams; adds needs_apply gating. |
| src/turbomind/generation/generation.cc | Integrates ScheduleUpdate/FinishUpdate around AppendTokenIds and stop_criteria to enable overlap. |
| src/turbomind/generation/CMakeLists.txt | Exposes xgrammar/core linkage publicly for guided_decoding consumers. |
| lmdeploy/turbomind/turbomind.py | Introduces lazy-shared GrammarCompiler and removes per-request instantiation. |
| lmdeploy/pytorch/engine/logits_process.py | Passes token id as Python int (.item()) to guided decoding manager. |
| CMakeLists.txt | Bumps FetchContent xgrammar tag to v0.2.1. |
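The lazily built, shared GrammarCompiler mentioned for lmdeploy/turbomind/turbomind.py can be illustrated with a minimal sketch. FakeGrammarCompiler and Engine are hypothetical stand-ins, not lmdeploy or xgrammar classes; the point is only that the compiler is constructed on first use and then reused by every request.

```python
import threading


class FakeGrammarCompiler:
    """Hypothetical stand-in for an expensive-to-build grammar compiler."""

    instances = 0  # counts how many compilers were constructed

    def __init__(self, vocab_size):
        FakeGrammarCompiler.instances += 1
        self.vocab_size = vocab_size

    def compile(self, schema):
        return f"compiled({schema}, vocab={self.vocab_size})"


class Engine:
    """Hypothetical engine that builds its compiler lazily and reuses it."""

    def __init__(self, vocab_size):
        self._vocab_size = vocab_size
        self._compiler = None
        self._lock = threading.Lock()

    @property
    def grammar_compiler(self):
        # Double-checked locking: build once, on first use, even under threads.
        if self._compiler is None:
            with self._lock:
                if self._compiler is None:
                    self._compiler = FakeGrammarCompiler(self._vocab_size)
        return self._compiler

    def handle_request(self, schema):
        # Every request goes through the same cached compiler.
        return self.grammar_compiler.compile(schema)


engine = Engine(vocab_size=32000)
results = [engine.handle_request(s) for s in ("json", "regex", "json_object")]
```

Three requests are served here, but only one compiler instance is ever created.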
Comments suppressed due to low confidence (1)
src/turbomind/generation/guided_decoding.cc:135
- Similarly, FinishUpdate() allocates active_matchers and active_token_ids every step without reserving. Reserving (or persisting these vectors in the phase Data) would reduce per-token allocation overhead, especially for large batch sizes.
// Collect active matchers and their token IDs for batch AcceptToken
std::vector<xgrammar::GrammarMatcher> active_matchers;
std::vector<int32_t> active_token_ids;
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (2)
src/turbomind/generation/guided_decoding.cc:129
FinishUpdate() calls d2h_done_.Sync() on all TP ranks even though only rank 0 performs BatchAcceptToken. On non-zero ranks this host-side wait is pure overhead (and can also introduce unnecessary CPU/GPU synchronization points). Consider moving the sync + matcher update under the tp_group_->rank() == 0 branch, or early-returning for non-zero ranks.
if (auto& d = *data_.at(phase); d.active) {
// Wait only for the D2H copy to complete — the main stream's
// AppendTokenIds + stop_criteria may still be executing on GPU.
d2h_done_.Sync();
if (tp_group_->rank() == 0) {
lmdeploy/pytorch/engine/logits_process.py:484
- The guided-decoding accept_token call site changed to pass a Python int, but there is no unit test in tests/pytorch/engine/test_logits_process.py covering the guided_decoding_manager integration path (e.g., that accept_token is invoked with the expected token values/types). Adding a small test with a stub GuidedDecodingManager would help prevent regressions when sampling runs on CUDA tensors.
if self.guided_decoding_manager and self.guided_processors:
for i, processor in self.guided_processors.items():
self.guided_decoding_manager.accept_token(processor, result[i].item())
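A stub-based test along the lines suggested in this comment might look like the sketch below. FakeScalar, StubGuidedDecodingManager, and accept_sampled_tokens are hypothetical names introduced for illustration; FakeScalar only imitates the .item() method of a 0-d tensor so that the sketch runs without PyTorch.

```python
class FakeScalar:
    """Hypothetical stand-in for a 0-d tensor: only .item() is modelled."""

    def __init__(self, value):
        self._value = value

    def item(self):
        return int(self._value)


class StubGuidedDecodingManager:
    """Records every accept_token call so a test can inspect it."""

    def __init__(self):
        self.calls = []

    def accept_token(self, processor, token):
        self.calls.append((processor, token))


def accept_sampled_tokens(manager, guided_processors, result):
    # Mirrors the call site quoted above: pass a plain int via .item().
    if manager and guided_processors:
        for i, processor in guided_processors.items():
            manager.accept_token(processor, result[i].item())


manager = StubGuidedDecodingManager()
processors = {0: "proc-a", 2: "proc-c"}  # batch slot -> processor
sampled = [FakeScalar(11), FakeScalar(22), FakeScalar(33)]
accept_sampled_tokens(manager, processors, sampled)
```

A real test would replace FakeScalar with tensors produced by the sampler and assert on the recorded calls in the same way.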
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
src/turbomind/generation/guided_decoding.cc:139
FinishUpdate() iterates over all d.matchers and calls batch AcceptToken using output_ids_buf_[i] for every non-terminated matcher. However, only the first generation_size sequences receive a newly sampled token each step; for the remaining slots output_ids_buf_ may contain stale values (since sampling runs with batch_size = logits.shape(0)). This can advance grammar state incorrectly for sequences that were not generating this step. Limit the loop to generation_size (saved from ScheduleUpdate()), or gate on the per-request generating mask.
if (auto& d = *data_.at(phase); d.active && tp_group_->rank() == 0) {
// Wait only for the D2H copy to complete — the main stream's
// AppendTokenIds + stop_criteria may still be executing on GPU.
d2h_done_.Sync();
// Collect active matchers and their token IDs for batch AcceptToken
std::vector<xgrammar::GrammarMatcher> active_matchers;
std::vector<int32_t> active_token_ids;
active_matchers.reserve(d.matchers.size());
active_token_ids.reserve(d.matchers.size());
for (size_t i = 0; i < d.matchers.size(); ++i) {
if (const auto& m = d.matchers[i]; m && !m->IsTerminated()) {
active_matchers.emplace_back(*m);
active_token_ids.emplace_back(output_ids_buf_[i]);
}
FillMask, ScheduleUpdate, and FinishUpdate previously iterated over d.matchers.size() entries, but only the first generation_size (= logits.shape(0)) slots are actively generating. Entries beyond that index contain stale output_ids and unused bitmasks.

- FillMask: limit matcher iteration and reserve to gs = logits.shape(0)
- ScheduleUpdate: copy only gs output_ids entries for D2H transfer
- FinishUpdate: add TensorMap& env param, iterate only over gs slots

Fixes review comments on PR InternLM#4605 (3280137130, 3280137198).
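The effect of limiting iteration to generation_size can be sketched in plain Python. ToyMatcher and finish_update are hypothetical and only model the loop bound; they are not the TurboMind implementation.

```python
class ToyMatcher:
    """Hypothetical matcher that just records accepted tokens."""

    def __init__(self):
        self.accepted = []
        self.terminated = False

    def accept_token(self, token):
        self.accepted.append(token)


def finish_update(matchers, output_ids, generation_size):
    # Only the first generation_size slots received a fresh token this step;
    # later slots may still hold values from an earlier step.
    for i in range(min(generation_size, len(matchers))):
        m = matchers[i]
        if m is not None and not m.terminated:
            m.accept_token(output_ids[i])


matchers = [ToyMatcher(), None, ToyMatcher(), ToyMatcher()]
output_ids = [101, 102, 103, 999]  # slot 3 holds a stale value
finish_update(matchers, output_ids, generation_size=3)
```

With the bound in place, the matcher in slot 3 never sees the stale value 999, so its grammar state is not advanced by a token it did not generate.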
… CUDA stream

Split GuidedDecoding::Update() into ScheduleUpdate() + FinishUpdate() to enable D2H copy of output_ids on a secondary CUDA stream, overlapping with AppendTokenIds and stop_criteria GPU kernels on the main stream.

- ScheduleUpdate(): records sampling_done event on main stream, launches async D2H copy on d2h_stream_ (waits for sampling_done first)
- FinishUpdate(): syncs on d2h_done event, then runs BatchAcceptToken on CPU
- Adds d2h_stream_, sampling_done_, d2h_done_ members (created once in ctor)
- Eliminates the blocking cudaStreamSynchronize that previously stalled the CPU between sampling and AcceptToken

This is optimization 5 (Plan I): independent CUDA stream for D2H copy parallelism, removing a sync point in the decode step hot path.
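The schedule/finish split can be pictured with a host-only analogy. This sketch uses no CUDA: a single worker thread plays the role of the secondary copy stream, and the future returned by submit plays the role of the d2h_done event. TwoPhaseUpdate and its methods are hypothetical names chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


class TwoPhaseUpdate:
    """Host-side analogy: a worker thread plays the role of the copy stream."""

    def __init__(self):
        self._copy_worker = ThreadPoolExecutor(max_workers=1)
        self._pending = None
        self.accepted = []

    def schedule_update(self, device_output_ids):
        # Start the copy in the background and return immediately.
        self._pending = self._copy_worker.submit(list, device_output_ids)

    def finish_update(self):
        # Block only on the copy itself, then consume the tokens on the host.
        host_ids = self._pending.result()
        self._pending = None
        self.accepted.extend(host_ids)

    def close(self):
        self._copy_worker.shutdown()


step = TwoPhaseUpdate()
step.schedule_update((7, 8, 9))
other_work = sum(range(1000))  # stands in for work that overlaps the copy
step.finish_update()
step.close()
```

The caller is free to do other work between the two calls, which is the property the commit relies on to hide the copy latency.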
…nternal CMake dep
….2.1 API

The xgrammar v0.2.1 API changed accept_token signature to require a debug_print boolean argument. This commit:

- Adds debug_print=False to processor.accept_token() call in guided_process.py
- Ensures token is converted to plain Python int before passing to accept_token in logits_process.py to avoid type mismatches

Fixes unit test TypeError in test_mix_guided_matrix for PyTorch engine.
The batch API (BatchFillNextTokenBitmask, BatchAcceptToken) requires std::vector<GrammarMatcher>*. Creating temporary vectors with emplace_back(*m) makes shallow copies that share the same Impl object via shared_ptr. This leads to:

1. Race conditions if BatchGrammarMatcher uses multi-threading (max_threads > 1)
2. State inconsistency between FillMask and FinishUpdate using different temporary vectors

Solution: Revert to direct per-matcher calls:

- m->FillNextTokenBitmask(&dlbitmask, i) in FillMask
- m->AcceptToken(output_ids_buf_[i]) in FinishUpdate

Keep other optimizations: async D2H copy, generation_size limiting, etc.

Test results:

- Qwen/Qwen3-0.6B: all tests pass
- Qwen/Qwen3-VL-2B-Instruct: all tests pass (JSON/regex/json_object)
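The shallow-copy behaviour this commit describes can be shown with a small Python analogy. MatcherImpl and MatcherHandle are hypothetical classes, not xgrammar types; the attribute reference stands in for the shared_ptr to Impl.

```python
import copy


class MatcherImpl:
    """Hypothetical mutable state behind a matcher handle."""

    def __init__(self):
        self.accepted = []


class MatcherHandle:
    """Thin handle; copying it copies the reference, not the state."""

    def __init__(self):
        self.impl = MatcherImpl()

    def accept_token(self, token):
        self.impl.accepted.append(token)


original = MatcherHandle()
temporary = copy.copy(original)  # shallow copy: same MatcherImpl underneath

temporary.accept_token(42)  # mutating through the copy...
shared = temporary.impl is original.impl  # ...also changes the original
```

Because both handles point at one state object, any concurrent use of the copies touches the same data, which is why the commit falls back to calling each matcher directly.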
Motivation
Improve guided decoding performance in TurboMind by reducing synchronization overhead and upgrading to xgrammar v0.2.1 for API compatibility.
Modification
This PR includes the following optimizations and fixes:
Performance Optimizations
Async D2H Copy Overlap (main optimization):
- Split GuidedDecoding::Update() into ScheduleUpdate() + FinishUpdate() to enable D2H copy overlap with GPU work
- Add a secondary CUDA stream (d2h_stream_) for asynchronous output_ids D2H transfer
- Use CUDA events (sampling_done_, d2h_done_) for stream synchronization
- Overlap the D2H copy with AppendTokenIds and stop_criteria GPU kernels on the main stream
- Eliminate the blocking cudaStreamSynchronize in the decode step hot path
Grammar Compiler Caching:
- Lazily initialized GrammarCompiler shared across all requests in TurboMindModel
Correctness Fixes
- Limit iteration in FillMask, ScheduleUpdate, and FinishUpdate to generation_size (= logits.shape(0))
- Previously these iterated over all matchers.size() entries, including idle/prefill slots with stale output_ids
Dependency Updates
- Bump xgrammar to v0.2.1 in CMakeLists.txt
- Add the debug_print=False argument to the accept_token() API call
- Convert the token to a Python int before passing it to accept_token()
Build System
- Change xgrammar and core from PRIVATE to PUBLIC linkage in the guided_decoding library
BC-breaking (Optional)
No backward compatibility issues. The API changes are internal to the TurboMind engine and PyTorch engine's guided decoding implementation.
Use cases (Optional)
This optimization benefits all users of structured generation features:
Example usage:
Checklist
Test Results
All guided decoding tests pass with the new implementation: