ds4_agent: fix rocm support#383
Open
mcmalayalam wants to merge 21 commits into
Open
Conversation
Add prompt_ms, prompt_per_second, predicted_ms, predicted_per_second timing fields to streaming and non-streaming OpenAI API responses. - Define server_timings struct with prompt/decode timing fields - Extend server_prefill_progress with prompt_ms/prompt_per_second - Store final prefill timing in server_progress_cb() - Add append_openai_timings_json() helper - Thread timings through sse_usage_chunk, sse_done, sse_chat_finish, openai_sse_finish_live, and final_response - Capture prefill timing after sync completes in generate_job() - Compute decode timing after decode loop and build timings struct - Emit timings in streaming SSE usage chunk and non-streaming JSON body - Add test_openai_final_response_reports_timings() for non-streaming - Update test_openai_stream_usage_reports_cache_details to verify timings - Fix existing openai_sse_finish_live test calls to pass NULL timings
Track routed-expert cache entries referenced by encoded Metal command buffers and prevent cache eviction or buffer reuse until the owning command buffers have completed. This preserves layer-batched decode while avoiding stale buffer captures under SSD-streaming cache pressure. Add a focused Metal SSD-streaming cache-pressure regression for issue antirez#384.
Avoid eager ROCm model copies for range/span startup. Distributed layer slices can be sparse, so materializing the full mapped image or the min..max envelope can allocate far more memory than the selected tensors need. Keep the map metadata and let accelerator_cache_model_tensors() prepare the exact tensor spans instead.\n\nFixes antirez#387.
…ants)
GGUFs whose routed experts are not uniformly quantized across layers (e.g.
IQ2_XXS/Q2_K everywhere with a few layers upcast to Q4_K via --tensor-type)
failed every request under --ssd-streaming: the batch-selected-addr prefill
maps layers with the decode-static span set (no exps tensors), the Metal
encoder's IQ2-only gate sends non-IQ2 layers to the full-tensor
wrap_model_range fallback, and no installed view covers the range
("Metal model range ... is not covered by mapped model views",
"gpu layer N ffn batch encode failed"). Latently, the expert cache's
single-size-class slab allocator would also be poisoned by off-size
experts (last-writer-wins note_expert_size; budget distortion; potential
slab-reuse starvation after an mlock cap).
Fix, in two pieces:
- Detect boosted layers (per-expert bytes != the slab class derived from
the first routed layer; weights_streaming_layer_experts_uniform) and
include their ffn_{gate,up,down}_exps tensors in the decode-span
builders (model_map_span_vec_include_layer_decode), so the existing
wrap_model_range / wrap_model_exact_range fallbacks find their ranges
covered for both prefill and decode.
- Pre-seed the expert-cache slab size class at startup
(ds4_gpu_set_streaming_expert_cache_expert_bytes; no-op stubs for
CUDA/ROCm) and make note_expert_size freeze+reject instead of
last-writer-wins: off-size layers deterministically bypass the cache
and use the mapped-view path. Startup logs the uniform/boosted split
and warns if the majority of routed layers are off-class.
Uniform models are byte-identical by construction (helper returns true
for every layer -> no extra spans; the pre-seed equals what the first
note call would have written) and verified by greedy replay of recorded
benchmark answers. Validated on DeepSeek-V4-Flash 97GB 6-layer-Q4_K
boost on a 64GB machine (suite of code benchmarks, 0 transport errors);
diagnosis confirmed end-to-end via the legacy-path env escape on the
unpatched binary. Details, validation table and reviewer notes in
STREAMING_MIXED_PRECISION.md.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
(cherry picked from commit fefa426)
Share the routed expert byte-size calculation between the base streaming cache sizing and per-layer mixed-precision checks, preserving overflow handling in both paths. Remove the standalone mixed-streaming design note, since this behavior is an implementation detail rather than user documentation.
Document the release validation pass across Metal, CUDA, ROCm, SSD streaming, distributed inference, disk KV cache, server APIs, and ds4-agent. Include the mixed-quant Flash SSD streaming regression check for boosted Q4 routed-expert layers.
Add ROCm SSD streaming support for PRO/Flash paths, including selected expert caching, overlapped reads, full-layer streaming prefill, resident cache lookups, and ROCm cache reuse handling. Add the corresponding CUDA SSD streaming expert cache backend and shared graph integration, with bounded cache allocation and progress reporting.\n\nThis squashes the rocm-streaming branch work on top of current main.
Track live Metal tensor handles under a mutex so stale or duplicate opaque tensor frees do not hand an already-released Objective-C object to ARC. Serialize the runtime tensor allocation counters through the same lock, including memory reports and cleanup. Closes antirez#404
Do not default ROCm builds to an 8192-token prefill workspace for long Flash prompts. On Strix Halo, full IQ2 weight caching plus that workspace can exhaust HSA allocation space before generation starts; the normal Flash default of 4096 leaves enough room while preserving explicit --prefill-chunk overrides. Closes antirez#387
Let ROCm tensor range lookup use existing prepared ranges and fd-backed caching before falling back to uncached copies when a request comes from a non-current model mmap. This lets the main GGUF and the MTP GGUF coexist without forcing the main model through the unlimited copy path. When a second model mmap is registered, disable the optional expanded Q8/F16 cache on ROCm. The cache is only an acceleration path and can consume the memory margin needed for MTP session/context tensors on Strix Halo UMA systems; normal Q8 kernels remain available. Closes antirez#425
Guard dynamic buffer growth against capacity wraparound, normalize malformed surrogate escapes to replacement characters without consuming unrelated escapes, and add a tool-heavy OpenCode-shaped parser stress test for issue antirez#405. Closes antirez#405.
Use a const pointer for stop-list substring matches so GCC/ROCm builds remain warning-clean.
Two model-free correctness fixes in the answer grader, locked by new
--self-test-extractors cases (no model/GPU needed):
- find_integer_answer: the answer-marker path took the first integer in
the window, so an answer line that shows its arithmetic
("Answer: m+n = 256+37 = 293") was graded as the first summand (256)
instead of the stated total (293) -> false negative. Reachable on the
many embedded AIME2025 "Find m+n / a+b+c" cases. The scan is now bound
to the answer line and, when it shows arithmetic, reads the value after
the last '='. "Final answer: 082" -> "82" and the loose fallback are
preserved.
- regrade_trace_file: every case with a MODEL_OUTPUT block was graded,
including interrupted ones (STOPPED/SKIPPED/SWITCHED/ERROR). Their
partial output inflated passed/failed and raised spurious "changed"
drift. Grading is now factored into regrade_case_outcome(), which only
grades PASSED/FAILED (and legacy status-less) traces; others are
reported via a new not_graded counter.
Tested: make (Metal) and make cpu build clean (-Wall -Wextra);
./ds4-eval --self-test-extractors passes; regrade of a 3-case trace
(1 PASSED, 2 STOPPED) reports passed=1 not_graded=2 changed=0
(was passed=2 failed=1 changed=1).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
find_answer_letter returned the first boundary-isolated in-range capital
after "Answer:", so on 10-choice (A-J) cases a leading English pronoun or
article was graded as the choice:
"Answer: I think it is C" -> I (should be C)
"Answer: I'll go with C." -> I (should be C)
"Answer: A careful reading ... C" -> A (should be C)
24 embedded cases are 10-choice, so this is reachable. The forward scan
now skips a standalone capital that begins a same-line word or a
contraction; the reverse scan still recovers it when it is the only
candidate ("Answer: A is correct"), and a genuine standalone answer
("Answer: I.") is unchanged.
Known limitation: a distractor explicitly rejected *before* the chosen
letter on the same line ("... rules out C, leaving D") is still misread;
that needs sentence-level parsing and is left for discussion.
Locked by new --self-test-extractors cases (model-free). All prior
self-tests and the integer/regrade cases continue to pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Resolves the residual multiple-choice false negative tracked in antirez#321 (follow-up to antirez#319). find_answer_letter returned the first boundary-isolated in-range capital on the answer line, so when the model rejects a distractor before stating its pick the rejected letter won: "Answer: It is not B, the answer is D" -> B (should be D) "Answer: rules out C, leaving D" -> C (should be D) The forward scan now skips an in-range capital that is the object of an explicit, adjacent rejection -- "not B", "isn't B", "rule(s/d) out C", "eliminate/reject/except E". The cue set is deliberately small and high-precision: it rewrites only clear elimination phrasing and makes no attempt at general selection-vs-rejection sentence parsing, which is the principled limit noted in antirez#321. It never looks before the answer marker or across a newline, so a leading pick is accepted before any later rejected distractor is inspected -- "Answer: D, not B" still grades D, the regression that blocks the naive "take the last letter" fix. Locked by new --self-test-extractors cases (model-free, no model/GPU), including the "D, not B" guard. All prior extractor, integer, and regrade self-tests continue to pass. Tested: make ds4-eval (Metal) and make cpu build clean (-Wall -Wextra); ./ds4-eval --self-test-extractors passes on both the Metal and CPU binaries. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes ds4-agent so that it can run with
--backend rocm.Testing evidence
Successfully ran
ds4-agenton a Framework Desktop: