ds4_agent: fix rocm support by mcmalayalam · Pull Request #383 · antirez/ds4

mcmalayalam · 2026-06-10T16:31:59Z

Fixes ds4-agent so that it can run with --backend rocm.

Testing evidence

Successfully ran ds4-agent on a Framework Desktop:

$ ./ds4-agent \
        --backend rocm \
        --threads 30 \
        --model ~/models/antirez/deepseek-v4-gguf/DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf \
        --mtp ~/models/antirez/deepseek-v4-gguf/DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf \
        --mtp-draft 1 \
        --ctx 16000 \
ds4: MTP support model loaded: /home/sam/models/antirez/deepseek-v4-gguf/DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf (draft=1)
ds4: ROCm backend initialized on AMD Radeon Graphics (sm_115)
ds4: ROCm chunk-copying 80.76 GiB model image
ds4: ROCm loading model tensors into device cache: 80.06 GiBds4: ROCm model chunk copy complete in 18.271s (80.76 GiB tensors)
ds4: ROCm chunk-copying 3.55 GiB model image
ds4: ROCm loading model tensors into device cache: 2.06 GiBds4: ROCm model chunk copy complete in 0.782s (3.55 GiB tensors)
ds4: ROCm preparing model tensor mappings: 80.24 GiB
ds4: ROCm startup model preparation covered 80.76 GiB of tensor spans in 0.694s
ds4: ROCm preparing model tensor mappings: 2.38 GiB
ds4: ROCm startup model preparation covered 3.55 GiB of tensor spans in 0.012s
ds4: rocm backend initialized for graph diagnostics
DwarfStar 🐋 Agent, context 16k tokens


ds4-agent>

Add prompt_ms, prompt_per_second, predicted_ms, predicted_per_second timing fields to streaming and non-streaming OpenAI API responses. - Define server_timings struct with prompt/decode timing fields - Extend server_prefill_progress with prompt_ms/prompt_per_second - Store final prefill timing in server_progress_cb() - Add append_openai_timings_json() helper - Thread timings through sse_usage_chunk, sse_done, sse_chat_finish, openai_sse_finish_live, and final_response - Capture prefill timing after sync completes in generate_job() - Compute decode timing after decode loop and build timings struct - Emit timings in streaming SSE usage chunk and non-streaming JSON body - Add test_openai_final_response_reports_timings() for non-streaming - Update test_openai_stream_usage_reports_cache_details to verify timings - Fix existing openai_sse_finish_live test calls to pass NULL timings

Track routed-expert cache entries referenced by encoded Metal command buffers and prevent cache eviction or buffer reuse until the owning command buffers have completed. This preserves layer-batched decode while avoiding stale buffer captures under SSD-streaming cache pressure. Add a focused Metal SSD-streaming cache-pressure regression for issue antirez#384.

Avoid eager ROCm model copies for range/span startup. Distributed layer slices can be sparse, so materializing the full mapped image or the min..max envelope can allocate far more memory than the selected tensors need. Keep the map metadata and let accelerator_cache_model_tensors() prepare the exact tensor spans instead.\n\nFixes antirez#387.

…ants) GGUFs whose routed experts are not uniformly quantized across layers (e.g. IQ2_XXS/Q2_K everywhere with a few layers upcast to Q4_K via --tensor-type) failed every request under --ssd-streaming: the batch-selected-addr prefill maps layers with the decode-static span set (no exps tensors), the Metal encoder's IQ2-only gate sends non-IQ2 layers to the full-tensor wrap_model_range fallback, and no installed view covers the range ("Metal model range ... is not covered by mapped model views", "gpu layer N ffn batch encode failed"). Latently, the expert cache's single-size-class slab allocator would also be poisoned by off-size experts (last-writer-wins note_expert_size; budget distortion; potential slab-reuse starvation after an mlock cap). Fix, in two pieces: - Detect boosted layers (per-expert bytes != the slab class derived from the first routed layer; weights_streaming_layer_experts_uniform) and include their ffn_{gate,up,down}_exps tensors in the decode-span builders (model_map_span_vec_include_layer_decode), so the existing wrap_model_range / wrap_model_exact_range fallbacks find their ranges covered for both prefill and decode. - Pre-seed the expert-cache slab size class at startup (ds4_gpu_set_streaming_expert_cache_expert_bytes; no-op stubs for CUDA/ROCm) and make note_expert_size freeze+reject instead of last-writer-wins: off-size layers deterministically bypass the cache and use the mapped-view path. Startup logs the uniform/boosted split and warns if the majority of routed layers are off-class. Uniform models are byte-identical by construction (helper returns true for every layer -> no extra spans; the pre-seed equals what the first note call would have written) and verified by greedy replay of recorded benchmark answers. Validated on DeepSeek-V4-Flash 97GB 6-layer-Q4_K boost on a 64GB machine (suite of code benchmarks, 0 transport errors); diagnosis confirmed end-to-end via the legacy-path env escape on the unpatched binary. Details, validation table and reviewer notes in STREAMING_MIXED_PRECISION.md. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> (cherry picked from commit fefa426)

Share the routed expert byte-size calculation between the base streaming cache sizing and per-layer mixed-precision checks, preserving overflow handling in both paths. Remove the standalone mixed-streaming design note, since this behavior is an implementation detail rather than user documentation.

Document the release validation pass across Metal, CUDA, ROCm, SSD streaming, distributed inference, disk KV cache, server APIs, and ds4-agent. Include the mixed-quant Flash SSD streaming regression check for boosted Q4 routed-expert layers.

Add ROCm SSD streaming support for PRO/Flash paths, including selected expert caching, overlapped reads, full-layer streaming prefill, resident cache lookups, and ROCm cache reuse handling. Add the corresponding CUDA SSD streaming expert cache backend and shared graph integration, with bounded cache allocation and progress reporting.\n\nThis squashes the rocm-streaming branch work on top of current main.

Track live Metal tensor handles under a mutex so stale or duplicate opaque tensor frees do not hand an already-released Objective-C object to ARC. Serialize the runtime tensor allocation counters through the same lock, including memory reports and cleanup. Closes antirez#404

Do not default ROCm builds to an 8192-token prefill workspace for long Flash prompts. On Strix Halo, full IQ2 weight caching plus that workspace can exhaust HSA allocation space before generation starts; the normal Flash default of 4096 leaves enough room while preserving explicit --prefill-chunk overrides. Closes antirez#387

Let ROCm tensor range lookup use existing prepared ranges and fd-backed caching before falling back to uncached copies when a request comes from a non-current model mmap. This lets the main GGUF and the MTP GGUF coexist without forcing the main model through the unlimited copy path. When a second model mmap is registered, disable the optional expanded Q8/F16 cache on ROCm. The cache is only an acceleration path and can consume the memory margin needed for MTP session/context tensors on Strix Halo UMA systems; normal Q8 kernels remain available. Closes antirez#425

Guard dynamic buffer growth against capacity wraparound, normalize malformed surrogate escapes to replacement characters without consuming unrelated escapes, and add a tool-heavy OpenCode-shaped parser stress test for issue antirez#405. Closes antirez#405.

Use a const pointer for stop-list substring matches so GCC/ROCm builds remain warning-clean.

Two model-free correctness fixes in the answer grader, locked by new --self-test-extractors cases (no model/GPU needed): - find_integer_answer: the answer-marker path took the first integer in the window, so an answer line that shows its arithmetic ("Answer: m+n = 256+37 = 293") was graded as the first summand (256) instead of the stated total (293) -> false negative. Reachable on the many embedded AIME2025 "Find m+n / a+b+c" cases. The scan is now bound to the answer line and, when it shows arithmetic, reads the value after the last '='. "Final answer: 082" -> "82" and the loose fallback are preserved. - regrade_trace_file: every case with a MODEL_OUTPUT block was graded, including interrupted ones (STOPPED/SKIPPED/SWITCHED/ERROR). Their partial output inflated passed/failed and raised spurious "changed" drift. Grading is now factored into regrade_case_outcome(), which only grades PASSED/FAILED (and legacy status-less) traces; others are reported via a new not_graded counter. Tested: make (Metal) and make cpu build clean (-Wall -Wextra); ./ds4-eval --self-test-extractors passes; regrade of a 3-case trace (1 PASSED, 2 STOPPED) reports passed=1 not_graded=2 changed=0 (was passed=2 failed=1 changed=1). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

find_answer_letter returned the first boundary-isolated in-range capital after "Answer:", so on 10-choice (A-J) cases a leading English pronoun or article was graded as the choice: "Answer: I think it is C" -> I (should be C) "Answer: I'll go with C." -> I (should be C) "Answer: A careful reading ... C" -> A (should be C) 24 embedded cases are 10-choice, so this is reachable. The forward scan now skips a standalone capital that begins a same-line word or a contraction; the reverse scan still recovers it when it is the only candidate ("Answer: A is correct"), and a genuine standalone answer ("Answer: I.") is unchanged. Known limitation: a distractor explicitly rejected *before* the chosen letter on the same line ("... rules out C, leaving D") is still misread; that needs sentence-level parsing and is left for discussion. Locked by new --self-test-extractors cases (model-free). All prior self-tests and the integer/regrade cases continue to pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Resolves the residual multiple-choice false negative tracked in antirez#321 (follow-up to antirez#319). find_answer_letter returned the first boundary-isolated in-range capital on the answer line, so when the model rejects a distractor before stating its pick the rejected letter won: "Answer: It is not B, the answer is D" -> B (should be D) "Answer: rules out C, leaving D" -> C (should be D) The forward scan now skips an in-range capital that is the object of an explicit, adjacent rejection -- "not B", "isn't B", "rule(s/d) out C", "eliminate/reject/except E". The cue set is deliberately small and high-precision: it rewrites only clear elimination phrasing and makes no attempt at general selection-vs-rejection sentence parsing, which is the principled limit noted in antirez#321. It never looks before the answer marker or across a newline, so a leading pick is accepted before any later rejected distractor is inspected -- "Answer: D, not B" still grades D, the regression that blocks the naive "take the last letter" fix. Locked by new --self-test-extractors cases (model-free, no model/GPU), including the "D, not B" guard. All prior extractor, integer, and regrade self-tests continue to pass. Tested: make ds4-eval (Metal) and make cpu build clean (-Wall -Wextra); ./ds4-eval --self-test-extractors passes on both the Metal and CPU binaries. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

mcmalayalam and others added 21 commits June 10, 2026 17:24

ds4_agent: fix rocm support

2b3de37

Add /health endpoint that returns HTTP 200 once model is fully loaded

ce1c8ea

Add release QA checklist

11f2664

Document the release validation pass across Metal, CUDA, ROCm, SSD streaming, distributed inference, disk KV cache, server APIs, and ds4-agent. Include the mixed-quant Flash SSD streaming regression check for boosted Q4 routed-expert layers.

Fix ROCm mixed streaming expert fallback

f5bd767

Refactor streaming expert cache API

c19687b

Fix CPU build after streaming refactor

76e2c13

Fix distributed SSD streaming layer slices

a0e27d8

Fix server const warning

f4d6392

Use a const pointer for stop-list substring matches so GCC/ROCm builds remain warning-clean.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ds4_agent: fix rocm support#383

ds4_agent: fix rocm support#383
mcmalayalam wants to merge 21 commits into
antirez:mainfrom
mcmalayalam:ds4-agent_fix_rocm

mcmalayalam commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

mcmalayalam commented Jun 10, 2026

Testing evidence

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants