Skip to content

ds4_agent: fix rocm support#383

Open
mcmalayalam wants to merge 21 commits into
antirez:mainfrom
mcmalayalam:ds4-agent_fix_rocm
Open

ds4_agent: fix rocm support#383
mcmalayalam wants to merge 21 commits into
antirez:mainfrom
mcmalayalam:ds4-agent_fix_rocm

Conversation

@mcmalayalam

Copy link
Copy Markdown

Fixes ds4-agent so that it can run with --backend rocm.

Testing evidence

Successfully ran ds4-agent on a Framework Desktop:

$ ./ds4-agent \
        --backend rocm \
        --threads 30 \
        --model ~/models/antirez/deepseek-v4-gguf/DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf \
        --mtp ~/models/antirez/deepseek-v4-gguf/DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf \
        --mtp-draft 1 \
        --ctx 16000 \
ds4: MTP support model loaded: /home/sam/models/antirez/deepseek-v4-gguf/DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf (draft=1)
ds4: ROCm backend initialized on AMD Radeon Graphics (sm_115)
ds4: ROCm chunk-copying 80.76 GiB model image
ds4: ROCm loading model tensors into device cache: 80.06 GiBds4: ROCm model chunk copy complete in 18.271s (80.76 GiB tensors)
ds4: ROCm chunk-copying 3.55 GiB model image
ds4: ROCm loading model tensors into device cache: 2.06 GiBds4: ROCm model chunk copy complete in 0.782s (3.55 GiB tensors)
ds4: ROCm preparing model tensor mappings: 80.24 GiB
ds4: ROCm startup model preparation covered 80.76 GiB of tensor spans in 0.694s
ds4: ROCm preparing model tensor mappings: 2.38 GiB
ds4: ROCm startup model preparation covered 3.55 GiB of tensor spans in 0.012s
ds4: rocm backend initialized for graph diagnostics
DwarfStar 🐋 Agent, context 16k tokens


ds4-agent> 

mcmalayalam and others added 21 commits June 10, 2026 17:24
Add prompt_ms, prompt_per_second, predicted_ms, predicted_per_second
timing fields to streaming and non-streaming OpenAI API responses.

- Define server_timings struct with prompt/decode timing fields
- Extend server_prefill_progress with prompt_ms/prompt_per_second
- Store final prefill timing in server_progress_cb()
- Add append_openai_timings_json() helper
- Thread timings through sse_usage_chunk, sse_done, sse_chat_finish,
  openai_sse_finish_live, and final_response
- Capture prefill timing after sync completes in generate_job()
- Compute decode timing after decode loop and build timings struct
- Emit timings in streaming SSE usage chunk and non-streaming JSON body
- Add test_openai_final_response_reports_timings() for non-streaming
- Update test_openai_stream_usage_reports_cache_details to verify timings
- Fix existing openai_sse_finish_live test calls to pass NULL timings
Track routed-expert cache entries referenced by encoded Metal command buffers and prevent cache eviction or buffer reuse until the owning command buffers have completed. This preserves layer-batched decode while avoiding stale buffer captures under SSD-streaming cache pressure.

Add a focused Metal SSD-streaming cache-pressure regression for issue antirez#384.
Avoid eager ROCm model copies for range/span startup. Distributed layer slices can be sparse, so materializing the full mapped image or the min..max envelope can allocate far more memory than the selected tensors need. Keep the map metadata and let accelerator_cache_model_tensors() prepare the exact tensor spans instead.\n\nFixes antirez#387.
…ants)

GGUFs whose routed experts are not uniformly quantized across layers (e.g.
IQ2_XXS/Q2_K everywhere with a few layers upcast to Q4_K via --tensor-type)
failed every request under --ssd-streaming: the batch-selected-addr prefill
maps layers with the decode-static span set (no exps tensors), the Metal
encoder's IQ2-only gate sends non-IQ2 layers to the full-tensor
wrap_model_range fallback, and no installed view covers the range
("Metal model range ... is not covered by mapped model views",
"gpu layer N ffn batch encode failed"). Latently, the expert cache's
single-size-class slab allocator would also be poisoned by off-size
experts (last-writer-wins note_expert_size; budget distortion; potential
slab-reuse starvation after an mlock cap).

Fix, in two pieces:

- Detect boosted layers (per-expert bytes != the slab class derived from
  the first routed layer; weights_streaming_layer_experts_uniform) and
  include their ffn_{gate,up,down}_exps tensors in the decode-span
  builders (model_map_span_vec_include_layer_decode), so the existing
  wrap_model_range / wrap_model_exact_range fallbacks find their ranges
  covered for both prefill and decode.

- Pre-seed the expert-cache slab size class at startup
  (ds4_gpu_set_streaming_expert_cache_expert_bytes; no-op stubs for
  CUDA/ROCm) and make note_expert_size freeze+reject instead of
  last-writer-wins: off-size layers deterministically bypass the cache
  and use the mapped-view path. Startup logs the uniform/boosted split
  and warns if the majority of routed layers are off-class.

Uniform models are byte-identical by construction (helper returns true
for every layer -> no extra spans; the pre-seed equals what the first
note call would have written) and verified by greedy replay of recorded
benchmark answers. Validated on DeepSeek-V4-Flash 97GB 6-layer-Q4_K
boost on a 64GB machine (suite of code benchmarks, 0 transport errors);
diagnosis confirmed end-to-end via the legacy-path env escape on the
unpatched binary. Details, validation table and reviewer notes in
STREAMING_MIXED_PRECISION.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
(cherry picked from commit fefa426)
Share the routed expert byte-size calculation between the base streaming cache sizing and per-layer mixed-precision checks, preserving overflow handling in both paths. Remove the standalone mixed-streaming design note, since this behavior is an implementation detail rather than user documentation.
Document the release validation pass across Metal, CUDA, ROCm, SSD streaming, distributed inference, disk KV cache, server APIs, and ds4-agent. Include the mixed-quant Flash SSD streaming regression check for boosted Q4 routed-expert layers.
Add ROCm SSD streaming support for PRO/Flash paths, including selected expert caching, overlapped reads, full-layer streaming prefill, resident cache lookups, and ROCm cache reuse handling. Add the corresponding CUDA SSD streaming expert cache backend and shared graph integration, with bounded cache allocation and progress reporting.\n\nThis squashes the rocm-streaming branch work on top of current main.
Track live Metal tensor handles under a mutex so stale or duplicate opaque tensor frees do not hand an already-released Objective-C object to ARC. Serialize the runtime tensor allocation counters through the same lock, including memory reports and cleanup.

Closes antirez#404
Do not default ROCm builds to an 8192-token prefill workspace for long Flash prompts. On Strix Halo, full IQ2 weight caching plus that workspace can exhaust HSA allocation space before generation starts; the normal Flash default of 4096 leaves enough room while preserving explicit --prefill-chunk overrides.

Closes antirez#387
Let ROCm tensor range lookup use existing prepared ranges and fd-backed caching before falling back to uncached copies when a request comes from a non-current model mmap. This lets the main GGUF and the MTP GGUF coexist without forcing the main model through the unlimited copy path.

When a second model mmap is registered, disable the optional expanded Q8/F16 cache on ROCm. The cache is only an acceleration path and can consume the memory margin needed for MTP session/context tensors on Strix Halo UMA systems; normal Q8 kernels remain available.

Closes antirez#425
Guard dynamic buffer growth against capacity wraparound, normalize malformed surrogate escapes to replacement characters without consuming unrelated escapes, and add a tool-heavy OpenCode-shaped parser stress test for issue antirez#405.

Closes antirez#405.
Use a const pointer for stop-list substring matches so GCC/ROCm builds remain warning-clean.
Two model-free correctness fixes in the answer grader, locked by new
--self-test-extractors cases (no model/GPU needed):

- find_integer_answer: the answer-marker path took the first integer in
  the window, so an answer line that shows its arithmetic
  ("Answer: m+n = 256+37 = 293") was graded as the first summand (256)
  instead of the stated total (293) -> false negative. Reachable on the
  many embedded AIME2025 "Find m+n / a+b+c" cases. The scan is now bound
  to the answer line and, when it shows arithmetic, reads the value after
  the last '='. "Final answer: 082" -> "82" and the loose fallback are
  preserved.

- regrade_trace_file: every case with a MODEL_OUTPUT block was graded,
  including interrupted ones (STOPPED/SKIPPED/SWITCHED/ERROR). Their
  partial output inflated passed/failed and raised spurious "changed"
  drift. Grading is now factored into regrade_case_outcome(), which only
  grades PASSED/FAILED (and legacy status-less) traces; others are
  reported via a new not_graded counter.

Tested: make (Metal) and make cpu build clean (-Wall -Wextra);
./ds4-eval --self-test-extractors passes; regrade of a 3-case trace
(1 PASSED, 2 STOPPED) reports passed=1 not_graded=2 changed=0
(was passed=2 failed=1 changed=1).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
find_answer_letter returned the first boundary-isolated in-range capital
after "Answer:", so on 10-choice (A-J) cases a leading English pronoun or
article was graded as the choice:

    "Answer: I think it is C"        -> I   (should be C)
    "Answer: I'll go with C."        -> I   (should be C)
    "Answer: A careful reading ... C" -> A   (should be C)

24 embedded cases are 10-choice, so this is reachable. The forward scan
now skips a standalone capital that begins a same-line word or a
contraction; the reverse scan still recovers it when it is the only
candidate ("Answer: A is correct"), and a genuine standalone answer
("Answer: I.") is unchanged.

Known limitation: a distractor explicitly rejected *before* the chosen
letter on the same line ("... rules out C, leaving D") is still misread;
that needs sentence-level parsing and is left for discussion.

Locked by new --self-test-extractors cases (model-free). All prior
self-tests and the integer/regrade cases continue to pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Resolves the residual multiple-choice false negative tracked in antirez#321
(follow-up to antirez#319). find_answer_letter returned the first
boundary-isolated in-range capital on the answer line, so when the model
rejects a distractor before stating its pick the rejected letter won:

    "Answer: It is not B, the answer is D"   -> B   (should be D)
    "Answer: rules out C, leaving D"         -> C   (should be D)

The forward scan now skips an in-range capital that is the object of an
explicit, adjacent rejection -- "not B", "isn't B", "rule(s/d) out C",
"eliminate/reject/except E". The cue set is deliberately small and
high-precision: it rewrites only clear elimination phrasing and makes no
attempt at general selection-vs-rejection sentence parsing, which is the
principled limit noted in antirez#321. It never looks before the answer marker
or across a newline, so a leading pick is accepted before any later
rejected distractor is inspected -- "Answer: D, not B" still grades D,
the regression that blocks the naive "take the last letter" fix.

Locked by new --self-test-extractors cases (model-free, no model/GPU),
including the "D, not B" guard. All prior extractor, integer, and
regrade self-tests continue to pass.

Tested: make ds4-eval (Metal) and make cpu build clean (-Wall -Wextra);
./ds4-eval --self-test-extractors passes on both the Metal and CPU
binaries.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants