Inference Radar

TL;DR

Memory is the battleground: vLLM, SGLang, TensorRT-LLM, FlashInfer, ROCm AITER, and OpenVINO all pushed KV-cache, MLA, MoE, and low-bit serving paths because long-context inference is now gated by memory more than FLOPs.1
Local inference got more serious: llama.cpp and whisper.cpp shipped a heavy backend week across Vulkan, Hexagon, SYCL, CUDA, Metal, OpenVINO, and server UX, making local runtimes look more like production serving systems.2
Edge AI moved from samples to toolchains: Google LiteRT, ExecuTorch, Qualcomm AI Hub, sherpa-onnx, Cactus, and MNN all expanded model export, QNN, NPU, Apple Neural Engine, and mobile runtime paths.
Apple Silicon inference kept widening: Apple MLX added Metal zero-copy DLPack interop and more quantized matmul work, while the wider MLX ecosystem pushed VLM, audio, speculative decoding, and app-serving fixes.3
Inference economics moved up a level: OpenAI disclosed a custom inference processor with Broadcom, Qualcomm moved to buy Modular, and Baseten reportedly chased another large round, all pointing to inference as the new margin fight.

This Week in Inference

Moonshot AI released a one-trillion-parameter Kimi coding MoE that claims lower reasoning-token use, and the exo repo added a model card for Kimi Code with distributed Mac Studio test notes and vision-tower metadata extraction.4 xAI shipped a faster Grok Imagine video model with synchronized audio generation. That matters because multimodal inference now looks less like a chain of small services and more like one composite serving path.5 The practical point is simple: model launches now compete on serving cost, context behavior, and multimodal latency as much as on benchmark scores.

The optimization story was even clearer. KV-cache compression kept moving from paper to implementation: vLLM added INT4 per-token-head KV-cache quantization, while OpenVINO, SGLang, TensorRT-LLM, ROCm AITER, and FlashInfer all worked on adjacent KV, MLA, paged-attention, and low-bit decode paths.6 Speculative decoding also kept turning into a default assumption rather than an optional trick, with SGLang refactoring draft runners and vLLM fixing spec-decode serving behavior.7
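To make the per-token-head idea concrete, here is a rough sketch of 4-bit KV-cache quantization where every (token, head) vector carries its own scale and minimum. It is not vLLM's Triton kernel: the function names, the asymmetric min/max scheme, the 8 bytes of metadata per vector, and the model shape in the usage note are all assumptions chosen for illustration.

```python
from typing import List, Tuple


def quantize_head_int4(values: List[float]) -> Tuple[bytes, float, float]:
    """Quantize one (token, head) vector to 4-bit codes with its own scale and minimum.

    Hypothetical helper for illustration; real kernels do this on the GPU.
    """
    if len(values) % 2:
        raise ValueError("head_dim must be even to pack two codes per byte")
    lo, hi = min(values), max(values)
    scale = max(hi - lo, 1e-8) / 15.0  # 16 levels: codes 0..15
    codes = [min(15, max(0, round((v - lo) / scale))) for v in values]
    # Pack two 4-bit codes into each byte: low nibble first, high nibble second.
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))
    return packed, scale, lo


def dequantize_head_int4(packed: bytes, scale: float, lo: float) -> List[float]:
    """Invert quantize_head_int4 up to a rounding error of about scale / 2."""
    out: List[float] = []
    for byte in packed:
        out.append((byte & 0x0F) * scale + lo)
        out.append((byte >> 4) * scale + lo)
    return out


def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bits: int, meta_bytes_per_group: int = 0) -> int:
    """Estimate KV-cache size; one group is one (layer, token, head) vector of K or V."""
    groups = 2 * layers * tokens * kv_heads  # factor 2 covers keys and values
    return groups * (head_dim * bits // 8 + meta_bytes_per_group)
```

For a hypothetical model with 32 layers, 8 KV heads, and a head dimension of 128 at a 131,072-token context, `kv_cache_bytes` gives 16 GiB at FP16 and 4.5 GiB at INT4 with 8 bytes of per-vector metadata. That gap is why cache precision, rather than FLOPs, tends to decide how much context fits on a device.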

Hardware news gave the week its market frame. OpenAI and Broadcom disclosed an LLM inference processor aimed at performance per watt, while Qualcomm moved to acquire Modular, a deal that puts compiler, runtime, and deployment software closer to Qualcomm’s Snapdragon and edge inference strategy.8 The code data matched the deals: Qualcomm AI Hub expanded GenieX and model tooling, Modular pushed Apple GPU serving and model coverage, and Google’s LiteRT stack kept building the path from export to on-device apps.9

Deeper Dive

Everything below is for readers who want the full picture. Feel free to scroll.

Code Changes by Category

Cloud & Datacenter Serving

vLLM had the most important cloud-serving week in raw systems scope: the project merged Blackwell sparse MLA and MoE enablement, Triton INT4 KV-cache quantization, FlashInfer NVFP4 GEMM priority routing, DCP and sparse FlashAttention work, ROCm and XPU cleanup, and stable-libtorch kernel migration in the vLLM activity.21 vLLM Gaudi tracked upstream scheduler and MoE churn with MoERunner adaptation, FastAPI pinning, HPU sampler fixes, Qwen evaluation coverage, and hourly CI fixes in vllm-gaudi.22

SGLang moved at similar speed, with FP8/NVFP4 MoE paths, speculative decoding refactors, scheduler and ZMQ cleanup, HiCache/NIXL reliability fixes, and model coverage for GLM, DeepSeek, Kimi, Qwen, Krea, and Wan in SGLang.23 Ray Serve improved HAProxy and direct ingress reliability, added CPU-only vLLM validation, and expanded KV connector and routing tests for Serve LLM in Ray.24 ai-dynamo worked on header-based session affinity, routing priority, policy queues, engine integrations for vLLM, SGLang and TensorRT-LLM, Kubernetes conversion safety, observability, replay, and SBOM hygiene in Dynamo.25

NVIDIA’s cloud stack split across TensorRT, TensorRT-LLM, Triton Inference Server, CUTLASS, and FlashInfer. TensorRT added CUDA, Python, container, ragged attention, and plugin updates in TensorRT.26 TensorRT-LLM advanced DeepSeek/MoE runtime cache work, FlashInfer Blackwell cross-attention, multimodal CUDA graph wrapping, MiniMax-M3 bring-up, disaggregated serving tests, VisualGen paths, and a large CI refresh in TensorRT-LLM.27 Triton Inference Server capped streaming tool-call parse buffers, redacted restricted API headers, and added Dynamo launch-mode QA in Triton Server.28
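The Triton Server change is reported here only as a cap on streaming tool-call parse buffers. As a minimal sketch of what such a cap can look like, assuming a parser that accumulates streamed text deltas until a tool call is complete, the class below refuses to grow past a byte limit. The class name, exception, and limit are hypothetical and are not taken from Triton's code.

```python
from typing import List


class ParseBufferOverflow(Exception):
    """Raised when streamed tool-call text exceeds the configured byte cap."""


class BoundedParseBuffer:
    """Accumulates streamed text for a tool-call parser up to max_bytes (hypothetical)."""

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes
        self._chunks: List[str] = []
        self._size = 0

    def feed(self, delta: str) -> None:
        # Count UTF-8 bytes, not characters, so multi-byte text cannot bypass the cap.
        size = len(delta.encode("utf-8"))
        if self._size + size > self.max_bytes:
            raise ParseBufferOverflow(
                f"tool-call buffer would reach {self._size + size} bytes "
                f"(cap {self.max_bytes})"
            )
        self._chunks.append(delta)
        self._size += size

    def text(self) -> str:
        return "".join(self._chunks)
```

Rejecting the delta before appending keeps the buffer in a valid state after an overflow, so the caller can end the stream with an error instead of holding unbounded model output in memory.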

ROCm’s AITER and ATOM pushed AMD inference throughput hard: MLA, paged attention, KV-cache, DeepSeek-V4, MiniMax-M3, Qwen, GLM, gfx12, MI350, MI450, and RDNA paths all moved in one week in AITER.29 AMDMIGraphX added GPU NonMaxSuppression, symbolic ONNX parsing, pointer deref for paged-attention tests, tuning inputs, quantized ONNX fixes, and gfx942 build work in AMDMIGraphX.30 FlashInfer added BF16 by FP4 W4A16 GEMM, FP4 tactics, GDN context-parallel kernels, BF16-state speculative modes, TensorRT-LLM ragged MLA query metadata, and serving-critical cache and sampling fixes in FlashInfer.31

Local LLM Runtimes

llama.cpp dominated local runtime code volume with backend work across Hexagon, Vulkan, SYCL, CUDA, WebGPU, OpenCL, AMX, and model conversion, plus server progress events, router IPC, speculative-model progress, draft-context checks, tool-call IDs, Jinja support, JSON-schema grammar fixes, and Windows UTF-8 paths in llama.cpp.32 whisper.cpp synced ggml from llama.cpp, refreshed OpenVINO behavior, fixed Windows BLAS artifacts, added Parakeet Apple xcframework support, and improved SYCL, Vulkan, Metal, CUDA, OpenCL, and WebGPU coverage in whisper.cpp.33

Ollama updated llama.cpp integration, added CUDA platform entries, fixed prompt shifting for long coding-agent prompts, aligned /api/generate with chat-template rendering, improved GPU memory accounting and multimodal projector offload, advanced MLX speculative decoding, and added Claude Code and OpenCode launch integrations in Ollama.34 LocalAI shipped a broad perception, voice, distributed-serving, UI, and privacy release with depth, sound-event, TTS, and PII-filtering backends, plus realtime compaction and GPU defaults in LocalAI.35 llamafile improved GPU support docs and AMD MI300-class ROCm build scripts, while keeping llama.cpp update workflow and GitHub Actions pinning under review in llamafile.36

Apple Silicon & MLX Ecosystem

Apple MLX added Metal zero-copy DLPack interop, CUDA and Metal quantized matmul work, JIT fixes, Array API coverage, autodiff improvements, quantized LLM correctness fixes, and build/runtime hardening in MLX.19 mlx-lm fixed DeepSeek and GLM DSA indexer RoPE behavior, while mlx-swift-lm added speculative-decoding telemetry, model conversion, LFM embedders, Gemma loading fixes, KV-cache scheme selection, and iOS build fixes in mlx-lm.37

The third-party MLX layer moved fast. oMLX shipped GLM MoE DSA pre-load, FP8 sensitivity streaming, Sparse MLA kernels, SSD cache accounting fixes, prefix-cache serialization, in-flight model unload protection, and macOS app polish in oMLX.38 mlx-vlm added Moondream2, accelerated PaddleOCR-VL, Qwen Gated Delta prefill, TurboQuant prefill, server reload fixes, OpenAI-compatible reasoning fields, and safer streaming cache cleanup in mlx-vlm.39 mlx-audio added Higgs batch generation, MOSS-TTS, explicit reference-audio reuse, and broad thread-safety cleanup before MLX arrays cross worker threads in mlx-audio.40

Mobile & Edge Frameworks

ExecuTorch widened Arm/TOSA support, added MXFP formats, improved shape serialization, added MLX serving support for a Qwen C++ runner, improved CUDA/Gemma long-context and int8 paths, and worked across WebGPU, Vulkan, XNNPACK, Qualcomm, and NXP backends in ExecuTorch.41 Google’s LiteRT stack added memory-safety validation, Gemma runner work, LiteRT-LM multimodal CLI support, tool calling, activation dtype control, packaging, AI Edge Gallery Hugging Face import for LiteRT-LM models, XNNPACK kernels, and AI Edge Quantizer profiling-backed calibration in LiteRT-LM.42

Qualcomm expanded GenieX tutorials, ai-hub model discovery, runtime metadata, Llama and Qwen catalog coverage, Windows and Android sample app onboarding, and Nexa SDK sync work in ai-hub-models.9 sherpa-onnx made Qualcomm QNN the center of its week, adding export and C++ runtime support for Whisper, Parakeet CTC, and Parakeet TDT-CTC models in sherpa-onnx.43 MNN added fused LLM runtime/operator work, QLoRA export docs, RISC-V Vector optimizations, Metal attention fixes, Vulkan Range, and shader-generator cleanup in MNN.44 Cactus added batch inference, continuous batching, MoE support for LFM, Qwen and Gemma, ANE audio fixes for Gemma, and Parakeet transcription cancellation in Cactus.45

Compilers, Runtimes & Graph Engines

Apache TVM shipped a release while expanding Relax frontend coverage for TFLite, ONNX, and PyTorch, adding TensorRT partitioning, Metal shader compilation, analyzer work, FFI coordination, and docs cleanup in TVM.46 ONNX and ONNX Runtime aligned Attention cache-mask semantics, while ONNX Runtime also added CUDA XQA defaults, sliding-window XQA decode, attention-sink decode, WebGPU fixes, MLAS/KleidiAI, QNN, graph optimizers, CUDA builds, and packaging updates in ONNX Runtime.47

JAX and OpenXLA spent the week on compiler and runtime internals: Pallas and Mosaic GPU/TPU lowering, ReplicaGroupV3, StrongLRUCache, ROCm VMM allocator plumbing, GPU collective cleanup, fusion/autotuning, async HLO changes, PJRT and IFRT interface hardening, and StableHLO/VHLO compatibility fixes in OpenXLA.48 Triton moved Python bindings to nanobind, fixed CUDA graph capture lifetime, hardened AMD ROCm paths under PyTorch Inductor pressure, added gfx1250 TDM work, and advanced Blackwell/Gluon low-precision matmul in Triton.49 TileLang added LLVM backend support, backend registries, TMA and GMMA layout work, Blackwell stmatrix and scheduler fixes, and CUDA pipeline cleanup in TileLang.50

Models, Quantization & Optimization

Hugging Face added Krea to Diffusers, VideoPrism, MiniCPM, Nemotron ASR Streaming, DiffusionGemma fixes, Candle CUDA graph caching, CPU flash attention refactors, ARM NEON quantization, Metal SDPA routing fixes, and Diffusers BitsAndBytes on Apple MPS in Transformers.51 OpenVINO coordinated GPU, NPU, GenAI, and NNCF work on 4-bit KV-cache, shared-KV paged attention, Eagle speculative decoding, LLM compression examples, asymmetric compression, packed MatMulNBits names, and ASR pipeline support in OpenVINO GenAI.52 LiteLLM expanded Rust/OCR foundations, realtime translation, code-interpreter sandboxing, MCP permissions, security controls, provider coverage, SCIM, cost accounting, and admin UX in LiteLLM.53

DeepSpeed focused on reliability, with default gradient clipping, FP16 dynamic loss-scale validation, activation-checkpointing failures, ZeRO mixed-dtype all-gather fixes, async I/O descriptor cleanup, typing, CI cancellation, and DCO checks in DeepSpeed.54 KTransformers added MXFP8 MoE kernels for AMX and AVX2, MiniMax-M3 docs, SGLang submodule updates, CUDA Graph replay fixes, AVX512 build fixes, and DeepSeek setup guidance in KTransformers.55 LightLLM added Qwen performance work and queued FP8 W8A8 per-tensor quantization, FlashAttention and FlashInfer optimizations, token-ID validation, and Qwen streaming function-call fixes in LightLLM.56

Other Notable Changes

Osaurus shipped a heavy macOS AI-agent week with native image generation and editing, agent delegation, Computer Use evidence packs, remote-agent reliability, runtime model fixes, localization, telemetry, and six releases in Osaurus.57 RunanywhereAI improved Web SDK cold-start hydration, browser downloads, VLM thread-pool behavior, RAG model downloads, Vision image loading, and WASM dependency strategy in runanywhere-sdks.58 Zetic added on-device meeting notes, offline translation, health and camera demos, Android backup hardening, microphone permission fixes, and Swift model lifecycle cleanup in ZETIC Melange apps.59

Community Pulse

The loudest support themes were the same across projects: long-context memory pressure, tool-call parser compatibility, GPU backend mismatch, and install/build fragility. Ollama users pushed on Claude Code, OpenCode, reasoning-field compatibility, slot/KV-cache exposure, and GPU utilization in Ollama issues.60 Open WebUI users focused on streaming browser load, RAG payload bloat, Ollama context sizing, native web-search RAG, IME behavior, and deployment reliability in Open WebUI.61

Backend correctness reports were sharp. FlashAttention users found Blackwell SM100 non-contiguous head-dim correctness issues and moved from a contiguous workaround toward a stride-aware kernel in flash-attention.62 vLLM users tracked DFlash readiness, GLM optimization, INT2 KV-cache proposals, NVFP4 host-RAM OOM during JIT, Qwen tool-call format changes, and ROCm parser output bugs in vLLM.63 SGLang users reported GLM FP8 crashes, HiSparse long-context design needs, HiCache leaks, and ROCm FP8 correctness issues in SGLang.64

Edge users reported real deployment friction. WebLLM users hit Adreno WebGPU device-loss during warm-up in WebLLM.65 sherpa-onnx users asked for RKNN hotwords, Vietnamese Kokoro TTS, GPT-SoVITS ONNX, Hush speech enhancement, and publication references, and reported ZipVoice WASM heap corruption in sherpa-onnx.66 OpenVINO users reported GPU reset-state races, HETERO split behavior, Windows GPU discovery gaps, wrong boxes, and Parakeet compile OOM behavior in OpenVINO.67

Community Debates

llama.cpp maintainers pushed back on backend-first FP8 Vulkan work because reviewers wanted CPU quant-type viability, quality benchmarks, and staged changes before backend kernels in llama.cpp.68 That debate captures a recurring local-runtime pattern: new quant formats need model-quality proof before each accelerator backend grows custom code.

Ollama rejected slot and KV-cache API exposure proposals because maintainers want a limited, runner-agnostic public API instead of passing llama.cpp internals through the server in Ollama.69 The slot debate will keep coming back as coding agents and long-running desktop sessions ask for explicit cache save and restore.

vLLM maintainers split or closed several DFlash and sparse MLA proposals when PRs mixed too many concerns or added synchronization costs that clashed with CUDA graph assumptions in vLLM.70 The direction remains clear, but reviewers are forcing the path through narrower attention, SWA, FP8, and verification slices.

SGLang closed a PyTorch stack bump after unresolved CUDA graph, aux-stream, OOM, accuracy, Mooncake, and dependency blockers in SGLang.71 That closure shows why serving engines move slowly on base stack upgrades even when they move fast on kernels.

Apache TVM reversed explicit NaN-preservation wrappers after maintainers judged the common-path performance cost too high in TVM.72 The debate shows a classic compiler tradeoff: standards-edge behavior can lose when it taxes hot inference paths.

Google’s edge samples exposed license policy as a hard gate, with a Zero-DCE sample closed after non-commercial or academic-only terms failed the contribution bar in litert-samples.73 As model catalogs become app stores for on-device AI, license filtering becomes runtime infrastructure.

Worth Watching

KV-cache quantization will spread fast: vLLM’s INT4 work, OpenVINO’s 4-bit KV-cache path, LMDeploy’s FP8 KV-cache, and KVarN-style research all point to cache memory as the next standard optimization layer.6
Speculative decoding is becoming architecture work: SGLang’s decoupled draft runners, Ollama’s MLX speculative loop, ExecuTorch long-context serving, and vLLM spec-decode fixes show the feature moving into scheduler design.7
QNN and mobile NPUs are heating up: Qualcomm AI Hub, sherpa-onnx, MNN, LiteRT, ExecuTorch, Cactus, and Modular all touched QNN, NPU, ANE, or edge deployment paths this week.9
Tool-call parsers are production attack surfaces: Triton capped parser bytes, LMDeploy fixed XML tool-call cache leakage, Ollama and Open WebUI faced reasoning and tool compatibility issues, and vLLM moved parsing into derender paths.28
ROCm’s inference stack is becoming a vertical lane: AITER, ATOM, AMDMIGraphX, ROCm flash-attention, SGLang, and vLLM all moved AMD serving paths at once.29
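For readers newer to the speculative decoding item above, the core loop is draft-then-verify: a cheap model proposes several tokens and the target model keeps the longest prefix it agrees with. The sketch below is the greedy variant with toy callables standing in for models. All names are hypothetical, and a real engine would score every drafted position in one batched target pass (and use rejection sampling when sampling rather than decoding greedily).

```python
from typing import Callable, List, Sequence

# A "model" here is just a function from the token history to the next token.
NextToken = Callable[[Sequence[int]], int]


def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[int]:
    """Run one greedy draft-and-verify step and return the tokens accepted."""
    # Draft phase: the cheap model proposes k tokens autoregressively.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft_next(prefix + proposal))

    # Verify phase: this toy calls the target once per position for clarity.
    accepted: List[int] = []
    for tok in proposal:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # first mismatch: keep the target's token, stop
            return accepted
        accepted.append(tok)
    # Every draft token matched, so the target contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate(prefix: List[int], draft_next: NextToken, target_next: NextToken,
             k: int, max_new: int) -> List[int]:
    """Decode max_new tokens; the output matches target-only greedy decoding."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[: len(prefix) + max_new]
```

Because each step always emits at least one target-approved token, a poor draft model costs speed but not correctness. The scheduling questions (how many tokens to draft, how to batch verification, how to roll back cache state on rejection) are the part that pushes this feature into scheduler design.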

Major Releases

ggml shipped a rapid llama.cpp build train ending at b9784, with the week focused on Hexagon matmul overhaul, Vulkan correctness and shader behavior, server model-load progress, speculative model support, structured output fixes, and broad backend portability. whisper.cpp shipped v1.9.1 to fix Windows BLAS artifacts while continuing ggml sync and Parakeet packaging work. Latest llama.cpp release.74

NVIDIA shipped TensorRT v11.1, TensorRT-LLM v1.3.0rc19, and CUTLASS v4.2.2, v4.3.6, and v4.4.3. The dominant theme was Blackwell-era serving infrastructure: CUDA 13.3 defaults, Python 3.14 support, ragged TRT Attention, MiniMax-M3 and Step model coverage, and CUTLASS fixes for NVRTC, linking, and Blackwell blockscaled GEMM. TensorRT release notes.26

Microsoft shipped ONNX Runtime v1.27.0, focused on ONNX 1.21 support, CUDA package transition notes, CUDA 13 direction, and a SoftmaxCrossEntropyLoss label-bounds fix. The release followed a week of Attention KV-cache semantic alignment and LLM decode optimization work.47

Apache shipped TVM v0.25.0, covering Relax, frontend coverage, TIR, runtime, analyzer, TensorRT BYOC docs, Metal compilation, and coordinated tvm-ffi work. The release came with active follow-on work toward TVM 0.26 development.46

Meta shipped PyTorch v2.12.1 as a bug-fix release for regressions and silent correctness issues, including B200 FlashAttention nondeterminism and B100/B200 Triton convolution backward illegal-memory-access fixes. ExecuTorch did not publish a release, but it carried a large backend and edge-inference week in the same org.75

Google shipped AI Edge Gallery 1.0.16 and LiteRT-LM v0.14.0-alpha.0. The release theme was on-device LLM management and broader testing: Gallery added Hugging Face LiteRT-LM import and management, while LiteRT-LM published an alpha for wider validation. AI Edge Gallery release notes.16

Modular shipped MAX 26.4 and Mojo 1.0.0b2, adding broader Apple silicon GPU serving, Hunyuan Hy3-preview, LiquidAI LFM2, GLM FP8/NVFP4, and DeepSeek V2/V3 serving improvements. The release also framed Modular as a more important edge story given Qualcomm’s acquisition move.8

mudler shipped LocalAI v4.5.0, a large perception, realtime, distributed-serving, backend-packaging, and UI release. New depth, sound-event, TTS, and privacy-filter backends joined speaker-aware realtime sessions, compaction, model aliases, ASR timestamps, Vulkan packaging, and distributed staging fixes.35

InternLM shipped lmdeploy v0.14.0, centered on FP8 KV-cache quantization, Qwen Omni support, Qwen ViT inference in TurboMind, and an OpenAI Responses-compatible endpoint. The surrounding work fixed SSM prefix-cache eviction, tool-call parser caching, FA3 prefill padding, decode KV accounting, and cache observability.76

ROCm shipped AITER v0.1.15.post2, v0.1.16, v0.1.16.post1, v0.1.15.post3, and ATOM v0.1.5. The release arc paired AITER kernel updates with ATOM serving-stack updates for DeepSeek, MiniMax-M3, MLA, MoE, FlyDSL, MXFP8, manylinux compatibility, and ROCm serving images. AITER release notes.29

Dao-AILab shipped flash-attention fa4-v4.0.0.beta19. The release focused on SM100 head-dim correctness through a contiguous workaround and sparse MLA backward kernels for DeepSeek-style attention, with a stride-aware follow-up still under review.77

FlashInfer shipped v0.6.13 plus nightly builds through the week. The release focused on CI and OOM test isolation, autotuner file-cache differentiation, and Blackwell GQA decode integration, while the code stream added BF16 by FP4 GEMM, FP4 tactics, GDN, and MoE fixes.78

BerriAI shipped LiteLLM v1.86.7 plus development, release-candidate, and maintenance releases across the week. The common thread was enterprise hardening: signed Docker image verification, Rust/OCR foundations, realtime translation, code-interpreter sandboxing, MCP access controls, cost correctness, and provider expansion.79

k2-fsa shipped sherpa-onnx asr-models-qnn-2 and asr-models-qnn-binary-2. The releases support the week’s Qualcomm QNN push for Whisper and Parakeet export and runtime paths on NPUs. Model release notes.80

Qualcomm shipped ai-hub-models v0.56.0. The release updated model data and support for Detectron2, BERT, WaveLM, VideoMAE, RTMDet quantization, Qwen VL metadata, and PiperTTS language descriptions, while the repo also expanded GenieX, LLM catalog, and device discovery work.9

triton-lang shipped Triton v3.7.1 as a patch release for regressions, including async shared-memory dependency handling in FenceAsync. The release sat inside a broader compiler week covering nanobind, CUDA graph lifetime, ROCm Inductor fixes, gfx1250 TDM, and Blackwell low-precision matmul.49

osaurus-ai shipped six Osaurus releases from 0.20.4 through 0.20.9. The cadence covered Computer Use scorecards, agent channels, screenshots, proxy controls, runtime model fixes, localization, MCP state persistence, capabilities loading, temperature handling, and macOS agent UX. Latest release notes.57

Qualcomm Swallows Modular As Inference Splinters