SOTA low-bit LLM quantization (INT8/FP8/MXFP8/INT4/MXFP4/NVFP4) & sparsity; leading model compression techniques on PyTorch, TensorFlow, and ONNX Runtime
Updated Jul 2, 2026 - Python
[EMNLP 2024 & AAAI 2026] A powerful toolkit for compressing large models including LLMs, VLMs, and video generative models.
SNDR Core Engine (Genesis) — vLLM runtime patch-overlay for Qwen3.6 + Gemma4 on consumer NVIDIA (Ampere sm_86, 2× A5000/3090). Qwen3.6-35B-A3B FP8 ~240 tok/s, 27B-int4 hybrid GDN+Mamba, Gemma4 26B/31B AWQ, 256K ctx. 321 patches: TurboQuant k8v4 KV, MTP/DFlash spec-decode, FULL cudagraph, hybrid GDN. vLLM pin dev424 + Control Center GUI.
vLLM Qwen 3.6-27B (AWQ-INT4) + DFlash speculative decoding on AMD Strix Halo (gfx1151 iGPU, 128 GB UMA, ROCm 7.13). 24.8 t/s single-stream, vision, tool calling, 256K context, OpenAI-compatible, Docker. Matches DGX Spark FP8+DFlash+MTP at a third of the cost. No CUDA.
FastAPI wrapper around the original VibeVoice 1.5B and 7B models, with support for AWQ 4-bit quantization
Native Windows build of vLLM 0.24.0 - no WSL, no Docker. Python 3.13 + CUDA 12.8 + PyTorch 2.11 cu128 for RTX 30/40/50-series, pre-built wheel, Windows patchset, 10 KV-cache compression dtypes, OpenAI API server fixes, Rust frontend, and Rust tool parser support.
A light, transparent, and modular inference & quantization engine for studying LLMs.
Compress Any LLM Up to 6x in One Command. Unified CLI for GGUF, GPTQ, and AWQ quantization.
Originally called nano, but it turned out it couldn't fit Qwen3.5, so it was renamed big.
Research Test: REAP expert pruning + AWQ quantization of Qwen3-Coder-Next MoE model
Quantize LLMs using AWQ
Quantization quality analyzer - benchmark GGUF/GPTQ/AWQ quantization accuracy.
Dockerized vLLM serving for Kimi-Linear-48B-A3B (AWQ-4bit), from 128K to 1M context.
[ICLR 2026] When Reasoning Meets Compression: Understanding the Effects of LLMs Compression on Large Reasoning Models. Supports interpretation of Qwen, Llama, etc.
Effortlessly quantize, benchmark, and publish Hugging Face models with cross-platform support for CPU/GPU. Reduce model size by 75% while maintaining performance.
Run Qwen 3.6-27B AWQ-INT4 models with DFlash speculative decoding on AMD Strix Halo hardware using vLLM for high-throughput inference.
White paper & reproducible benchmark suite for LLM inference optimization on AMD MI300X using ROCm 6.1