awq

Here are 50 public repositories matching this topic...

intel / neural-compressor

SOTA low-bit LLM quantization (INT8/FP8/MXFP8/INT4/MXFP4/NVFP4) & sparsity; leading model compression techniques on PyTorch, TensorFlow, and ONNX Runtime

sparsity pruning quantization knowledge-distillation auto-tuning int8 low-precision quantization-aware-training post-training-quantization awq int4 large-language-models gptq smoothquant sparsegpt fp4 mxformat

Updated Jul 2, 2026
Python

ModelTC / LightCompress

[EMNLP 2024 & AAAI 2026] A powerful toolkit for compressing large models including LLMs, VLMs, and video generative models.

benchmark deployment tool evaluation pruning quantization wan awq large-language-models llm token-pruning vllm smoothquant token-reduction mixtral internlm2 token-merging deepseek-v3

Updated May 14, 2026
Python

Sandermage / sndr_core_engine

SNDR Core Engine (Genesis): vLLM runtime patch-overlay for Qwen3.6 + Gemma4 on consumer NVIDIA (Ampere sm_86, 2× A5000/3090). Qwen3.6-35B-A3B FP8 ~240 tok/s, 27B-int4 hybrid GDN+Mamba, Gemma4 26B/31B AWQ, 256K ctx. 321 patches: TurboQuant k8v4 KV, MTP/DFlash spec-decode, FULL cudagraph, hybrid GDN. vLLM pin dev424 + Control Center GUI.

Updated Jul 5, 2026
Python

hec-ovi / vllm-awq4-qwen

vLLM Qwen 3.6-27B (AWQ-INT4) + DFlash speculative decoding on AMD Strix Halo (gfx1151 iGPU, 128 GB UMA, ROCm 7.13). 24.8 t/s single-stream, vision, tool calling, 256K context, OpenAI-compatible, Docker. Matches DGX Spark FP8+DFlash+MTP at a third of the cost. No CUDA.

docker rocm openai-api awq vllm llm-inference speculative-decoding multimodal-llm qwen3 gfx1151 ryzen-ai-max dflash amd-strix-halo rdna35 27b

Updated May 10, 2026
Python

ncoder-ai / VibeVoice-FastAPI

FastAPI wrapper around the original VibeVoice 1.5B and 7B models, with support for AWQ4 quantization.

tts-api fastapi awq vibevoice-microsoft vibevoice-large

Updated Jun 22, 2026
Python

aivrar / vllm-windows-build

Native Windows build of vLLM 0.24.0 - no WSL, no Docker. Python 3.13 + CUDA 12.8 + PyTorch 2.11 cu128 for RTX 30/40/50-series, pre-built wheel, Windows patchset, 10 KV-cache compression dtypes, OpenAI API server fixes, Rust frontend, and Rust tool parser support.

Updated Jul 3, 2026
Python

hcd233 / Aris-AI-Model-Server

An OpenAI-compatible API that integrates LLM, Embedding, and Reranker.

ai embedding mlx reranker rag fastapi sentence-transformers awq llm vllm gptq openai-compatible-api

Updated Aug 21, 2025
Python

AEON-7 / supergemma4-26b-abliterated-multimodal-nvfp4

NVFP4 AWQ Full quantization of SuperGemma4-26B-Abliterated-Multimodal for Blackwell GPUs — pre-built vLLM container + patches included

moe quantization multimodal blackwell awq llm vllm nvfp4 dgx-spark gemma4 modelopt

Updated May 1, 2026
Python

BoundlessWindMoon / minivllm

A light, transparent, and modular inference & quantization engine for studying LLMs.

framework inference awq multi-backends quantum-kernel cuda-graph megakernel

Updated Jun 4, 2026
Cuda

ShipItAndPray / turboquant

Compress Any LLM Up to 6x in One Command. Unified CLI for GGUF, GPTQ, and AWQ quantization.

quantization model-compression awq llm llama-cpp vllm gptq ollama gguf

Updated Mar 25, 2026
Python

harleyszhang / harleyszhang.github.io

🧗‍♂️ harleyszhang's personal blog

blog awq llm llm-inference

Updated May 10, 2026
HTML

duchengyao / big-vllm

Originally named nano, but it turned out Qwen3.5 would not fit, so it was renamed big.

python cuda quantization awq vllm gptq llm-inference qwen llm-compressor

Updated May 6, 2026
Python

mtecnic / research-test-Qwen3-Coder-Next-REAP-AWQ

Research Test: REAP expert pruning + AWQ quantization of Qwen3-Coder-Next MoE model

python machine-learning research ai deep-learning optimization transformers moe pruning quantization model-compression mixture-of-experts awq llm

Updated Apr 4, 2026
Python

GURPREETKAURJETHRA / Quantize-LLM-using-AWQ

Quantize LLM using AWQ

quantize awq large-language-models llms generative-ai llm-training

Updated Apr 26, 2024
Jupyter Notebook

stef41 / quantbenchx

Quantization quality analyzer - benchmark GGUF/GPTQ/AWQ quantization accuracy.

python benchmarking quantization awq llm gptq gguf

Updated Apr 11, 2026
Python

neosun100 / kimi-linear-vllm-docker-serve

Dockerized vLLM serving for Kimi-Linear-48B-A3B (AWQ-4bit), from 128K to 1M context.

docker awq long-context llm-serving vllm kimi-linear

Updated Jun 29, 2026
Python

psunlpgroup / Compression-Effects

[ICLR 2026] When Reasoning Meets Compression: Understanding the Effects of LLMs Compression on Large Reasoning Models. Supports interpretation of Qwen, Llama, etc.

pruning quantization distillation awq llm mechanistic-interpretability gptq llm-compression

Updated May 6, 2026
Python

lpalbou / model-quantizer

Effortlessly quantize, benchmark, and publish Hugging Face models with cross-platform support for CPU/GPU. Reduce model size by 75% while maintaining performance.

python nlp machine-learning cross-platform optimization transformers inference pytorch quantization model-compression huggingface awq llm gptq bitsandbytes cpu-compatible

Updated Mar 15, 2025
Python

aphroditeformal93 / vllm-awq4-qwen

Run Qwen 3.6-27B AWQ-INT4 models with DFlash speculative decoding on AMD Strix Halo hardware using vLLM for high-throughput inference.

docker rocm openai-api awq vllm llm-inference speculative-decoding multimodal-llm qwen3 gfx1151 ryzen-ai-max dflash amd-strix-halo rdna35 27b

Updated Jul 5, 2026
Python

MayurVijayPatil / amd-llm-rocm

White paper & reproducible benchmark suite for LLM inference optimization on AMD MI300X using ROCm 6.1

benchmark amd hip quantization rocm awq vllm llm-inference mi300x flashattention

Updated Apr 17, 2026
Jupyter Notebook
