Quantization · 2026

OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

Lab-led project · INT2 KV-cache serving · SGLang + Triton

Abstract

OSCAR applies attention-aware offline rotations and clipping to compress the long-history KV cache to 2.28 bits per element on Qwen3-4B/8B/32B and GLM-4.7-FP8. This yields roughly an 8× reduction in KV-cache memory and up to about 7× higher throughput at large batch sizes, with mean accuracy gaps to BF16 of 1.42, -0.02, and +0.27 on the 8B, 32B, and 358B models, respectively. The release includes an INT2 KV-cache serving path for SGLang built on Triton kernels.
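The abstract names the general recipe (an offline rotation fitted to calibration statistics, clipping, then 2-bit quantization of the cache) without detailing it. The sketch below illustrates that recipe under simplifying assumptions: the rotation is a plain PCA basis of the calibration covariance, clipping bounds are per-channel percentiles fitted offline, quantization is asymmetric with one scale and offset per channel, and the keys are synthetic. These choices and all function names (`fit_pca_rotation`, `fit_clip_range`, `quantize_int2`, `dequantize_int2`) are hypothetical; this is not OSCAR's attention-aware objective or its SGLang/Triton implementation. It also checks the property that makes an offline rotation compatible with attention: rotating queries and keys by the same orthogonal matrix leaves their dot products unchanged.

```python
import numpy as np


def fit_pca_rotation(calib: np.ndarray) -> np.ndarray:
    """Orthogonal (d, d) matrix whose columns are eigenvectors of the calibration
    covariance, ordered by decreasing eigenvalue (a plain PCA basis)."""
    centered = calib - calib.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / max(calib.shape[0] - 1, 1)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    return eigvecs[:, np.argsort(eigvals)[::-1]]


def fit_clip_range(calib_rotated: np.ndarray, clip_pct: float = 99.0):
    """Per-channel clipping bounds taken from calibration percentiles (offline)."""
    lo = np.percentile(calib_rotated, 100.0 - clip_pct, axis=0)
    hi = np.percentile(calib_rotated, clip_pct, axis=0)
    return lo, hi


def quantize_int2(x: np.ndarray, lo: np.ndarray, hi: np.ndarray):
    """Asymmetric 2-bit quantization: clip each channel to [lo, hi] and map it
    onto the 4 levels {0, 1, 2, 3}. Codes are stored one per uint8 for clarity."""
    scale = np.maximum(hi - lo, 1e-8) / 3.0  # 4 levels -> 3 steps
    codes = np.clip(np.round((x - lo) / scale), 0, 3).astype(np.uint8)
    return codes, scale


def dequantize_int2(codes: np.ndarray, scale: np.ndarray, lo: np.ndarray) -> np.ndarray:
    return codes.astype(np.float64) * scale + lo


rng = np.random.default_rng(0)
num_tokens, head_dim = 2048, 64

# Synthetic stand-in for cached keys (not real model activations): anisotropic
# latent factors mixed by a random orthogonal matrix.
mix, _ = np.linalg.qr(rng.normal(size=(head_dim, head_dim)))
latent = rng.normal(size=(num_tokens, head_dim)) * np.linspace(4.0, 0.2, head_dim)
keys = latent @ mix.T

# Offline: fit the rotation and clipping bounds once on a calibration slice.
calib = keys[:1024]
rotation = fit_pca_rotation(calib)
lo, hi = fit_clip_range(calib @ rotation)

# Online: rotate, quantize to 2 bits, then dequantize and rotate back.
rotated = keys @ rotation
codes, scale = quantize_int2(rotated, lo, hi)
keys_hat = dequantize_int2(codes, scale, lo) @ rotation.T

# Rotating the query by the same orthogonal matrix preserves q . k, so attention
# logits can be computed directly against the rotated keys.
query = rng.normal(size=head_dim)
logits = keys @ query
logits_rotated = rotated @ (query @ rotation)

rel_err = np.linalg.norm(keys - keys_hat) / np.linalg.norm(keys)
print(f"relative reconstruction error: {rel_err:.3f}")
```

Because the pipeline relies only on the rotation being orthogonal, `fit_pca_rotation` could be swapped for another orthogonal matrix (for example a randomized Hadamard transform) without changing the other steps. A real INT2 cache would pack four 2-bit codes per byte rather than storing one per `uint8`.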

Authors

Zhongzhu Zhou*, Donglin Zhuang*, Jisen Li, Ziyan Chen, Shuaiwen Leon Song, Ben Athiwaratkun, Xiaoxia Wu
* Equal contribution

Links

Paper · Code · Project page · Hugging Face