Frontier Research Lab
We study the algorithms and systems that make large-scale machine learning more efficient — spanning low-bit quantization, KV-cache compression, attention architectures, and hardware-aware ML systems.
Welcome to the Future Machine Learning & Systems (FutureMLS) Lab. We work at the intersection of machine learning and computer systems. Today's foundation models are remarkably capable but costly to train and serve. Our mission is to close the gap between rapidly growing model capability and the real-world cost of deploying these models.
We pursue algorithm–system co-design: quantization and compression methods that preserve accuracy, attention mechanisms that shrink memory footprints, and runtimes that make deployment practical. Our research spans the AI stack — from model compression and KV-cache design to GPU kernels and serving systems — and our work is open-source, reproducible, and built to be used.
Zhongzhu Zhou (Charlie Zhou) is the founder and principal investigator of the Future Machine Learning & Systems Lab. He is a Senior Research Scientist on the Turbo Team at Together AI, and earned his Ph.D. at the School of Computer Science, University of Sydney, advised by Prof. Shuaiwen Leon Song.
His research spans efficient machine learning and systems, from pretraining quality to efficient algorithms and algorithm–system co-design. This work bridges emerging ML/LLM methods and real-world applications, improving both productivity (usable, robust stacks) and performance (throughput, memory, and cost-efficiency). He received his B.Eng. (Hons) from Sun Yat-sen University and has interned at Dolby, the DeepSpeed team at Microsoft, and Tencent.
He leads projects across the lab's four themes, including OSCAR (2-bit KV-cache quantization), CARE (covariance-aware Multi-Head Latent Attention), and RUD (an automatic run–evaluate–refine loop).
Four directions, one goal: efficient and capable AI at scale.
Algorithmic methods that cut compute and memory while preserving accuracy — sparsity, low-rank structure, and efficient training & inference for large models.
Systems that make efficient algorithms practical end-to-end — GPU kernels, distributed runtimes, KV-cache design, and high-throughput serving.
Low-bit weight, activation, and KV-cache quantization: spectral and covariance-aware methods that preserve accuracy at bit-widths as low as 2 bits.
Model and architecture design — attention mechanisms, latent representations, and architectures that trade memory for capability gracefully.
OSCAR crosses 300 ★ on GitHub in its first week — thanks to the open-source community.
OSCAR released — 2-bit KV-cache serving at 2.28 effective bits/element with near-BF16 accuracy on Qwen3 and GLM-4.7.
RUD (Repeat Until Done) is open-sourced — an automatic run–evaluate–refine loop for robust, reproducible ML workflows.
CARE presented at ICLR 2026: covariance-aware, rank-enhanced decomposition for Multi-Head Latent Attention.
A small group of researchers and advisors building in the open.
Selected projects from the lab — open the preview page for details, papers, and code.
OSCAR · 2-bit KV-cache quantization · 2026
Attention-aware offline rotations compress the KV cache to 2.28 bits/element — ~8× memory reduction and up to ~7× higher throughput with near-BF16 accuracy.
View project →
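An "effective" bit-width slightly above the nominal one, such as the 2.28 bits/element quoted above, is what one would typically expect when low-bit codes are stored alongside shared metadata. The sketch below illustrates that arithmetic under a generic group-wise layout. It is an assumption for illustration only, not OSCAR's actual storage format, and the function names, group size, and model shape are all hypothetical.

```python
def effective_bits(nominal_bits: int, group_size: int, metadata_bits_per_group: int) -> float:
    """Average stored bits per element when each group shares some metadata.

    Hypothetical layout: every `group_size` elements store low-bit codes plus
    `metadata_bits_per_group` bits of shared metadata (e.g. a scale and a zero point).
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return nominal_bits + metadata_bits_per_group / group_size


def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bits_per_element: float) -> float:
    """Approximate KV-cache size in bytes for one sequence."""
    # The factor of 2 accounts for storing both keys and values.
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len
    return elements * bits_per_element / 8


if __name__ == "__main__":
    # Assumed example: 2-bit codes, groups of 128, one 16-bit scale + one 16-bit zero point.
    bits = effective_bits(nominal_bits=2, group_size=128, metadata_bits_per_group=32)
    print(f"effective bits/element: {bits}")

    # Assumed model shape, chosen only to make the numbers concrete.
    shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32_768)
    full = kv_cache_bytes(bits_per_element=16, **shape)
    packed = kv_cache_bytes(bits_per_element=bits, **shape)
    print(f"16-bit cache: {full / 2**30:.2f} GiB, quantized cache: {packed / 2**30:.2f} GiB")
```

The example parameters are not meant to reproduce the 2.28 figure. They only show how per-group overhead moves the average above the nominal bit-width and how that average feeds into cache size.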
CARE · Multi-Head Latent Attention · ICLR 2026
Covariance-aware low-rank decomposition that converts pretrained GQA/MHA into MLA — up to 215× lower one-shot perplexity at matched KV budgets.
View project →
Repeat Until Done · Open-source
A lightweight run–evaluate–refine loop that wraps any ML job and automatically retries until a success criterion is met — for robust, reproducible workflows.
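The run–evaluate–refine pattern described above can be sketched in a few lines. This is a minimal illustration of the general idea, not RUD's actual API: `repeat_until_done`, `LoopResult`, and the toy job below are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class LoopResult:
    succeeded: bool
    attempts: int
    output: Any


def repeat_until_done(run: Callable[[Any], Any],
                      evaluate: Callable[[Any], bool],
                      refine: Callable[[Any, Any], Any],
                      config: Any,
                      max_attempts: int = 5) -> LoopResult:
    """Run a job, check a success criterion, and refine the config on failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    output = None
    for attempt in range(1, max_attempts + 1):
        output = run(config)                   # run
        if evaluate(output):                   # evaluate
            return LoopResult(True, attempt, output)
        if attempt < max_attempts:
            config = refine(config, output)    # refine, then retry
    return LoopResult(False, max_attempts, output)


# Toy job (assumed for illustration): it "fails" whenever the batch size exceeds 16.
def toy_run(config: dict) -> dict:
    return {"ok": config["batch_size"] <= 16, "batch_size": config["batch_size"]}


def toy_evaluate(output: dict) -> bool:
    return output["ok"]


def toy_refine(config: dict, output: dict) -> dict:
    # Halve the batch size after each failed attempt.
    return {**config, "batch_size": config["batch_size"] // 2}


if __name__ == "__main__":
    result = repeat_until_done(toy_run, toy_evaluate, toy_refine, {"batch_size": 64})
    print(result)
```

Capping the loop with `max_attempts` keeps a job that can never meet its criterion from retrying forever, and returning the attempt count alongside the output makes each run easy to log and compare.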
View project →
We welcome collaborators, prospective students, and contributors who care about efficient, open machine learning and systems.