Researchers bake 3x inference speedups directly into LLM weights without speculative decoding
A team from the University of Maryland, Lawrence Livermore, Columbia, and TogetherAI developed a method to bake multi-token prediction into model weights, achieving 3x throughput gains with a single special token added to the existing architecture. Unlike speculative decoding, it requires no separate drafting model or additional infrastructure.
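To make the decoding-loop arithmetic concrete, here is a toy sketch of the idea as described: appending a special token lets one forward pass emit several future tokens instead of one, so the same model needs roughly K times fewer passes. The `toy_forward` function, the `<mtp>` token name, and K=3 are all illustrative assumptions, not the paper's actual implementation.

```python
MTP = "<mtp>"  # hypothetical special token (name assumed for illustration)
K = 3          # tokens predicted per forward pass (assumed, to mirror the ~3x claim)

def toy_forward(tokens):
    """Stand-in for one LLM forward pass (a counter, not a real model).

    Without the special token it returns one next token; with the special
    token appended, it returns K next tokens in a single call.
    """
    base = len([t for t in tokens if t != MTP])
    if tokens and tokens[-1] == MTP:
        return [f"tok{base + i}" for i in range(K)]  # multi-token prediction
    return [f"tok{base}"]                            # standard one-at-a-time decoding

def generate(prompt, n_new, use_mtp):
    """Decode n_new tokens, counting forward passes."""
    tokens, passes = list(prompt), 0
    while len(tokens) - len(prompt) < n_new:
        inp = tokens + [MTP] if use_mtp else tokens
        out = toy_forward(inp)
        passes += 1
        remaining = n_new - (len(tokens) - len(prompt))
        tokens.extend(out[:remaining])
    return tokens, passes

prompt = ["tok0", "tok1"]
seq_std, p_std = generate(prompt, 9, use_mtp=False)  # 9 passes
seq_mtp, p_mtp = generate(prompt, 9, use_mtp=True)   # 3 passes: ~3x fewer
```

The point of the sketch is only the accounting: identical output tokens, one third the forward passes, with no second drafting model involved.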
The approach addresses a critical bottleneck for agentic AI workflows, where reasoning models generate thousands of chain-of-thought tokens before producing a final response. A co-author said latency is becoming as important as raw throughput as ultra-long thinking traces become the norm.
View full digest for February 24, 2026