Google's Multi-Token Prediction Drafters Speed Up Gemma 4 by 3x on Local Devices

Google has developed a technique called Multi-Token Prediction drafters that can make its Gemma 4 large language model run up to three times faster on local hardware — no new equipment or cloud connection needed. The company says the speed boost comes with no loss in output quality, a rare claim in the world of AI optimization.

Inside the new optimization method

The drafters work by predicting multiple tokens at once rather than one at a time, a departure from the standard autoregressive generation used by most LLMs. That parallelization lets the model produce text faster while using the same underlying hardware. Google says the technique is compatible with existing consumer devices, meaning users won't need to upgrade their computers or phones to see the benefit.
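Google hasn't published implementation details, but drafter-style decoding is commonly built on a draft-then-verify loop: a cheap drafter guesses several tokens ahead, and the full model checks all of them in a single parallel pass, keeping the accepted prefix. The sketch below is a toy illustration of that loop under those assumptions; the `draft` and `verify` rules are hypothetical stand-ins, not Gemma 4's actual drafter or model.

```python
# Toy sketch of draft-then-verify decoding. Tokens are plain integers and
# both "models" are deterministic toy rules, purely to show the control flow.

def draft(context, k):
    """Cheap drafter: guesses the next k tokens in one shot.
    Toy rule: count upward from the last token."""
    last = context[-1]
    return [last + i + 1 for i in range(k)]

def verify(context, proposed):
    """Full-model pass over all proposed tokens at once (one parallel step).
    Toy rule: the 'true' next token is always last + 1."""
    accepted = []
    cur = context[-1]
    for tok in proposed:
        target = cur + 1              # what the full model would emit here
        if tok == target:
            accepted.append(tok)      # drafter guessed correctly: keep it
            cur = tok
        else:
            accepted.append(target)   # first mismatch: take the model's token
            break                     # and discard the rest of the draft
    return accepted

def generate(context, n_tokens, k=4):
    """Generate n_tokens, checking k drafted tokens per full-model pass."""
    out = list(context)
    passes = 0
    while len(out) - len(context) < n_tokens:
        accepted = verify(out, draft(out, k))
        out.extend(accepted)
        passes += 1
    return out[len(context):][:n_tokens], passes

tokens, passes = generate([0], 8, k=4)
print(tokens, passes)  # 8 tokens produced in 2 full-model passes, not 8
```

Because every accepted token is exactly what the full model would have produced one at a time, the output is unchanged; the speed-up comes from how often the drafter's guesses are accepted per verification pass.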

Performance without compromise

Typical speed-ups involve trade-offs — reducing the number of parameters, using lower-precision arithmetic, or pruning less important connections. Those methods often degrade accuracy or coherence. Google insists its Multi-Token Prediction drafters maintain the same quality as the standard Gemma 4 even with the threefold speed increase. Independent verification is still pending, but the company's internal benchmarks show no significant difference in output.
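For a concrete sense of why lower-precision arithmetic can cost accuracy, here is a generic sketch of symmetric int8 quantization, one of the trade-off methods mentioned above. It is an illustration of the general technique, not how Gemma 4 or any Google tool quantizes weights.

```python
# Generic symmetric int8 quantization: save memory by storing weights as
# 8-bit integers plus one shared scale, at the cost of rounding error.

def quantize_int8(weights):
    """Map floats to int8 values with a single shared scale."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(qweights, scale):
    return [q * scale for q in qweights]

weights = [0.023, -0.517, 1.27, -0.999]
q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)
errors = [abs(a - b) for a, b in zip(weights, recovered)]
print(max(errors))  # small but nonzero: precision is lost
```

Each weight drifts slightly after the round trip; across billions of parameters, such drift is what can degrade accuracy or coherence, which is why a speed-up claimed to be lossless is notable.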

What faster local inference means

Running large models locally has long been a challenge because of the computational demands. Cloud-based AI services handle the heavy lifting, but they require a stable internet connection and raise privacy concerns. Faster local inference could change that. Developers building applications on Gemma 4 might be able to offer real-time responses without sending user data to remote servers. The drafters could also reduce power consumption, though Google hasn't released specific figures on that front.

The technique is part of Google's broader effort to make its models more efficient on consumer hardware. Whether developers adopt it widely will depend on how easily the drafters can be integrated into existing workflows. Google hasn't announced a release date for the Multi-Token Prediction drafters, but the company is expected to share more details at its upcoming developer conference.