Google's DiffusionGemma AI Delivers 1,000 Tokens per Second, but Most PCs Can't Run It

Google has released a new AI model called DiffusionGemma that cranks out text at 1,000 tokens per second. That speed comes from a design that skips the usual word-by-word generation routine. But here's the catch: the model is free to use, yet it won't run on most consumer hardware.

A different generation method

Most large language models predict one token at a time, chaining them together. DiffusionGemma instead generates a batch of tokens in parallel. The process is closer to how diffusion models create images—starting with noise and refining it into coherent text. That parallel approach is what lets it hit the 1,000 tokens-per-second mark, a number that dwarfs the typical output from GPT-style models running on similar hardware.
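The difference in sequential work can be illustrated with a toy Python sketch. It is not DiffusionGemma's actual algorithm: the function names, the `_` mask symbol, and the `tokens_per_step` parameter are hypothetical, and the stand-in "model" simply copies from a known target instead of predicting anything. It only shows why filling several positions per pass takes fewer sequential passes than emitting one token at a time.

```python
import random

MASK = "_"  # hypothetical placeholder standing in for "noise"


def autoregressive_decode(target):
    """Emit one token per step; sequential steps equal the sequence length."""
    out, steps = [], 0
    for tok in target:  # each iteration stands in for one model call
        out.append(tok)
        steps += 1
    return out, steps


def parallel_refine_decode(target, tokens_per_step=4, seed=0):
    """Start fully masked and fill several positions in each pass."""
    rng = random.Random(seed)
    out = [MASK] * len(target)
    masked = list(range(len(target)))
    steps = 0
    while masked:
        rng.shuffle(masked)  # positions are resolved in no fixed order
        chosen, masked = masked[:tokens_per_step], masked[tokens_per_step:]
        for i in chosen:  # toy "model": copy the known answer into place
            out[i] = target[i]
        steps += 1  # one pass covers every chosen position at once
    return out, steps


target = "diffusion models refine many tokens in parallel passes".split()
_, ar_steps = autoregressive_decode(target)
_, par_steps = parallel_refine_decode(target, tokens_per_step=4)
print(ar_steps, par_steps)  # prints "8 2"
```

In this toy setup the eight-token sequence takes eight sequential passes one token at a time and two passes when four positions are filled per pass. A real model pays more per pass, so the actual speedup depends on the architecture and the hardware.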

Google hasn't said exactly which GPUs it used to hit that speed, but the company clearly designed the model for data-center-grade accelerators. The architecture requires a lot of memory and compute bandwidth—things most laptops and even some desktop workstations lack.

Free to use, but not free to run

DiffusionGemma is open for anyone to download and tinker with. Google released it under a permissive license, no strings attached. That's a departure from some of the company's more restricted models. But the hardware requirement creates a practical barrier. If your machine doesn't have a high-end NVIDIA or AMD GPU with enough VRAM, the full model won't run on it.

Developers can try running a quantized version on consumer cards, but performance takes a hit. The full-speed experience stays locked inside cloud instances or specialized rigs.
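To get a rough sense of why quantization helps a model fit on a consumer card, the back-of-the-envelope Python sketch below estimates the memory needed for the weights alone at different bit widths. The parameter count is a placeholder chosen for illustration, not a figure Google has published for DiffusionGemma, and the estimate ignores activations, caches, and runtime overhead.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate memory for the model weights alone, in GiB."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 2**30


# Assumption: an illustrative parameter count, not a published figure.
HYPOTHETICAL_PARAMS = 8_000_000_000

for bits in (16, 8, 4):
    gib = weight_memory_gib(HYPOTHETICAL_PARAMS, bits)
    print(f"{bits:>2}-bit weights: {gib:.1f} GiB")
```

Under these assumptions, 16-bit weights need about 14.9 GiB and 4-bit weights about 3.7 GiB. The smaller footprint is what makes consumer cards an option, at the cost of the performance hit described above.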

Where the speed matters

Real-time applications such as live captioning, interactive chatbots, or on-the-fly translation could benefit from the fast inference. A model that produces 1,000 tokens a second far outpaces human speech, which typically runs at two to three words per second. That's a big jump over current models, which often lag behind in conversational settings.
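The gap between that throughput and speech can be put into numbers with a short Python calculation. The speaking rate (150 words per minute) and the tokens-per-word ratio (1.3) are rough assumptions for English text, not figures from Google, and the function names are made up for this example.

```python
def speech_tokens_per_second(words_per_minute=150, tokens_per_word=1.3):
    """Convert a speaking rate into an approximate token rate."""
    return words_per_minute / 60 * tokens_per_word


def headroom(model_tokens_per_second, words_per_minute=150, tokens_per_word=1.3):
    """How many times faster the model generates than a person speaks."""
    return model_tokens_per_second / speech_tokens_per_second(
        words_per_minute, tokens_per_word
    )


print(round(headroom(1000)))  # prints 308 under the assumptions above
```

Under these assumptions, raw generation speed is not the bottleneck for speech-paced tasks. Other stages of a pipeline, such as audio capture or network round trips, are not modeled here.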

Still, the practical use cases will depend on how Google or third parties package the model. If the only way to get the full speed is through a cloud API, then the free model becomes a vehicle for renting compute time.

Google hasn't announced a hosted version yet. Developers are left to figure out whether they can get the model running on their own hardware or whether they need to spin up expensive cloud instances.