New DiffusionGemma Model Generates Text at 1,000 Tokens Per Second

A new approach to text generation promises to change how fast AI can write. Called DiffusionGemma, the model can generate text in parallel, hitting speeds up to 1,000 tokens per second on NVIDIA GPUs. That's far faster than most current language models, which produce text one token at a time.

The engineering behind the speed

Typical large language models are autoregressive: they predict each new token from all of the tokens that came before it, so the model has to run once for every token it emits. That sequential process is a bottleneck. DiffusionGemma sidesteps it. Instead, it generates whole blocks of text simultaneously, using a diffusion technique originally developed for image generation. The result: a single GPU can spit out up to a thousand tokens in a second.
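To make the difference concrete, here is a toy sketch that counts model calls under each approach. It is not DiffusionGemma's algorithm, which hasn't been published. It assumes a generic "iterative unmasking" style of parallel decoding, and every name in it (toy_model, decode_block_parallel, decode_autoregressive) is hypothetical.

```python
import random

MASK = None  # marks a position whose token has not been decided yet


def toy_model(block, rng):
    """Stand-in for a neural network (hypothetical).

    In a single call it proposes a (token, confidence) pair for every
    position in the block that is still masked.
    """
    return {i: (f"tok{i}", rng.random()) for i, t in enumerate(block) if t is MASK}


def decode_block_parallel(block_size, steps, seed=0):
    """Fill a whole block over a small number of refinement steps."""
    rng = random.Random(seed)
    block = [MASK] * block_size
    per_step = -(-block_size // steps)  # ceiling division
    calls = 0
    while any(t is MASK for t in block):
        proposals = toy_model(block, rng)
        calls += 1
        # Commit the most confident proposals; the rest stay masked
        # and are reconsidered on the next step.
        best = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in best[:per_step]:
            block[i] = proposals[i][0]
    return block, calls


def decode_autoregressive(num_tokens):
    """One model call per token, strictly left to right."""
    tokens, calls = [], 0
    for i in range(num_tokens):
        tokens.append(f"tok{i}")  # stand-in for a single next-token prediction
        calls += 1
    return tokens, calls


if __name__ == "__main__":
    _, parallel_calls = decode_block_parallel(block_size=32, steps=8)
    _, sequential_calls = decode_autoregressive(32)
    print(parallel_calls, sequential_calls)  # prints "8 32"
```

In this toy setup a 32-token block takes 8 model calls instead of 32. Real systems differ in how many refinement steps they need and in how expensive each call is, so the ratio here is illustrative only.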

The researchers who built the model haven't released full technical details yet. But the name is a hint — it combines "diffusion" with "Gemma", a family of open models from Google. DiffusionGemma likely adapts that architecture for parallel decoding.

For anyone building real-time AI apps such as chatbots, code assistants, and live translation tools, latency is a constant problem. Delays of even a few seconds can ruin the user experience. DiffusionGemma's speed could cut that wait to a fraction of a second. Developers could also run the model on their own NVIDIA hardware rather than relying on cloud APIs, cutting costs and improving privacy.
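A quick back-of-the-envelope calculation shows what that means for a single reply. The 500-token reply length and the 50 tokens-per-second baseline are assumptions chosen for illustration, not measured figures; only the 1,000 tokens-per-second peak comes from the announcement.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to generate a reply at a steady throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    reply_tokens = 500  # assumed reply length
    rates = [
        ("assumed autoregressive baseline", 50.0),  # illustrative, not measured
        ("reported DiffusionGemma peak", 1000.0),
    ]
    for label, rate in rates:
        print(f"{label}: {generation_seconds(reply_tokens, rate):.2f} s")
    # assumed autoregressive baseline: 10.00 s
    # reported DiffusionGemma peak: 0.50 s
```

The calculation ignores network overhead and the time to process the prompt, so real-world latency would be somewhat higher in both cases.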

It's not all upside. Parallel generation often trades quality for speed. The model may produce less coherent or less accurate text than slower, autoregressive alternatives. The team hasn't shared benchmarks on accuracy or safety.

Hardware requirements

The speed claim comes with a catch: you need an NVIDIA GPU. The model's architecture is optimized for those chips. That limits deployment in some settings, like phones or older servers. But for data-center use or high-end workstations, it's a clear path to faster inference.

Whether the model will be open source or available under license hasn't been announced. If it follows the Gemma pattern, its weights could be openly released, which would let the community test and tweak it.

The 1,000-token figure is a peak — actual speeds will vary depending on batch size, sequence length, and GPU model. Still, any move toward parallel text generation is a notable shift in the AI landscape.
