NVIDIA's NVFP4 Boosts Llama Training by 73% on Blackwell GPUs

NVIDIA has introduced a new 4-bit precision training format, NVFP4, designed for its Blackwell GPU family. The company says the approach speeds up Llama model training by up to 73% with essentially no loss in accuracy, a gain that could change how large language models are built.

How NVFP4 works

Traditional training uses 16- or 32-bit floating-point numbers, which demand large amounts of memory and compute. NVFP4 cuts that to 4-bit precision while preserving the dynamic range needed for gradient updates. It does so by pairing each small block of 4-bit values with a shared, higher-precision scale factor, so the few available bits describe values relative to their neighbors rather than the range of the whole tensor. Low-bit quantization is usually applied after training and can degrade model quality; NVIDIA positions NVFP4 as suitable for the training phase itself, not just inference.
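To give a sense of scale, the sketch below compares the raw storage for one set of values at 32, 16, and 4 bits. The value count is a hypothetical figure chosen for illustration, and the calculation ignores scale factors, optimizer state, and any higher-precision master copies a real training run would keep.

```python
def tensor_bytes(num_values: int, bits_per_value: int) -> int:
    """Raw storage in bytes for num_values at the given bit width."""
    return num_values * bits_per_value // 8

# Hypothetical count of stored values (not a figure from NVIDIA).
NUM_VALUES = 8_000_000_000

for name, bits in [("32-bit", 32), ("16-bit", 16), ("4-bit", 4)]:
    gib = tensor_bytes(NUM_VALUES, bits) / 2**30
    print(f"{name}: {gib:.1f} GiB")
```

The 4-bit figure is one eighth of the 32-bit one and one quarter of the 16-bit one, before any bookkeeping overhead is added back.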

Blackwell GPUs include dedicated hardware units that accelerate NVFP4 operations. NVIDIA claims this hardware-software pairing is what makes the 73% throughput gain possible without retuning hyperparameters or altering model architectures.

Testing on Llama models

NVIDIA ran benchmarks using open-source Llama models of varying sizes. In those tests, NVFP4 delivered the claimed speedup while producing final loss values nearly identical to full-precision training. The company says this holds for both dense and sparse model variants, though it hasn't published detailed error analysis or cross-model comparisons beyond Llama.

At a 73% throughput gain, a training run that used to take a week would finish in about four days on the same number of Blackwell GPUs. That could lower the cost and energy required to develop large language models, a key concern as AI scales.
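How much time a "73%" improvement saves depends on how the number is read. The short sketch below works through both readings for a week-long run; the function names are illustrative, and the 0.73 figure is simply the headline number plugged in.

```python
def days_after_throughput_gain(baseline_days: float, gain: float) -> float:
    """Wall-clock days when throughput rises by `gain` (0.73 means +73%)."""
    return baseline_days / (1.0 + gain)

def days_after_time_reduction(baseline_days: float, reduction: float) -> float:
    """Wall-clock days when run time itself shrinks by `reduction`."""
    return baseline_days * (1.0 - reduction)

week = 7.0
print(round(days_after_throughput_gain(week, 0.73), 2))  # 4.05
print(round(days_after_time_reduction(week, 0.73), 2))   # 1.89
```

A 73% throughput gain is a 1.73x speedup, which shortens a week to roughly four days. Only a 73% cut in run time, a much stronger claim, would bring it under two days.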

Why accuracy stays intact

4-bit precision typically loses information, especially for small gradient values. NVFP4 counters this with fine-grained scaling rather than extra bits. Each value is a 4-bit float with one sign bit, two exponent bits, and one mantissa bit, and every block of 16 values shares an 8-bit floating-point scale factor, with a further 32-bit scale applied per tensor. Because the scale adapts to each block, small gradients are not flushed to zero just because a large value appears elsewhere in the tensor. The intended result is that the network sees nearly the same gradients as in 16-bit training and converges to a comparable loss.
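A minimal sketch of block-scaled 4-bit quantization can make the idea concrete. It assumes the eight magnitudes of a 4-bit float with two exponent bits and one mantissa bit, and 16-value blocks that each share one scale. For simplicity the scale is kept as an ordinary Python float rather than an 8-bit value, and the function names are illustrative rather than taken from any NVIDIA library.

```python
# Magnitudes representable by a 4-bit float with 2 exponent bits and
# 1 mantissa bit; the sign is handled separately.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # assumed number of values sharing one scale

def quantize_block(block):
    """Return (scale, codes); each code is a signed grid value."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    scale = amax / FP4_GRID[-1]  # largest magnitude maps onto 6.0
    codes = []
    for v in block:
        target = abs(v) / scale
        nearest = min(FP4_GRID, key=lambda g: abs(g - target))
        codes.append(nearest if v >= 0 else -nearest)
    return scale, codes

def dequantize_block(scale, codes):
    return [scale * c for c in codes]

def roundtrip(values, block_size=BLOCK_SIZE):
    """Quantize and dequantize `values` block by block."""
    out = []
    for i in range(0, len(values), block_size):
        scale, codes = quantize_block(values[i:i + block_size])
        out.extend(dequantize_block(scale, codes))
    return out

# Two blocks with very different magnitudes.
small = [1e-4 * (k + 1) for k in range(16)]
large = [0.5 * (k + 1) for k in range(16)]

per_block = roundtrip(small + large)                    # one scale per 16 values
single_scale = roundtrip(small + large, block_size=32)  # one scale for everything
```

With one scale per block, the small values survive the round trip at reduced resolution. With a single scale for all 32 values, every entry in the small block rounds to zero because the large block dictates the range.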

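NVIDIA also points to its use of stochastic rounding, which rounds each value up or down at random, with the probability of each direction set by how close the value is to the two nearest representable numbers. Rounding errors then tend to average out over many steps instead of accumulating as a systematic bias, which helps offset the reduced bit width.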
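The effect of stochastic rounding is easiest to see on a value that ordinary rounding would erase. The sketch below is a generic illustration on a uniform grid with a hypothetical step size, not NVIDIA's implementation, and the real 4-bit grid is not uniformly spaced.

```python
import math
import random

def stochastic_round(x: float, step: float, rng: random.Random) -> float:
    """Round x to a multiple of `step`, rounding up with probability equal
    to the fraction of the way x sits above the lower multiple."""
    scaled = x / step
    lower = math.floor(scaled)
    frac = scaled - lower
    return (lower + (1 if rng.random() < frac else 0)) * step

rng = random.Random(0)          # fixed seed for repeatability
x, step, n = 0.3, 1.0, 100_000  # hypothetical value and grid spacing

nearest = round(x / step) * step  # round-to-nearest always yields 0.0
stochastic_mean = sum(stochastic_round(x, step, rng) for _ in range(n)) / n
```

Round-to-nearest maps 0.3 to 0 every time, so the value never contributes. Stochastic rounding returns 1 about 30% of the time, so the average over many updates stays close to 0.3.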

The company hasn't said when NVFP4 will be broadly available on Blackwell GPUs or whether it will be open-sourced. For now, developers working with Llama models and Blackwell hardware can expect a significant speed boost, provided they are willing to trust the 4-bit path.
