NVIDIA TensorRT Adds FP8 Quantization for Faster AI Inference

NVIDIA has added FP8 quantization support to its TensorRT software development kit, a move that speeds up AI inference and shrinks model sizes for large-scale deployment. The technique, which reduces the numerical precision of neural network calculations, is now available to developers using TensorRT to optimize their models on NVIDIA GPUs.

What FP8 Quantization Brings

Quantization maps a model's weights and activations, typically stored as 32-bit floating-point numbers, into a lower-precision format that uses fewer bits per value. FP8 uses just 8 bits, halving memory use compared to 16-bit formats and cutting it to a quarter of full 32-bit precision. That memory saving generally translates into faster inference: smaller data means less bandwidth spent moving it and, on hardware with native FP8 support, more operations completed per cycle.
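To make the arithmetic concrete, here is a minimal sketch of how weight storage scales with precision. The 7-billion-parameter model is a hypothetical example, and the figures cover weights only, not activations or runtime buffers.

```python
# Bytes needed to store one value in each format.
BYTES_PER_VALUE = {"fp32": 4, "fp16": 2, "fp8": 1}

def weight_footprint_gib(num_params: int, fmt: str) -> float:
    """Return the storage needed for num_params weights, in GiB."""
    return num_params * BYTES_PER_VALUE[fmt] / 2**30

if __name__ == "__main__":
    params = 7_000_000_000  # hypothetical 7B-parameter model
    for fmt in ("fp32", "fp16", "fp8"):
        print(f"{fmt}: {weight_footprint_gib(params, fmt):.1f} GiB")
```

The ratios are the point here: whatever the model size, FP8 weights take half the space of FP16 and a quarter of FP32.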

NVIDIA's TensorRT already supported INT8 (8-bit integer) quantization, but FP8 keeps a floating-point representation, whose exponent bits cover a wider dynamic range and can better preserve accuracy for certain layers and activation functions. The company describes the new option as a way to balance performance and model fidelity, especially for generative AI and large language models where even small precision losses can degrade output quality.
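The difference is easiest to see on values that span several orders of magnitude. The sketch below simulates both formats in plain Python. It assumes the E4M3 flavor of FP8 (1 sign bit, 4 exponent bits, 3 mantissa bits, largest finite value 448) and simple per-tensor scaling. The helper names are illustrative and this is not how TensorRT implements quantization internally.

```python
import math

E4M3_MAX = 448.0         # largest finite E4M3 magnitude (assumed format)
E4M3_MIN_EXP = -6        # exponent of the smallest normal E4M3 value
E4M3_MANTISSA_BITS = 3

def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)
    # frexp returns mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) == e - 1.
    # Clamping to the minimum exponent makes tiny values use the subnormal spacing.
    exp = max(math.frexp(mag)[1] - 1, E4M3_MIN_EXP)
    step = 2.0 ** (exp - E4M3_MANTISSA_BITS)  # gap between neighbouring values
    return sign * round(mag / step) * step

def fake_quant_fp8(x: float, amax: float) -> float:
    """Scale so amax maps to 448, round to E4M3, then scale back."""
    scale = amax / E4M3_MAX
    return round_to_e4m3(x / scale) * scale

def fake_quant_int8(x: float, amax: float) -> float:
    """Symmetric INT8: scale so amax maps to 127, round, clamp, scale back."""
    scale = amax / 127.0
    q = max(-127, min(127, round(x / scale)))
    return q * scale

if __name__ == "__main__":
    values = [0.01, 0.1, 1.0, 10.0, 100.0]  # made-up tensor with a wide range
    amax = max(abs(v) for v in values)
    for v in values:
        print(v, fake_quant_int8(v, amax), fake_quant_fp8(v, amax))
```

With a single scale for the whole tensor, the INT8 grid is evenly spaced, so the smallest entries round to zero. The FP8 grid is denser near zero, so those entries keep a small relative error, at the cost of coarser steps near the maximum.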

Scalable Deployment Gains

For companies running AI in production, smaller models mean more inferences per second on the same hardware, or the ability to run the same workload on fewer GPUs. That cuts both latency and cost. FP8 can also simplify the deployment pipeline, because developers can skip some of the manual calibration steps often required with INT8 quantization.

The update arrives as the industry pushes toward lower-precision arithmetic to keep up with exploding model sizes. FP8 sits between 16-bit floating-point and 8-bit integer formats: it is as compact as INT8 but retains a floating-point exponent. That middle ground has been gaining traction in recent hardware designs, including NVIDIA's Hopper and Blackwell architectures. TensorRT now lets developers take advantage of that hardware support without rewriting their models from scratch.

TensorRT itself is a compiler and runtime that optimizes neural networks for NVIDIA GPUs. It fuses layers, prunes unused operations, and selects the most efficient kernels for each target device. Adding FP8 to that toolkit means developers can choose precision down to the layer level, mixing FP8 with other formats as needed.
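As a rough illustration of layer-level mixing, the sketch below assigns FP8 to layers that tolerate it and keeps the rest in FP16, then totals the resulting weight storage. The layer names, parameter counts, error scores, and threshold are all hypothetical, and the selection rule is an assumption for illustration rather than TensorRT's own logic.

```python
BYTES_PER_VALUE = {"fp8": 1, "fp16": 2}

def plan_precisions(layers: dict, max_error: float = 0.02) -> dict:
    """Map each layer name to 'fp8' if its measured FP8 error is acceptable, else 'fp16'.

    layers maps name -> (parameter_count, fp8_error), where fp8_error is a
    hypothetical per-layer accuracy loss measured beforehand.
    """
    return {
        name: "fp8" if error <= max_error else "fp16"
        for name, (_, error) in layers.items()
    }

def plan_bytes(layers: dict, plan: dict) -> int:
    """Total weight storage, in bytes, for a given precision plan."""
    return sum(count * BYTES_PER_VALUE[plan[name]] for name, (count, _) in layers.items())

if __name__ == "__main__":
    # Made-up layers: (parameter count, measured FP8 error).
    layers = {
        "attention_qkv": (50_000_000, 0.004),
        "mlp_up": (130_000_000, 0.010),
        "final_logits": (30_000_000, 0.090),  # sensitive layer stays in FP16
    }
    plan = plan_precisions(layers)
    print(plan)
    print(plan_bytes(layers, plan), "bytes")
```

A plan like this trades a little of the memory saving for accuracy by leaving only the most sensitive layers at higher precision.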

The feature doesn't require any changes to the underlying model code. Developers enable the FP8 flag in the TensorRT builder configuration and let the tool handle the conversion. That lowers the barrier for teams that want faster inference without diving into low-level GPU programming.

NVIDIA hasn't disclosed performance benchmarks for specific models under FP8 quantization. The company typically publishes such numbers alongside driver releases or TensorRT version notes. Developers can test the feature now by downloading the latest TensorRT version from NVIDIA's developer portal.
