NVIDIA CUDA 13.3 Adds Tile-Based GPU Programming in C++

NVIDIA has released CUDA 13.3, which introduces tile-based GPU programming directly in C++. The update aims to make better use of Tensor Cores while reducing the complexity of writing kernels.

How tile-based programming works

Tile-based programming breaks a computation into small, fixed-size blocks called tiles. Because tiles match the GPU's internal data-flow patterns, they make it easier to keep the hardware busy. In earlier versions of CUDA, developers had to map their data onto those patterns themselves. Now the compiler handles that mapping, at least for many common patterns.
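To make the idea of tiles concrete, here is a minimal sketch in plain C++ that runs on the CPU. It is not the CUDA 13.3 tile API, which this article does not show; the names TileRange and make_tiles are hypothetical and exist only for illustration. The sketch splits a two-dimensional problem into fixed-size tiles and clips the tiles on the edges.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One rectangular tile of a rows x cols problem (half-open index ranges).
// Hypothetical type for illustration only.
struct TileRange {
    std::size_t row_begin, row_end;
    std::size_t col_begin, col_end;
};

// Split a rows x cols index space into tiles of at most
// tile_rows x tile_cols. Edge tiles are clipped, so every element
// belongs to exactly one tile.
std::vector<TileRange> make_tiles(std::size_t rows, std::size_t cols,
                                  std::size_t tile_rows,
                                  std::size_t tile_cols) {
    assert(tile_rows > 0 && tile_cols > 0);
    std::vector<TileRange> tiles;
    for (std::size_t r = 0; r < rows; r += tile_rows) {
        for (std::size_t c = 0; c < cols; c += tile_cols) {
            tiles.push_back({r, std::min(r + tile_rows, rows),
                             c, std::min(c + tile_cols, cols)});
        }
    }
    return tiles;
}
```

Each TileRange is an independent unit of work. In a tile-based model, deciding which hardware resources process each tile is the part that moves from the programmer to the compiler.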

Tensor Cores are specialized hardware inside NVIDIA GPUs that accelerate matrix multiply-accumulate operations. They're central to AI training and inference, but getting peak performance out of them has often required intricate manual tuning. CUDA 13.3's tile abstraction splits matrix operations into tiles automatically so that more of the work runs on Tensor Cores. Developers write simpler code and still get good throughput.
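The multiply-accumulate pattern can be sketched on the CPU in plain C++. This is an illustration under stated assumptions, not NVIDIA's implementation: the tile edge kTile is an arbitrary value rather than a real Tensor Core shape, the function name is hypothetical, and the matrix dimensions are assumed to be multiples of the tile edge to keep the code short.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kTile = 4;  // illustrative tile edge, not a hardware shape

// C (m x n) += A (m x k) * B (k x n), all row-major, computed one
// kTile x kTile output tile at a time. The three innermost loops are the
// per-tile multiply-accumulate, the kind of operation Tensor Cores
// accelerate in hardware.
void tiled_matmul_accumulate(const std::vector<float>& A,
                             const std::vector<float>& B,
                             std::vector<float>& C,
                             std::size_t m, std::size_t n, std::size_t k) {
    assert(m % kTile == 0 && n % kTile == 0 && k % kTile == 0);
    assert(A.size() == m * k && B.size() == k * n && C.size() == m * n);
    for (std::size_t i0 = 0; i0 < m; i0 += kTile) {
        for (std::size_t j0 = 0; j0 < n; j0 += kTile) {
            for (std::size_t p0 = 0; p0 < k; p0 += kTile) {
                // Multiply one tile of A by one tile of B and add the
                // result into the matching tile of C.
                for (std::size_t i = i0; i < i0 + kTile; ++i) {
                    for (std::size_t j = j0; j < j0 + kTile; ++j) {
                        float sum = C[i * n + j];
                        for (std::size_t p = p0; p < p0 + kTile; ++p) {
                            sum += A[i * k + p] * B[p * n + j];
                        }
                        C[i * n + j] = sum;
                    }
                }
            }
        }
    }
}
```

The outer three loops only choose which tiles to combine. That choice is what hand-tuned kernels spend most of their effort on, and what the article says the compiler now makes for common patterns.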

Kernel development gets simpler

One of the biggest pain points in GPU programming is managing threads, shared memory, and synchronization. The new tile-based model abstracts those details away. Instead of writing a kernel that spawns thousands of threads and coordinates their access to shared memory, a programmer can express the computation as operations on tiles. The CUDA compiler then maps those tiles to the underlying hardware. That should reduce bugs and speed up development, especially for teams new to GPU computing.
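As a rough picture of what "operations on tiles" can look like, the sketch below writes a matrix transpose purely in terms of loading, transforming, and storing tiles. Transpose is a common example of a kernel that traditionally needs a shared-memory staging buffer and a barrier. Everything here is a hypothetical stand-in written in plain C++ for the CPU: Tile, load_tile, store_tile, and transpose are not names from the CUDA 13.3 API, and the serial loops take the place of whatever schedule a compiler would pick.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kTileEdge = 4;  // hypothetical tile edge

// Hypothetical value type: a square block of floats handled as one unit.
struct Tile {
    std::array<float, kTileEdge * kTileEdge> v{};
};

// Copy the tile whose top-left corner is (row0, col0) out of a row-major
// matrix that has `cols` columns.
Tile load_tile(const std::vector<float>& src, std::size_t cols,
               std::size_t row0, std::size_t col0) {
    Tile t;
    for (std::size_t r = 0; r < kTileEdge; ++r)
        for (std::size_t c = 0; c < kTileEdge; ++c)
            t.v[r * kTileEdge + c] = src[(row0 + r) * cols + (col0 + c)];
    return t;
}

// Write a tile back with its top-left corner at (row0, col0).
void store_tile(std::vector<float>& dst, std::size_t cols,
                std::size_t row0, std::size_t col0, const Tile& t) {
    for (std::size_t r = 0; r < kTileEdge; ++r)
        for (std::size_t c = 0; c < kTileEdge; ++c)
            dst[(row0 + r) * cols + (col0 + c)] = t.v[r * kTileEdge + c];
}

// Transpose the contents of a single tile.
Tile transpose(const Tile& t) {
    Tile out;
    for (std::size_t r = 0; r < kTileEdge; ++r)
        for (std::size_t c = 0; c < kTileEdge; ++c)
            out.v[c * kTileEdge + r] = t.v[r * kTileEdge + c];
    return out;
}

// Tile-level "kernel": no thread indices, shared-memory buffers, or
// barriers appear in this function. The input is rows x cols and the
// result is cols x rows, both row-major.
std::vector<float> transpose_matrix(const std::vector<float>& in,
                                    std::size_t rows, std::size_t cols) {
    assert(rows % kTileEdge == 0 && cols % kTileEdge == 0);
    assert(in.size() == rows * cols);
    std::vector<float> out(rows * cols);
    for (std::size_t r0 = 0; r0 < rows; r0 += kTileEdge)
        for (std::size_t c0 = 0; c0 < cols; c0 += kTileEdge)
            store_tile(out, rows, c0, r0,
                       transpose(load_tile(in, cols, r0, c0)));
    return out;
}
```

The body of transpose_matrix describes only which tile goes where. How threads cooperate on each tile is left to the layer underneath, which is the division of labor the tile model is described as providing.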

Availability

CUDA 13.3 is available now for download from NVIDIA's developer site. It supports all current NVIDIA GPU architectures, including the Hopper and Blackwell lines. Developers can start experimenting with the tile-based API immediately.

The update doesn't deprecate older approaches, but it sets a new default path for writing efficient GPU code. How quickly the broader community adopts it will depend on how well the compiler maps tile operations to real hardware, and on whether performance matches hand-tuned kernels in critical workloads.