Kimi K2.5 Runs on an RTX 3060 at 4 Tokens Per Second With 768GB Intel Optane Memory

Developers have run the Kimi K2.5 large language model on a modest consumer graphics card, an RTX 3060, by pairing it with 768GB of Intel Optane memory. The system sustains about 4 tokens per second, which is slow but usable for local inference.

The hardware combo that made it possible

The RTX 3060 is a mid-range GPU from Nvidia's 30-series and is not typically used for running models this size. The key enabler is the 768GB of Intel Optane persistent memory. Optane is slower than DRAM but faster than a conventional SSD, so the model can be loaded without expensive server-grade video memory. In effect, the system uses the GPU for compute and the Optane as a large, reasonably fast memory pool.
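To see why the memory pool tends to set the pace in a setup like this, here is a rough back-of-envelope sketch. It assumes decoding is memory-bound, meaning the weights used for each token must be read from the memory pool once per token, so throughput cannot exceed bandwidth divided by bytes read per token. The function name and every number passed to it are hypothetical placeholders, not measurements of this system or published figures for Kimi K2.5.

```python
def estimate_tokens_per_second(bandwidth_bytes_per_s, active_params, bytes_per_param):
    """Upper bound on decode speed when weights are streamed from memory per token.

    All three inputs are assumptions supplied by the caller:
      bandwidth_bytes_per_s: sustained read bandwidth of the memory pool
      active_params:         parameters actually read to produce one token
      bytes_per_param:       storage per parameter after quantization
    """
    bytes_per_token = active_params * bytes_per_param
    return bandwidth_bytes_per_s / bytes_per_token


# Placeholder inputs chosen only to make the arithmetic easy to follow.
example = estimate_tokens_per_second(
    bandwidth_bytes_per_s=20e9,  # hypothetical 20 GB/s
    active_params=10e9,          # hypothetical 10 billion parameters per token
    bytes_per_param=1.0,         # hypothetical 8-bit weights
)
print(f"{example:.1f} tokens/s")
```

Real throughput would likely land below this bound, since the estimate ignores compute time, transfers to the GPU, and caching effects.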

What 4 tokens per second means in practice

At that rate, a user can expect roughly one short sentence every few seconds. That is closer to a slow typist than a conversation. But for offline or private use cases, or for long-form text generation where latency isn't critical, it's functional. The trade-off is clear: the setup gains enough memory capacity to hold the model, and pays for it in speed.
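The arithmetic behind that estimate can be sketched in a few lines. The 4 tokens per second figure comes from the article; the words-per-token ratio is an assumption (a common rule of thumb for English text), and the response lengths are arbitrary examples.

```python
TOKENS_PER_SECOND = 4.0  # rate reported in the article
WORDS_PER_TOKEN = 0.75   # assumption: rough rule of thumb for English text


def seconds_for_tokens(n_tokens, rate=TOKENS_PER_SECOND):
    """Time to generate n_tokens at a constant decode rate."""
    return n_tokens / rate


def words_per_second(rate=TOKENS_PER_SECOND, words_per_token=WORDS_PER_TOKEN):
    """Approximate reading-speed equivalent of a token rate."""
    return rate * words_per_token


# Arbitrary example lengths: a short sentence, a long answer, a long-form draft.
for n in (20, 500, 2000):
    print(f"{n} tokens: {seconds_for_tokens(n):.0f} s")
```

Under these assumptions, output arrives at about 3 words per second, and a 500-token answer takes a little over two minutes.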

Running large models on consumer hardware has long been a goal for developers and hobbyists, since most high-end models require data center GPUs. This demonstration shows that with enough Optane memory, even a budget gaming GPU can run inference on a model as large as Kimi K2.5. It's not a record, but it's a practical proof of concept. The same approach could be applied to other models, though performance will vary.

The developers haven't released detailed instructions or benchmarks for other hardware yet. Whether this configuration can be replicated with different GPUs or cheaper memory remains an open question. For now, it's a notable data point in the push to bring large language models to ordinary computers.
