The National Institute of Standards and Technology's CAISI team evaluated China's DeepSeek V4 Pro using private benchmarks, but only after applying a cost-comparison filter that excluded every US AI model except OpenAI's GPT-5.4 mini. The US government says the results prove China's best AI still trails its American counterparts. Not everyone is buying it.
The evaluation setup
NIST's CAISI, its Center for AI Standards and Innovation, ran DeepSeek V4 Pro through a series of private benchmarks that it has not released publicly. What the agency did disclose was a cost-comparison filter that narrowed the field of competing models to a single US entrant, GPT-5.4 mini; every other US model was excluded from the comparison.
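NIST hasn't described how its filter worked, but the general shape of a cost-based comparison filter is easy to illustrate. The sketch below is purely hypothetical: the model names other than GPT-5.4 mini, the per-token prices, and the cutoff ratio are placeholders invented for illustration, not figures from the CAISI evaluation. It simply shows how a tight cost cutoff can leave a single US model in the comparison pool.

```python
# Hypothetical illustration of a cost-comparison filter.
# None of these prices or thresholds come from NIST/CAISI; they are
# placeholders chosen to show how such a filter narrows the field.

# Candidate US models with illustrative cost-per-million-tokens figures.
us_models = {
    "frontier-model-a": 15.00,  # placeholder price
    "frontier-model-b": 10.00,  # placeholder price
    "gpt-5.4-mini": 0.60,       # placeholder price
}

# Illustrative price for the model under evaluation.
deepseek_v4_pro_price = 0.70  # placeholder

# A "cost-comparable" filter: keep only US models whose price falls
# within some multiple of the evaluated model's price.
COST_RATIO_CUTOFF = 1.5  # placeholder threshold

comparable = {
    name: price
    for name, price in us_models.items()
    if price <= deepseek_v4_pro_price * COST_RATIO_CUTOFF
}

print(comparable)  # {'gpt-5.4-mini': 0.6}: every pricier US model drops out
```

The point isn't the invented numbers; it's that under this kind of setup, the cutoff, not the benchmark, decides which head-to-head comparisons ever get run.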
Why the filter matters
That filter is drawing the most scrutiny. By pitting DeepSeek V4 Pro against only a smaller, cheaper version of GPT-5, rather than against the full lineup of US frontier models, the evaluation narrows the comparison considerably. The US government cited the results to assert that China's best AI systems lag behind those developed in the United States. But critics say the methodology makes that claim hard to take at face value.
Experts call the methodology 'convenient'
Some researchers and industry watchers question the validity of the whole exercise. They describe the filter as 'convenient' — a term that suggests the comparison was engineered to produce a favorable outcome for the US side. By excluding competing US models, the evaluation avoids harder comparisons that might show DeepSeek V4 Pro performing closer to — or even on par with — leading American systems. The critics aren't disputing the raw benchmark data. They're disputing whether that data means what the government says it means.
What NIST hasn't said
NIST hasn't explained why it chose the cost-comparison filter or why it excluded every US model except GPT-5.4 mini. The agency also hasn't released the private benchmarks it used, making it impossible for outside researchers to replicate the work. Without that transparency, the government's claim that Chinese AI lags behind rests on a methodology that even sympathetic observers have trouble defending.
The unanswered question: will NIST release the full methodology and benchmarks so the public can judge for itself? So far, silence.