NVIDIA Blackwell Ultra Sets New Records in MLPerf Inference Debut

SANTA CLARA, Calif. — Sept. 9, 2025 — NVIDIA today announced that its new Blackwell Ultra architecture has set multiple inference performance records in its first appearance on the MLPerf Inference v5.1 benchmark suite, just six months after the original Blackwell architecture debuted.

As large language models (LLMs) expand into the hundreds of billions of parameters and generate more intermediate reasoning tokens, demand for compute power continues to rise. NVIDIA’s Blackwell Ultra delivers the performance needed for this new era of reasoning models.

Benchmark Highlights

The MLPerf Inference v5.1 round introduced several new tests:

  • DeepSeek-R1 (671B MoE model): Blackwell Ultra set new records, delivering up to 5,842 tokens/second per GPU offline and 2,907 tokens/second per GPU in server scenarios — more than 5x faster than Hopper-based systems.

  • Llama 3.1 405B: Achieved strong results in a new interactive scenario requiring faster response times, with up to 138 tokens/second per GPU.

  • Llama 3.1 8B: Replaced GPT-J in the suite, with throughput reaching 18,370 tokens/second per GPU offline.

  • Whisper (speech recognition): Recorded 5,667 tokens/second per GPU, reinforcing Blackwell’s versatility beyond text-based models.

NVIDIA’s submissions also set records across established benchmarks including Llama 2 70B, Stable Diffusion XL, Mixtral 8x7B, DLRMv2, R-GAT, and RetinaNet.
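As a rough illustration, the per-GPU figures above can be extrapolated to rack scale. The NVL72 form factor pairs 72 GPUs per rack; the assumption of linear scaling across the rack is illustrative, not a measured MLPerf result:

```python
# Illustrative extrapolation only: per-GPU rates are the offline numbers
# reported in this article; 72 GPUs comes from the NVL72 form factor.
PER_GPU_TOKENS_PER_SEC = {
    "DeepSeek-R1 (offline)": 5842,
    "Llama 3.1 8B (offline)": 18370,
    "Whisper": 5667,
}

def rack_throughput(per_gpu_rate: float, gpus_per_rack: int = 72) -> float:
    """Aggregate tokens/second for a full rack, assuming linear scaling."""
    return per_gpu_rate * gpus_per_rack

for name, rate in PER_GPU_TOKENS_PER_SEC.items():
    print(f"{name}: {rack_throughput(rate):,.0f} tokens/s per rack")
```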

Blackwell Ultra Architecture

The GB300 NVL72 rack-scale system, the first platform powered by Blackwell Ultra, delivered up to 45% higher per-GPU performance than the Blackwell-based GB200 NVL72 on comparable workloads. Key improvements include:

  • 1.5x higher peak NVFP4 AI compute

  • 2x higher attention-layer compute

  • 1.5x larger HBM3e memory capacity

When compared with unverified Hopper system results, Blackwell Ultra demonstrated about 5x higher throughput per GPU on DeepSeek-R1.

Full-Stack Innovations

Performance gains were enabled by NVIDIA’s end-to-end AI platform, including:

  • NVFP4 quantization for DeepSeek-R1, reducing model size while maintaining accuracy.

  • FP8 key-value cache optimization, cutting memory use and boosting speed.

  • Expert and attention data parallelism (ADP Balance) to efficiently distribute workloads across GPUs.

  • CUDA Graphs to minimize CPU overhead during inference.

  • Disaggregated serving for Llama 3.1 405B, separating context and generation across GPUs for more than 5x throughput improvements over Hopper-based systems.
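To give a sense of how 4-bit quantization like NVFP4 reduces model size, here is a generic block-quantization sketch (not NVIDIA's actual NVFP4 format or kernels): weights are split into small blocks, and each block stores 4-bit signed codes plus one shared scale, roughly quartering memory versus 16-bit weights.

```python
def quantize_block(block, bits=4):
    """Map a block of floats to signed 4-bit codes [-8, 7] plus one shared scale.

    Generic illustration of block quantization; NVFP4 itself uses a
    floating-point 4-bit format with its own scaling scheme.
    """
    qmax = 2 ** (bits - 1) - 1                 # 7 for 4-bit signed
    scale = max(abs(x) for x in block) / qmax or 1.0
    codes = [max(-qmax - 1, min(qmax, round(x / scale))) for x in block]
    return codes, scale

def dequantize_block(codes, scale):
    """Reconstruct approximate float values from codes and the block scale."""
    return [c * scale for c in codes]
```

Because the codes round to the nearest quantization level, the per-element reconstruction error is bounded by half the block scale, which is why accuracy can be maintained when blocks are small.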

These advances were further supported by NVIDIA’s Dynamo inference framework, enabling SLA-based autoscaling, real-time observability, and improved fault tolerance.
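The FP8 key-value cache optimization mentioned above can be motivated with back-of-envelope arithmetic: storing keys and values in 1 byte instead of 2 halves cache memory. The model dimensions below are illustrative placeholders, not those of any specific model in the benchmark:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_value):
    """Total KV-cache size: 2 tensors (key + value) per layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value

# Illustrative dimensions (hypothetical model configuration)
args = dict(layers=80, kv_heads=8, head_dim=128, seq_len=8192, batch=32)
fp16 = kv_cache_bytes(**args, bytes_per_value=2)
fp8 = kv_cache_bytes(**args, bytes_per_value=1)
print(f"FP16 KV cache: {fp16 / 2**30:.1f} GiB")
print(f"FP8  KV cache: {fp8 / 2**30:.1f} GiB ({fp16 / fp8:.0f}x smaller)")
```

Halving the cache frees memory for larger batches or longer contexts, which is where much of the speedup comes from.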

Expanding AI Infrastructure

The results reinforce NVIDIA’s leadership across diverse AI workloads, from reasoning LLMs to speech and vision models. Alongside Blackwell Ultra, NVIDIA also introduced Rubin CPX, a new processor designed to accelerate long-context processing.
