AI models aren’t just answering questions anymore; they’re thinking through them. Each response now involves layers of reasoning, tools, and follow-ups. That deeper logic comes at a price: running these models in production, a process known as inference, is becoming one of the biggest forces driving compute costs in AI.
A new independent benchmark, InferenceMAX v1, is the first to measure the total cost of compute across real-world scenarios. The results show NVIDIA’s Blackwell platform far ahead of the pack — delivering strong performance and top-tier efficiency for large-scale AI operations.
According to the analysis, a $5 million NVIDIA GB200 NVL72 system could generate about $75 million in token revenue, a 15x return on investment — the kind of math that’s reshaping how companies think about AI inference infrastructure.
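The headline claim reduces to simple arithmetic. The sketch below just restates the article’s reported figures; the token price and volume at the end are placeholders added for illustration, not measured values.

```python
# Back-of-the-envelope version of the article's headline claim.
# All figures are the reported or illustrative ones, not measurements.
system_cost = 5_000_000       # reported GB200 NVL72 system cost (USD)
token_revenue = 75_000_000    # projected token revenue over the deployment (USD)

roi_multiple = token_revenue / system_cost
print(f"ROI: {roi_multiple:.0f}x")  # -> ROI: 15x

# Where token revenue comes from: tokens served times the price per million tokens.
# The price below is a hypothetical placeholder.
price_per_million = 0.50                                   # USD per million tokens
tokens_needed = token_revenue / price_per_million * 1_000_000
print(f"Tokens to serve: {tokens_needed:.2e}")             # ~1.5e14 tokens at that price
```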
“Inference is where AI delivers value every day,” said Ian Buck, vice president of hyperscale and high-performance computing at NVIDIA. “These results show that NVIDIA’s full-stack approach gives customers the performance and efficiency they need to deploy AI at scale.”
What InferenceMAX measures — and why it matters
Released by SemiAnalysis, InferenceMAX v1 tests popular AI models on multiple platforms and evaluates their performance across a range of workloads. The results are transparent and reproducible, offering a rare look into the real-world economics of AI computing.
Benchmarks like this are more than bragging rights. As generative AI moves toward multi-step reasoning and tool use, models are producing far more tokens per query — multiplying compute costs and making efficiency the new competitive edge.
NVIDIA’s deep ties with the open-source community play a big role here. Collaborations with OpenAI (gpt-oss 120B), Meta (Llama 3.3 70B), and DeepSeek AI (DeepSeek R1) have helped optimise model performance. Partnerships with the developers behind FlashInfer, SGLang, and vLLM have led to kernel and runtime improvements that push these models to new speeds.
Software tweaks keep the gains coming
Hardware isn’t the only story. NVIDIA keeps finding new performance gains through hardware-software co-design — refining both layers together.
Its TensorRT-LLM library, used with DGX Blackwell B200 systems, already pushed open-source large language models to new limits. The recent TensorRT-LLM v1.0 update took it further, improving parallelisation and using NVLink Switch’s 1,800 GB/s bandwidth to boost throughput.
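For a sense of what serving such a model looks like in practice, here is a minimal sketch using the high-level Python LLM API that recent TensorRT-LLM releases expose (it mirrors vLLM’s interface). The model id, parallelism degree, prompt, and sampling settings are illustrative placeholders, not the benchmark configuration.

```python
from tensorrt_llm import LLM, SamplingParams

# Illustrative settings only; not the InferenceMAX benchmark configuration.
llm = LLM(
    model="openai/gpt-oss-120b",  # example model id; swap in your own checkpoint
    tensor_parallel_size=8,       # shard weights across 8 GPUs connected by NVLink
)

sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)

outputs = llm.generate(
    ["Explain why cost per token matters for large-scale inference."],
    sampling,
)
for out in outputs:
    print(out.outputs[0].text)
```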
The gpt-oss-120b-Eagle3-v2 model also adds speculative decoding, a technique that drafts several tokens ahead and verifies them in a single pass. The payoff: faster response times and up to 30,000 tokens per second per GPU, five times more than before.
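To make the idea concrete, here is a toy, model-free sketch of draft-and-verify speculative decoding. It is not the Eagle3 implementation (which verifies all drafted tokens in one batched forward pass of the large model); it only shows the control flow that explains where the speed-up comes from.

```python
import random

# Stand-ins for real models: each maps a token context to the next token id.
def draft_model(context):       # cheap "draft" model
    return (sum(context) * 31 + 7) % 50

def target_model(context):      # expensive "target" model whose output must be matched
    guess = (sum(context) * 31 + 7) % 50
    return guess if random.random() < 0.8 else random.randrange(50)

def speculative_step(context, k=4):
    """Draft k tokens cheaply, then check them against the target model.

    Tokens the target agrees with are accepted; the first mismatch is replaced
    by the target's own token. In real systems the verification happens in one
    batched pass of the large model, so several tokens come out for roughly
    the cost of generating one.
    """
    ctx, drafted = list(context), []
    for _ in range(k):
        t = draft_model(ctx)
        drafted.append(t)
        ctx.append(t)

    accepted, ctx = [], list(context)
    for t in drafted:
        verified = target_model(ctx)
        accepted.append(verified)
        ctx.append(verified)
        if verified != t:       # draft diverged; stop and re-draft from here
            break
    return accepted

tokens = [1, 2, 3]
while len(tokens) < 20:
    tokens += speculative_step(tokens)
print(tokens)
```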
Dense models like Llama 3.3 70B, which activate every parameter for each generated token, also benefit. On the Blackwell B200 GPU, they hit over 10,000 tokens per second per GPU, four times the throughput of NVIDIA’s older H200.
Efficiency is the new performance metric
Throughput isn’t the only number that matters. For large AI data centres, efficiency metrics like tokens per watt and cost per million tokens can make or break profitability.
Here, the Blackwell platform stands out. It delivers 10x more throughput per megawatt compared to the previous generation, while cutting cost per million tokens by 15x. For operators running massive inference workloads, that translates to lower costs and higher margins.
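Both metrics are straightforward to compute for any deployment. The helper below shows the arithmetic; the throughput, power, and hourly-cost inputs are hypothetical figures, not benchmark results.

```python
def tokens_per_watt(tokens_per_second: float, power_watts: float) -> float:
    """Energy efficiency: tokens generated each second per watt of power drawn."""
    return tokens_per_second / power_watts

def cost_per_million_tokens(tokens_per_second: float, cost_per_hour_usd: float) -> float:
    """Serving cost normalised to one million generated tokens."""
    tokens_per_hour = tokens_per_second * 3_600
    return cost_per_hour_usd / tokens_per_hour * 1_000_000

# Hypothetical inputs for illustration only; not InferenceMAX numbers.
print(tokens_per_watt(10_000, 1_000))           # 10.0 tokens/s per watt
print(cost_per_million_tokens(10_000, 90.0))    # $2.50 per million tokens
```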
Performance that scales
InferenceMAX uses what’s known as a Pareto frontier — a way of visualising the best trade-offs between factors like throughput, energy use, and responsiveness. On this curve, NVIDIA’s Blackwell platform consistently sits on the efficient edge, balancing speed and cost in production settings.
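In code, a Pareto frontier is just a filter that keeps the operating points no other point beats on every axis at once. The sketch below uses two of the axes such benchmarks plot (responsiveness per user versus total throughput per GPU), with made-up numbers.

```python
def pareto_frontier(points):
    """Keep only the configurations no other configuration beats on both axes.

    Each point is (tokens/s per user, tokens/s per GPU): responsiveness versus
    total throughput, where more of each is better but the two conflict.
    """
    frontier = []
    for p in points:
        dominated = any(q != p and q[0] >= p[0] and q[1] >= p[1] for q in points)
        if not dominated:
            frontier.append(p)
    return sorted(frontier)

# Hypothetical operating points, for illustration only.
configs = [(100, 2_000), (80, 3_000), (60, 6_000), (30, 10_000), (20, 9_000)]
print(pareto_frontier(configs))
# -> [(30, 10000), (60, 6000), (80, 3000), (100, 2000)]
```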
That balance matters because optimising for one variable — say, raw speed — often hurts others, like cost or energy draw. The Blackwell stack was built to keep all those pieces aligned, delivering performance that actually scales in the real world.
Inside Blackwell’s design
What gives Blackwell its edge is how tightly hardware and software are built to work together. The architecture uses the NVFP4 precision format, which improves efficiency without sacrificing accuracy, and a fifth-generation NVLink that links up to 72 GPUs so they perform as one massive processor. The NVLink Switch manages parallel workloads across tensors, experts, and data streams, helping the system handle high concurrency without slowing down.
Since launch, NVIDIA says ongoing software optimisations have already doubled Blackwell’s performance, showing how much progress can come from updates alone. That’s supported by open frameworks such as TensorRT-LLM, NVIDIA Dynamo, SGLang, and vLLM, all tuned for top inference performance. Underpinning it all is a huge developer ecosystem — more than 7 million CUDA developers contributing to over 1,000 open-source projects — which keeps the platform evolving as new workloads emerge.
From AI experiments to AI factories
The AI industry is shifting from pilots to AI factories — infrastructure built to turn data into tokens, predictions, and business decisions in real time.
Open, transparent benchmarks like InferenceMAX help teams pick the right hardware, control costs, and plan for service-level targets as workloads grow. NVIDIA’s Think SMART framework aims to guide enterprises through this phase — where inference performance isn’t just a technical metric, but a financial one.
In today’s AI inference race, speed matters — but efficiency decides who stays ahead.
(Photo by 🇻🇪 Jose G. Ortega Castro 🇲🇽)
See also: KAI Scheduler: NVIDIA open-sources Kubernetes GPU scheduler

Want to dive deeper into the tools and frameworks shaping modern development? Check out the AI & Big Data Expo, taking place in Amsterdam, California, and London. Explore cutting-edge sessions on machine learning, data pipelines, and next-gen AI applications. The event is part of TechEx and co-located with other leading technology events. Click here for more information.
DeveloperTech News is powered by TechForge Media. Explore other upcoming enterprise technology events and webinars here.