Case Study: NVIDIA achieves 4.7× inference throughput and ~95% GPU utilization with NVIDIA Run:ai

An NVIDIA Run:ai Case Study


Efficient compute orchestration yields more productive AI

NVIDIA faced the challenge of managing and allocating more than 200 on‑prem GPUs, spanning NVIDIA DGX systems and T4 cards, for more than 100 researchers in a fully air‑gapped defense environment, where inference workloads demanded extremely low latency and maximum throughput. They engaged NVIDIA Run:ai to deploy Run:ai's Kubernetes‑based orchestration platform (Atlas) together with NVIDIA Triton Inference Server and NVIDIA GPU tooling, simplifying shared cluster management and enabling dynamic GPU allocation.

NVIDIA Run:ai pooled the GPUs into logical build, train, and inference pools and applied advanced scheduling, including gang scheduling and topology awareness, so that jobs automatically received the right compute and memory. The solution drove GPU utilization to ~95–100%, created a private managed GPU cloud that research teams could access on demand, and boosted inference throughput by 4.7×, while also accelerating training turnaround and improving capacity planning.
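The pooling and gang-scheduling ideas above can be sketched as a toy allocator. This is an illustrative model only, not Run:ai's implementation or API; all class names, pool names, and method signatures here are assumptions made for the example.

```python
# Toy sketch of GPU pooling with all-or-nothing (gang) scheduling.
# Hypothetical names throughout -- this is NOT Run:ai's actual API.
from dataclasses import dataclass, field


@dataclass
class GpuPool:
    name: str
    free_gpus: int

    def try_reserve(self, count: int) -> bool:
        # Gang scheduling: grant every requested GPU atomically, or none,
        # so a distributed job never starts partially scheduled.
        if count <= self.free_gpus:
            self.free_gpus -= count
            return True
        return False

    def release(self, count: int) -> None:
        self.free_gpus += count


@dataclass
class Scheduler:
    pools: dict = field(default_factory=dict)

    def add_pool(self, name: str, gpus: int) -> None:
        # Logical pools (e.g. build/train/inference) carve up one
        # physical cluster without static per-team assignment.
        self.pools[name] = GpuPool(name, gpus)

    def submit(self, pool_name: str, gpus_needed: int) -> bool:
        pool = self.pools.get(pool_name)
        return pool is not None and pool.try_reserve(gpus_needed)


sched = Scheduler()
sched.add_pool("train", 8)
sched.add_pool("inference", 4)

print(sched.submit("train", 8))      # True: the whole gang fits
print(sched.submit("train", 1))      # False: pool exhausted, all-or-nothing
print(sched.submit("inference", 2))  # True: inference pool is independent
```

A real scheduler layers queues, priorities, preemption, and topology awareness on top of this, but the core invariant is the same: a multi-GPU job either gets its full allocation or waits.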


