Case Study: OpenAI scales machine learning experiments faster and at lower cost with Kubernetes

Launching and Scaling Up Experiments, Made Simple

OpenAI, the San Francisco-based artificial intelligence research lab, needed a way to run and scale machine learning experiments quickly, portably, and at low cost across cloud and on-prem environments. To solve this, OpenAI adopted **Kubernetes** for managing deep learning workloads and batch scheduling, using it as the workload manager for distributed, containerized experiments.
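The case study does not include OpenAI's actual configuration, but a batch-scheduled, containerized experiment of the kind described is commonly expressed as a Kubernetes `batch/v1` Job. The sketch below is illustrative only; the Job name, container image, and resource figures are placeholders, not anything from the source:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: experiment-train        # hypothetical experiment name
spec:
  parallelism: 8                # run 8 worker pods at once
  completions: 8                # the Job finishes when 8 pods succeed
  template:
    spec:
      restartPolicy: Never      # failed experiment pods are not restarted in place
      containers:
      - name: trainer
        image: registry.example.com/experiment:latest   # placeholder image
        resources:
          limits:
            nvidia.com/gpu: 1   # one GPU per pod, via the NVIDIA device plugin
```

Because the scheduler, not the researcher, decides which nodes run these pods, the same manifest can be submitted unchanged to a cloud cluster or an on-prem one, which is the portability the case study highlights.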

With **Kubernetes**, OpenAI first ran clusters on AWS, then migrated them to Azure, and later deployed hybrid clusters with control planes in Azure and nodes in its own data centers. The result was greater portability, lower cost, and much faster experimentation: researchers could launch a project in two to three days and scale it to hundreds of GPUs within one to two weeks, versus the months it took previously. Some teams saw a 10x increase in scale without significant added engineering effort.


Christopher Berner, Head of Infrastructure, OpenAI
