Case Study: OpenAI scales machine learning experiments faster and at lower cost with Kubernetes

Launching and Scaling Up Experiments, Made Simple

OpenAI, the San Francisco-based artificial intelligence research lab, needed a way to run and scale machine learning experiments quickly, portably, and at low cost across cloud and on-prem environments. To solve this, OpenAI adopted **Kubernetes** for managing deep learning workloads and batch scheduling, using it as the workload manager for distributed, containerized experiments.
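The case study does not include OpenAI's actual configuration, but a batch-scheduled, containerized experiment of the kind described is commonly expressed as a Kubernetes `batch/v1` Job. The sketch below is illustrative only; the Job name, container image, and resource figures are placeholders, not anything from the source:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: experiment-train        # hypothetical experiment name
spec:
  parallelism: 8                # run 8 worker pods at once
  completions: 8                # the Job finishes when 8 pods succeed
  template:
    spec:
      restartPolicy: Never      # failed experiment pods are not restarted in place
      containers:
      - name: trainer
        image: registry.example.com/experiment:latest   # placeholder image
        resources:
          limits:
            nvidia.com/gpu: 1   # one GPU per pod, via the NVIDIA device plugin
```

Because the scheduler, not the researcher, decides which nodes run these pods, the same manifest can be submitted unchanged to a cloud cluster or an on-prem one, which is the portability the case study highlights.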

With **Kubernetes**, OpenAI first ran clusters on AWS, then migrated them to Azure, and later deployed hybrid clusters with control planes in Azure and nodes in its own data centers. The result was greater portability, lower cost, and much faster experimentation: researchers could launch a project in two to three days and scale it to hundreds of GPUs within one to two weeks, versus the months it took previously. Some teams saw a 10x increase in scale without significant added engineering effort.


Christopher Berner, Head of Infrastructure, OpenAI
