Case Study: OpenAI achieves 1M data points/week with low-latency, high-quality human preference labeling from Scale AI

A Scale AI Case Study


How to Label 1M Data Points/Week

OpenAI needed fast, high-quality human preference labels to fine-tune GPT-2 in a closed loop: the model generated samples, humans rated them, and the model was retrained on those ratings. Because the tasks were subjective, the loop demanded very low latency (labels back in under 30 minutes) and high throughput (~5,000 labels/hour). To collect preferences at that scale, OpenAI worked with Scale AI, and the central challenge was maintaining label quality without the usual redundancy and slow staging processes.
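To make the loop concrete, here is a minimal, self-contained sketch of that generate-label-retrain cycle. All names (PolicyModel, collect_preference, finetune_on_preferences) are hypothetical stand-ins for illustration, not OpenAI's or Scale AI's actual interfaces; the latency and throughput constants simply restate the figures above.

```python
import random

LATENCY_BUDGET_MIN = 30         # labels needed back within ~30 minutes
TARGET_LABELS_PER_HOUR = 5_000  # throughput target cited in the case study


class PolicyModel:
    """Stand-in for the language model being fine-tuned (hypothetical)."""

    def sample_pair(self, prompt):
        # Hypothetical: return two candidate completions for a prompt.
        return (f"{prompt} [completion A]", f"{prompt} [completion B]")


def collect_preference(prompt, completion_a, completion_b):
    """Stand-in for a human labeler picking the better completion."""
    return random.choice([completion_a, completion_b])


def finetune_on_preferences(model, preferences):
    """Stand-in for one retraining step on the collected preference labels."""
    print(f"retraining on {len(preferences)} preference labels")
    return model


def closed_loop(model, prompts, iterations=3):
    """One cycle per iteration: generate samples, collect labels, retrain."""
    for _ in range(iterations):
        batch = []
        for prompt in prompts:
            a, b = model.sample_pair(prompt)
            batch.append((prompt, collect_preference(prompt, a, b)))
        model = finetune_on_preferences(model, batch)
    return model


if __name__ == "__main__":
    closed_loop(PolicyModel(), ["Summarize this article:", "Continue this story:"])
```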

Scale AI built an automated benchmark-mining system: trusted labelers created hidden “golden” tasks, benchmarks were served dynamically to detect quality dips, examples were weighted to counteract labeler bias, synthetic projects let changes be tested safely, and monitoring and audits ran throughout. This approach let Scale AI meet OpenAI’s latency and throughput targets (roughly 5,000 labels/hour, enabling labeling at the scale of ~1M data points/week), filter out low-quality or malicious workers, and roll the system out across its NLP pipelines while improving its labeler-quality classifier.
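As an illustration of the quality-control mechanics, the sketch below mixes hidden golden tasks into a labeler's queue and flags labelers whose rolling accuracy on those benchmarks dips. The injection rate, accuracy threshold, window size, and all names (LabelerMonitor, serve_task) are illustrative assumptions rather than Scale AI's actual parameters or code.

```python
import random
from collections import deque

GOLDEN_RATE = 0.1          # assumed fraction of served tasks that are hidden benchmarks
ACCURACY_THRESHOLD = 0.8   # assumed rolling accuracy below which a labeler is flagged
WINDOW = 20                # rolling window of recent golden-task results


class LabelerMonitor:
    """Tracks each labeler's accuracy on hidden golden tasks."""

    def __init__(self):
        self.recent = {}  # labeler_id -> deque of 1/0 golden-task outcomes

    def record(self, labeler_id, correct):
        self.recent.setdefault(labeler_id, deque(maxlen=WINDOW)).append(int(correct))

    def is_flagged(self, labeler_id):
        results = self.recent.get(labeler_id)
        if not results or len(results) < 5:
            return False  # not enough golden-task signal yet
        return sum(results) / len(results) < ACCURACY_THRESHOLD


def serve_task(real_tasks, golden_tasks):
    """Decide whether the next task is real work or a hidden benchmark."""
    if golden_tasks and random.random() < GOLDEN_RATE:
        return random.choice(golden_tasks), True
    return real_tasks.pop(0), False


if __name__ == "__main__":
    monitor = LabelerMonitor()
    golden = [("benchmark prompt", "known-good answer")] * 3
    work = [("real prompt", None) for _ in range(50)]
    while work:
        task, is_golden = serve_task(work, golden)
        if is_golden:
            # Simulate checking the labeler's answer against the known-good one.
            monitor.record("labeler-42", correct=random.random() > 0.3)
    print("flagged:", monitor.is_flagged("labeler-42"))
```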

