Case Study: Workday achieves a scalable, low‑maintenance metrics pipeline with Elastic Stack

An Elastic Case Study

Preview of the Workday Case Study

How Workday Search Built our Metrics Pipeline With the Elastic Stack

Workday Search needed a low‑maintenance, scalable way to collect and analyze operational and correctness metrics as it evolved from a monolith to microservices at large scale. The team's existing setup (statsd → InfluxDB → Grafana) became a bottleneck as log and metric volumes exploded, and the team also faced hard correctness requirements (per‑tenant on‑disk encryption, multi‑datacenter search, and strict reindex/query SLAs).

They replaced InfluxDB with a dedicated Elasticsearch “metrics” cluster fed by Logstash, Marvel, and application logs. A grok/json/ruby parsing pipeline and a Scala StatsLogger kept metric emission consistent, and the data was exposed to Grafana, Kibana, and Nagios. The pipeline now handles billions of data points (two‑week totals of ~15B logs and ~9.3B metrics) and enabled index rotation and curation. It drove incremental indexing coverage from ~50% to >90%, with 94% of tenants reindexed in <12 minutes and 100% in <3 hours. It also provided correctness metrics (set diffs, Kendall tau) that uncovered and helped fix multi‑shard discrepancies in search results.
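
The two correctness metrics named above can be illustrated with a small sketch. This is not Workday's implementation. It assumes each search result is a ranked list of unique document ids, and the function names `set_diff` and `kendall_tau` are hypothetical. The set diff reports which ids appear in only one of the two result lists, and Kendall tau (the tau-a variant, computed over the ids both lists share) measures how well their orderings agree, from 1.0 (same order) to -1.0 (reversed).

```python
from itertools import combinations


def set_diff(expected, actual):
    """Return (missing, extra): ids only in `expected`, ids only in `actual`."""
    exp, act = set(expected), set(actual)
    return exp - act, act - exp


def kendall_tau(expected, actual):
    """Kendall tau-a over the ids present in both rankings.

    Assumes ids are unique within each list, so there are no ties.
    """
    common = set(expected) & set(actual)
    # Re-rank each list using only the shared ids.
    pos_a = {doc: i for i, doc in enumerate(d for d in expected if d in common)}
    pos_b = {doc: i for i, doc in enumerate(d for d in actual if d in common)}
    n = len(common)
    if n < 2:
        return 1.0  # nothing to disagree about
    concordant = discordant = 0
    for x, y in combinations(pos_a, 2):
        # A pair is concordant when both rankings order it the same way.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical results for the same query from two configurations.
single_shard = ["a", "b", "c", "d"]
multi_shard = ["a", "c", "b", "e"]
missing, extra = set_diff(single_shard, multi_shard)
tau = kendall_tau(single_shard, multi_shard)
```

In this example, `missing` contains `"d"`, `extra` contains `"e"`, and `tau` is 1/3 because one of the three shared pairs is ordered differently. A metric like this, emitted per query, is the kind of signal that could surface ordering discrepancies between single-shard and multi-shard results.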



Workday

Bodecker DellaMaria

Software Engineer

