Case Study: GLiNER cuts inference latency by 2x with Pruna AI

A Pruna AI Case Study

Pruna Cuts GLiNER Latency by 2x for the Largest Cloud Monitoring Platform

The customer, a large cloud monitoring-as-a-service platform that cannot be named, faced significant cost and latency challenges after deploying a GLiNER model to detect PII in millions of log streams per second. They engaged the vendor, Pruna AI, with a clear goal: reduce the model's latency by at least 10% without compromising quality.

Pruna AI implemented a multi-faceted optimization strategy, combining techniques such as half-precision (float16) inference, model compilation with torch.compile, and targeted pruning. This solution nearly doubled the speed of a critical sub-process, cutting its latency from 35 ms to 19 ms. The vendor's work resulted in a 2x overall speedup and a 50% reduction in memory usage. For the customer, this translated into significant financial savings, estimated at up to €28,260 per year for a single feature on a modest GPU setup, while also improving the end-user experience.
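To make "nearly doubled" concrete, the sub-process figures quoted above (35 ms before, 19 ms after) can be turned into a speedup factor and a percentage latency reduction. The following is a minimal sketch of that arithmetic only; the function names are illustrative and are not part of Pruna AI's or GLiNER's tooling.

```python
def speedup(before_ms: float, after_ms: float) -> float:
    """Return how many times faster the optimized run is."""
    return before_ms / after_ms


def latency_reduction_pct(before_ms: float, after_ms: float) -> float:
    """Return the latency saved as a percentage of the original latency."""
    return 100.0 * (before_ms - after_ms) / before_ms


# Sub-process latencies quoted in the case study.
before_ms, after_ms = 35.0, 19.0

print(f"speedup: {speedup(before_ms, after_ms):.2f}x")  # speedup: 1.84x
print(f"latency reduction: {latency_reduction_pct(before_ms, after_ms):.1f}%")  # latency reduction: 45.7%
```

A 1.84x speedup on this sub-process corresponds to roughly a 46% drop in its latency, which is consistent with the "nearly doubled" wording and well past the 10% target.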
