Case Study: GLiNER cuts inference latency by 2x with Pruna AI

A Pruna AI Case Study

Preview of the GLiNER Case Study

Pruna Cuts GLiNER Latency by 2x for the Largest Cloud Monitoring Platform

The customer, a large cloud monitoring-as-a-service platform that cannot be named, faced significant cost and latency challenges after deploying a GLiNER model to detect personally identifiable information (PII) in millions of log streams per second. It engaged Pruna AI with a clear goal: reduce the model's latency by at least 10% without compromising quality.

Pruna AI implemented a multi-faceted optimization strategy, combining techniques such as half-precision (float16) inference, model compilation with torch.compile, and targeted pruning. This nearly doubled the speed of a critical sub-process, cutting its latency from 35 ms to 19 ms. Overall, the work delivered a 2x speedup and a 50% reduction in memory usage. For the customer, this translated into significant financial savings, estimated at up to €28,260 per year for a single feature on a modest GPU setup, while also improving the end-user experience.
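
To make the sub-process figures concrete, the short sketch below works out the speedup and the relative latency reduction from the 35 ms and 19 ms values quoted above. The function names are illustrative and not part of any Pruna AI tooling; the calculation covers only the sub-process, not the overall 2x figure.

```python
def speedup(before_ms: float, after_ms: float) -> float:
    """Speedup factor: how many times faster the optimized run is."""
    return before_ms / after_ms


def latency_reduction_pct(before_ms: float, after_ms: float) -> float:
    """Share of the original latency that was removed, in percent."""
    return (before_ms - after_ms) / before_ms * 100


# Sub-process latencies quoted in the case study.
before_ms, after_ms = 35.0, 19.0

print(f"speedup: {speedup(before_ms, after_ms):.2f}x")
print(f"latency reduction: {latency_reduction_pct(before_ms, after_ms):.1f}%")
```

A factor of about 1.84x is what "nearly doubled" means here: roughly 46% of the sub-process latency was removed, well beyond the original 10% target.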
