No external monitoring solutions required for autoscaling decisions
Native timeseries support within the ML engine in an embedded capacity
Helps provide fault-tolerance at the software architecture layer
Powering web-scale systems with hundreds of millions of monthly users
Comprehensive documentation accelerated the integration
High-performance solution for real-time resource monitoring
Yahoo’s media, technology, and business platform serves content to hundreds of millions of users per month. To power search and recommendation, Yahoo relies on a custom machine learning engine which serves personalized content in real-time. High-performance is critical at this scale because every extra millisecond a consumer has to wait matters.
In this case study, VP Architect Jon Bratseth describes how and why QuestDB is relied upon within high-performance machine learning engines at Yahoo.
Vespa is an open-source big data processing and serving engine powering applications at Yahoo. Many products, such as Yahoo News, Yahoo Sports, and Yahoo Finance, use this engine to store, search, organize, and make machine-learned inferences over big data at serving time.
These platforms serve close to a billion users, and we needed an embedded solution that monitors resource utilization metrics in nodes within our application clusters. We decided to use QuestDB to store and analyze application monitoring metrics quickly and easily within the application itself, removing external failure modes and leaving plenty of headroom for performance.
The Vespa platform performs machine-learned inferences over big data with high availability and high performance. The engine powers search, question answering for chatbots, recommendation and personalization, and typeahead suggestions for systems that serve close to a billion users at a rate of 500k queries per second. In these use cases, the engine needs to select a subset of data from a vast data corpus, evaluate machine-learned models over the selected data, organize and aggregate it, and return results in less than 100 milliseconds while the data corpus is continuously changing.
We’re running a large number of deployments on behalf of customers. Each consists of several clusters running on dedicated Docker containers. The clusters are controlled by a shared control plane mainly consisting of ’configuration server clusters’ where a single cluster runs per environment and region and handles application configuration.
Autoscaling lets you adjust the hardware resources allocated to application clusters automatically depending on actual usage. You want your application to use as few resources as possible to minimize cost, but at the same time, you don’t want to run out of resources when traffic is high or you feed more data. We want clusters to maintain the optimal allocation required to handle the current load at any time.
We run QuestDB embedded in the admin and configuration clusters to store a few days of resource usage of the nodes. The data is sampled continuously for each cluster or node, and we query the resource utilization data to make scaling decisions based on that data. We use QuestDB to act on the usage patterns observed on the system in the recent past. When users deploy a new cluster (or application), defaults initially configure the minimal resources provided within a given range. When engineers enable autoscaling for a cluster, it will continue unchanged until autoscaling determines that a change is beneficial.
Our ’ideal utilization’ considers that a node may be down or failing, that another region may be down, which can cause a doubling of traffic. We can also factor into the equation that we need headroom for maintenance operations and handling requests with low latency.
We decided that metric collection should be embedded within nodes that manage the clusters. We embed QuestDB to avoid failure scenarios where some parts of the cluster work, but others don’t. Using a time series database within Yahoo’s recommendation engine helps include fault-tolerance measures within the engine’s architecture and allows engineers to make AI-driven decisions using customer data in real-time, at any scale.
“We use QuestDB to monitor metrics for autoscaling decisions within our ML engine that provides search, recommendation, and personalization via models and aggregations on continuously changing data.”
Jon Bratseth, VP Architect at Yahoo