This document describes the options available for monitoring the health of a
QuestDB instance. There are options for minimal health checks via a min server,
which provides a basic 'up/down' check, or detailed metrics in Prometheus format
exposed via an HTTP endpoint.
Prometheus is an open-source systems monitoring and alerting toolkit. Prometheus collects and stores metrics as time series data, i.e. metrics information is stored with the timestamp at which it was recorded, alongside optional key-value pairs called labels.
QuestDB exposes a /metrics endpoint which provides internal system metrics in
Prometheus format. To use this functionality and get started with example
configuration, refer to the Prometheus documentation.
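
As a rough illustration of what a scraper receives, the sketch below parses a few lines of Prometheus text exposition format into a dictionary. The sample payload and the metric names in it are invented for illustration and are not actual QuestDB output. The parser assumes simple lines without a trailing timestamp.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse lines of the form '<name>[{labels}] <value>' into a dict."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        # Assumption: no trailing timestamp, so the value is the last field.
        series, value = line.rsplit(None, 1)
        samples[series] = float(value)
    return samples


# Hypothetical payload; these metric names are illustrative only.
SAMPLE = """\
# TYPE example_queries_total counter
example_queries_total 42
# TYPE example_memory_bytes gauge
example_memory_bytes{kind="heap"} 1048576
"""

metrics = parse_metrics(SAMPLE)
```

In practice a Prometheus server does this parsing itself, so a snippet like this is mainly useful for ad hoc inspection of the endpoint.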
REST APIs will often be situated behind a load balancer that uses a monitor URL
for its configuration. Having a load balancer query the QuestDB REST endpoints
(port 9000 by default) will cause internal logs to become excessively
noisy. Additionally, configuring per-URL logging would increase server latency.
To provide a dedicated health check feature that has no performance knock-on
effect on other system components, we opted to decouple health checks from the
REST endpoints used for querying and ingesting data. For this purpose, a min HTTP
server runs embedded in a QuestDB instance and has a separate log and thread
pool configuration.
The configuration section for the
min HTTP server is available in the
minimal HTTP server reference.
The min server is enabled by default and will reply to any
HTTP GET request.
The server will respond with an HTTP status code of
200, indicating that the
system is operational.
Path segments are ignored, which means that optional paths may be used in the URL and the server will respond with identical results.
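
The check described above can be sketched as a small probe. This is a minimal sketch, not an official client: the function name check_health is hypothetical, and the host and port must match your own min server configuration.

```python
import http.client


def check_health(host: str, port: int, path: str = "/", timeout: float = 2.0) -> int:
    """Send an HTTP GET and return the status code, or 0 if unreachable."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        return conn.getresponse().status
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, and malformed responses.
        return 0
    finally:
        conn.close()
```

Because path segments are ignored, a call with path="/" and a call with an arbitrary path such as "/status" would be expected to return the same status code. A return value of 0 is this sketch's own convention for "no response" and is not an HTTP status.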
The /metrics path segment is reserved for metrics exposed in Prometheus
format. For more details, see the Prometheus documentation.
When the metrics subsystem is enabled
on the database, the health endpoint checks for occurrences of unhandled,
critical errors since the database started and, if any were detected, it
returns an HTTP 500 status code. The check is based on a metric maintained by the metrics subsystem.
When the metrics subsystem is disabled, the health check endpoint always returns an HTTP 200 status code.
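
The two cases above amount to a simple decision rule, modelled below. This is only a model of the documented behavior, not QuestDB's implementation; the function and parameter names are hypothetical.

```python
def health_status(metrics_enabled: bool, unhandled_errors: int) -> int:
    """Return the HTTP status the health endpoint is described as returning."""
    # Errors are only considered when the metrics subsystem is enabled.
    if metrics_enabled and unhandled_errors > 0:
        return 500
    # Metrics disabled, or no unhandled errors since start.
    return 200
```

One consequence worth noting: with metrics disabled, a 200 response says only that the server is reachable, not that no critical errors have occurred.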
On systems with 8 cores or fewer, contention for threads might increase the latency of health check service responses. If a load balancer considers the QuestDB service dead while nothing is apparent in the QuestDB logs, you may need to configure a dedicated thread pool for the health check service. For more information, see the minimal HTTP server configuration.
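
To judge whether slow health responses are the cause, one option is to time repeated probes against the load balancer's timeout. The helper below is a hedged sketch: slow_probes, its parameters, and the notion of a time budget are illustrative assumptions, and probe can be any callable that performs one health check.

```python
import time
from typing import Callable


def slow_probes(probe: Callable[[], object], attempts: int, budget_s: float) -> int:
    """Run probe() `attempts` times and count runs slower than budget_s seconds."""
    slow = 0
    for _ in range(attempts):
        start = time.perf_counter()
        probe()
        # Compare elapsed wall-clock time with the allowed budget.
        if time.perf_counter() - start > budget_s:
            slow += 1
    return slow
```

A non-zero count with a budget equal to the load balancer's timeout would suggest that latency, rather than an actual outage, is what the load balancer is seeing.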