This comparison comes from Yitaek Hwang, who has put together his thoughts on the features and functionality you should look out for when evaluating time-series databases. This overview takes a look at QuestDB, TimescaleDB, and InfluxDB for features, functionality, maturity, and performance.
Post edited in March 2023 to reflect the latest updates
#Why time-series databases are popular
We're living in the golden age of databases, as money flows into the industry at historical rates (e.g., Snowflake, MongoDB, Cockroach Labs, Neo4j). The last decade saw an explosion of purpose-built databases designed for use cases in which a conventional RDBMS might not be the best fit. According to DB-Engines, an initiative to collect and present information on database management systems, time-series databases have been one of the fastest growing technologies in the past few years:
#Why use a time-series database?
Time-series databases (TSDB) are databases optimized to ingest, process, and store timestamped data. Such data may include metrics from servers and applications, readings from IoT sensors, user interaction on a website or an app, or trading activity on financial markets.
The following properties usually characterize time-series workloads:
- Each data point includes a timestamp used to index, aggregate, and sample. The data can also be multi-dimensional and correlated.
- High write speed (ingestion) is preferred to capture data at high frequencies.
- Summarized view of the data (e.g., downsampled or aggregated view, trend lines) may provide more insight than a single data point. For example, given network unreliability or outliers in sensor readings, we may set alerts when some average value over time exceeds a threshold rather than doing so on a single data point.
- Analyzing the data usually requires accessing it over some window of time (e.g., give me the click rate data over the past week).
While other databases can also handle time-series data to a certain extent, TSDBs are designed with the above properties to handle data ingestion, storage, and aggregation over time more efficiently. So as the demand for time series data continues to explode on the heels of cloud computing, IoT, financial market data, and machine learning, how should architects go about choosing a time-series database?
First released in 2013, InfluxDB is the market leader in the TSDB space, overtaking Graphite and OpenTSDB that came before. As with many OSS database companies, InfluxDB is licensed with MIT License for a single node, with paid-plans available for InfluxDB Cloud and InfluxDB enterprise that provide clustering and other production-ready features.
Previously, the InfluxDB Platform consisted of the TICK stack: Telegraf (agent for collecting and reporting metrics), InfluxDB (the database engine), Chronograf (web interface to InfluxDB), and Kapacitor (background processing). Since version 2.x InfluxDB is installed as a single binary that bundles the database, the UI, and the background processor. If you want to use Telegraf, it is still supported as a separate binary.
At the time of this writing (March 2023), InfluxDB offers additional versions with different query engines and different capabilities. We are referring here to Open Source InfluxDB 2.6.
To understand how InfluxDB works, we need to grasp the following key concepts:
- Data model (tagset model): Besides the timestamp field, each data element consists of various tags (optional, indexed metadata fields), fields (key and value), and measurement (container for tags, fields, and timestamps). The example below takes census data from bees and ants, collected in Klamath and Portland by scientists Anderson and Mullen. Here location and scientist are tags, falling under the census measurement with field/value pair for bees and ants.
Schemaless Design: InfluxDB does not enforce a schema before ingesting data. Instead, the schema is automatically inferred from the tags and fields sent over the line protocol. This gives you great flexibility when writing your applications. In any case, it is possible to enforce schemas for specific buckets if needed.
Data storage (TSM & TSI): data points are stored in time-structured merge tree (TSM), and time series index (TSI) files. TSM can be thought of a LSM tree with write-ahead log (WAL) and read-only files similar to SSTables that are sorted and compressed. In the event of high cardinality (i.e., large number of elements in a set), performance can decrease. The proprietary InfluxDB Cloud (IOx) claims high cardinality is not an issue anymore.
Query languages: InfluxDB originally supported only InfluxQL, an SQL-like language, but in 2018 its CTO announced Flux, a domain-specific functional language developed by InfluxDB, which became then the recommended language. InfluxDB's decision to create its custom functional data scripting language (Flux) presents another layer of complexity to mastering this ecosystem. The team at InfluxDB points to two motivations to move away from SQL-like InfluxQL to Flux. Flux syntax requires some effort to adapt, especially if you're looking for simple SQL queries or not looking to learn yet another new language. Still considering the large community and integrations that InfluxDB has assembled, some of the advantages of Flux start to materialize, especially when combined with built-in dashboards.
In the latest (non open source) InfluxDB Cloud IOx, SQL is now the preferred language.
Overall, InfluxDB is a great choice if the time series data fits nicely with the tagset model. Their main use case seems to be geared towards infrastructure/application monitoring, but as the market leader in this space, InfluxDB also seamlessly integrates with popular data sources.
- Pros: schemaless ingestion, huge community, integrations with popular tools
- Cons: performance decreases on datasets with high-cardinality. Open Source version supports a single instance. Multiple versions with different database engines, supported languages, and features. New querying language (Flux) to learn.
Whereas InfluxDB opted to build a new database and custom language from scratch, on the other end of the spectrum is TimescaleDB. TimescaleDB is built as an extension on top of PostgreSQL and adds an intermediary layer called hypertables that chunks data into multiple underlying tables while abstracting it as a single, large table for interacting with the data.
PostgreSQL compatibility is TimescaleDB's biggest selling point. TimescaleDB fully supports all of the SQL features (e.g., joins, secondary and partial indexes) as well as popular extensions like PostGIS. More importantly, TimescaleDB inherits decades of knowledge from both developers running SQL queries as well as database and sysadmins running PostgreSQL at scale. Since TimescaleDB is just a PostgreSQL extension, cloud-managed options (e.g. Azure Database for PostgreSQL, Aiven) aside from TimescaleDB's own managed offering are readily available, not to mention the self-managed options on VMs or containers.
Because TimescaleDB began as an IoT platform, where they used InfluxDB at first, to store their sensor data, its features bode well for IoT time series data which is often 'bursty', frequently out of order due to network unreliability, and characterized by high cardinality:
Hypertables: Time-series functionality offerred by TimescaleDB cannot be used on regular PostgreSQL tables. TimescaleDB introduces the concept of a hypertable instead. Even if the user experience is similar to a regular table, hypertables have more complexity behind the scenes. TimescaleDB partitions its hypertables into chunks based on a time column as well as other "spatial" values such as a device uid, location identifier, or a stock symbol. Users can configure these chunks to hold the most recent data in memory, asynchronously compress and re-order data by the time column to disk (instead of ingestion time), and replicate or migrate transactionally across nodes.
Continuous Aggregation: TimescaleDB also supports continuous aggregation of data to make computing key metrics like hourly average, minimum, and maximum values fast. Behind the scenes, the data is stored using PostgreSQL materialized views; additional storage is needed as a trade-off for faster queries. For time-based aggregations (e.g., average temperature between 3 pm and 4 pm vs. the exact temperature at 3 pm), scanning fewer data for each aggregation can help performance-wise.
Data Retention: Large deletes are a costly operation in traditional relational databases. However, since TimescaleDB stores data in chunks, it provides a
drop_chunksfeature to quickly drop old data without the same overhead. Since the relevance of old data diminishes over time, TimescaleDB can be used to move older data to save disk space.
Overall, TimescaleDB is a good fit for teams looking for a performance boost without heavy refactoring to migrate off their existing SQL databases. Even though TimescaleDB is still relatively new (first release in 2017), the decision to build on top of PostgreSQL has boosted its adoption numbers to reach the top 5 TSDBs.
One drawback is the lack of a streaming ingestion protocol like the InfluxDB Line Protocol. This makes ingestion less flexible and less focused on performance as a result.
- Pros: PostgreSQL-compatibility, scales well with data cardinality, various deployment models available.
- Cons: enforced schema (might add a bit of complexity and data transformation effort before ingestion), extra storage needed for continuous aggregations, lack of streaming ingestion protocol.
For those looking to take advantage of the flexibility of InfluxDB line protocol and the familiarity of PostgreSQL, a younger time series database may satisfy both requirements without sacrificing performance. QuestDB (YC S20) is an open-source TSDB written in low-latency Java, C++, and Rust. It ranks in #10 on DB-Engines Time Series DBMS, although it officially launched less than three years ago. Alongside the Apache 2.0 licensed project, QuestDB also offers a fully managed Cloud offering, and a commercial Enterprise version. QuestDB implements a highly optimized column-based storage model. The data is sorted by time natively and stored in time partitions. Those partitions are append-only and versioned. QuestDB only lifts the latest version of the partition in memory for any given time-based query. As such, queries do not slow down as more data is being ingested.
By building the database engine from scratch, the QuestDB team focused on three things:
Performance: Solving the ingestion bottleneck with high throughput ingestion, for datasets of all sizes. QuestDB addresses one of the pain points of InfluxDB, with its ability to handle high cardinality datasets. To speed up queries, QuestDB leverages modern hardware with SIMD instructions, aggregating multiple rows simultaneously. The engineering team built a JIT compiler designed to parallelize query filters during execution. QuestDB stores data in time partitions; this reduces the volume of data which has to fit in memory for reads. Finally, the database is column based and can handle a high degree of parallelism, slicing and processing columns in parallel by all available threads.
Compatibility: QuestDB supports InfluxDB line protocol, PostgreSQL wire, REST API and CSV upload to ingest data. Users accustomed to other time-series databases can easily port over their existing applications without a significant rewrite, and users can choose amongst a set of ingestion protocols depending on their use cases. There are client libraries in multiple languages for the InfluxDB Line Protocol.
Querying via SQL: Despite supporting multiple ingestion mechanisms, QuestDB uses SQL as the query language, so there's no need to learn a domain-specific language like Flux. Additionally, there are time-series-specific SQL extensions to handle time-series data with ease.
In terms of performance, QuestDB
recently posted some benchmark results
with an ingestion rate reaching 4.3 million rows per second. This was computed
through the TSBS benchmark, currently
maintained by TimescaleDB, selecting the
cpu-only use case. The chosen
instance had 32 CPUs and 64GB of RAM, with a fast SSD disk (NVMe). Fast disks
allow to better cope with writes that are I/O bound - in this case when a lot of
out of order data is being ingested concurrently. For most use cases, slower
disks such as AWS EBS volumes are enough to extract sufficient performance.
The other interesting component of QuestDB is support for both InfluxDB Line Protocol and PostgreSQL wire for ingestion. For existing InfluxDB users, you can configure Telegraf to point to QuestDB's address and port. PostgreSQL users can leverage their client libraries or JDBC to write data into QuestDB. Regardless of the ingestion method, data can be queried using standard SQL with notable exceptions listed on the API reference page.
Even though a single QuestDB instance can handle multi-billion row datasets, one of its shortcomings at the time of writing is the lack of replication, which will be available to open source users soon. When it comes to third party integrations, QuestDB integrates with some of the most popular tools (e.g., PostgreSQL, Grafana, Pandas, Apache Kafka, Apache Flink, MindsDB, Telegraf, Kubernetes...). There are more integrations underway, such as more dashboards / BI tools.
- Pros: fast ingestion, better performance using fewer resources, support for both InfluxDB protocol and PostgreSQL wire, querying via standard SQL, optimized queries with SIMD, open source and fully managed QuestDB versions available.
- Cons: smaller community, reduced number of available integrations, replication still under development.
As the demand for time series data continues to grow, time-series databases specialized to deal with this data will see massive adoption and fierce competition. Besides the three open-source TSDBs covered in this article, there are also public cloud offerings from AWS (AWS Timestream) and Azure (Azure Series Insights).
As with all databases, choosing the "perfect" time-series database depends primarily on your business requirements, data model, and use case. InfluxDB works well if your data fits the Tagset model with a rich ecosystem of integrations readily available. TimescaleDB is a natural fit for existing PostgreSQL users. Finally, if performance is the primary concern, QuestDB is a popular and developer-friendly database that is growing rapidly.