The future of fast databases: Lessons from a decade of QuestDB
When was the last time you wished your database was slower at ingesting or querying data? Exactly — no one ever does. That’s the question I posed to the audience at Big Data London, and as expected, not a single hand went up. I followed up with another: “How excited would you be to work with a new database that uses non-standard data storage, requiring you to write custom ETL just to integrate it with your existing systems?” Still, no hands.
This framed the conversation about fast databases — what they mean today and how they’re evolving to meet the demands of the future. To me, a fast database is one designed for frequent, performant ingestion of millions of records and efficient querying of datasets with billions of records. But before looking ahead, let’s revisit the past.
Watch the Big Data LDN 2024 talk and read along!
A look back: Databases in the 90s
In the 90s, databases were often the bottleneck in projects. Systems weren’t designed for the real-time data demands of today, handling thousands or, at most, hundreds of thousands of records. Databases back then followed the OLTP model, optimized for reads rather than writes, and indexing was crucial for speeding up queries.
The arrival of OLAP and NoSQL databases brought new capabilities. NoSQL offered fast inserts and simple, non-analytical queries, while OLAP was optimized for batch inserts and complex analytical queries. With the success of MapReduce and HDFS, OLAP systems introduced distributed query capabilities, paving the way for the emergence of data lakes.
However, the data lake model had a significant limitation: immutability. Once data was written, updating records was a costly and slow process, especially with object storage. These systems thrived on batch processing but were unsuitable for real-time streaming data—enter time-series databases.
The power of streaming and time-series databases
Today, real-time data is essential. Everyone wants their data as fresh as possible. However, streaming data presents challenges: it’s large, continuous, and often incomplete. It arrives in bursts, may be out of order, and is frequently updated after initial processing.
This is where time-series databases shine. They are designed to handle fast ingestion, perform rapid querying of recent data, and offer powerful time-based analytics. To maintain performance, time-series databases often downsample or delete older data as a trade-off for speed. Well-optimized time-series databases also excel with wide data shapes, such as high-cardinality data.
If you've never used a time-series database, it might help to get your hands on one. QuestDB offers a live, public demo. You can launch it right from compatible code snippets.
For demonstration purposes, let's use the powerful SAMPLE BY SQL extension to look at historical stock prices. This extension summarizes large datasets into aggregates over uniform time intervals as part of a SELECT statement:
SELECT
  timestamp,
  avg(l2price(
    5000,
    ask_sz_00, ask_px_00,
    ask_sz_01, ask_px_01,
    ask_sz_02, ask_px_02,
    ask_sz_03, ask_px_03,
    ask_sz_04, ask_px_04,
    ask_sz_05, ask_px_05,
    ask_sz_06, ask_px_06,
    ask_sz_07, ask_px_07,
    ask_sz_08, ask_px_08,
    ask_sz_09, ask_px_09
  ) - ask_px_00)
FROM AAPL_orderbook
WHERE timestamp IN '2023-08-25T13:30:00;6h'
SAMPLE BY 10m;
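Roughly speaking, l2price walks the ask levels of the order book to estimate the average execution price for a 5,000-share order; subtracting ask_px_00 gives the slippage versus the best ask, averaged into 10-minute buckets across a six-hour window.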
The future of fast databases
QuestDB is already known for both fast ingestion and fast querying, making it ideal for near real-time dashboards and low-latency applications. But that alone isn’t enough anymore. Two major trends are reshaping the expectations of database users:
- Open file formats: Formats like Apache Parquet, Apache Hudi, and Delta Lake are enabling a new era of data lakes with mutable data, schema evolution, and streaming. These open formats allow multiple applications and engines to access the same dataset without duplication. Databases that cannot work with such formats are becoming less attractive.
- Machine learning and data science: With the rise of data science and machine learning, the output of a query can range from a few rows to power a dashboard to millions of rows required to train an ML model. This means egress (getting data out) must be as fast as ingress (getting data in).
What a future-proof fast database should support
To stay relevant in this evolving landscape, fast databases should support the following:
Distributed computing
Decoupling storage from computation is essential for scalability. It allows for seamless interaction between recent and historical data.
Open formats
Data should be stored in formats like Apache Parquet to allow direct access without always relying on the database engine. The feature rollout within QuestDB is already underway.
Seamless interoperability
By supporting open formats, databases should be able to query data generated elsewhere, without needing to re-ingest it.
QuestDB’s next steps
So, what’s next for QuestDB? We’re building a distributed query engine that’s decoupled from storage, with data stored in compressed Parquet. This allows users to bypass the QuestDB query engine entirely when accessing full datasets from third-party tools. A subset of the functionality we are building around Parquet is already available today.
You can access Parquet files and query them via time-series extensions:
SELECT timestamp, avg(price)
FROM (read_parquet('trades.parquet') timestamp(timestamp))
SAMPLE BY 15m;
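And because the data sits in an open format, third-party tools can read the very same file with no QuestDB involvement at all. Here is a minimal sketch in Python, assuming a local trades.parquet with timestamp and price columns (pandas with pyarrow installed); the file name and columns mirror the query above for illustration:
import pandas as pd

# Read the same (hypothetical) Parquet file directly,
# bypassing the QuestDB query engine entirely.
df = pd.read_parquet("trades.parquet", columns=["timestamp", "price"])

# Resample to 15-minute averages on the client side,
# mirroring the SAMPLE BY query above.
avg_price = (
    df.set_index("timestamp")
      .resample("15min")["price"]
      .mean()
)
print(avg_price.head())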
Another key capability we’re developing covers the "first mile" of data: recent data will be stored in our binary WAL format. This ensures that fresh data can be queried quickly, while older data, stored in Parquet, can still be accessed with slightly higher latency. This design balances speed for real-time data with efficient long-term storage.
Additionally, QuestDB is adopting Apache Arrow, an open memory format that eliminates data deserialization on streams. Arrow is widely used by tools like Apache Spark, Pandas, and Dask. By using Arrow’s ADBC protocol, we can stream result sets in a columnar format directly in Arrow, eliminating the overhead of converting data formats and significantly reducing CPU usage.
For queries with large result sets, like interpolations that fill gaps in a table, users will be able to consume data directly as it streams over the network, making egress as fast as ingress and enabling data science and ML workflows on top of QuestDB.
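As a rough sketch of what that workflow could look like from Python, the snippet below uses the ADBC PostgreSQL driver against QuestDB's PostgreSQL-compatible endpoint and consumes the result as a stream of Arrow record batches. The driver pairing, connection details (port 8812, user admin), and table name are assumptions for illustration, not the final Arrow-native path described above:
import adbc_driver_postgresql.dbapi as adbc_pg

# Hypothetical connection string: QuestDB's PostgreSQL wire protocol
# endpoint typically listens on port 8812.
uri = "postgresql://admin:quest@localhost:8812/qdb"

conn = adbc_pg.connect(uri)
cur = conn.cursor()
cur.execute("SELECT timestamp, price FROM trades")

# Stream the result set as Arrow record batches instead of
# materializing the whole table in memory at once.
reader = cur.fetch_record_batch()
for batch in reader:
    # Each batch is a pyarrow.RecordBatch; in practice you would feed it
    # into a dataframe, feature store, or ML training pipeline.
    print(batch.num_rows)

cur.close()
conn.close()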
Conclusion
As we move forward, QuestDB continues to evolve to meet the needs of both real-time and analytical workloads. We are building a high-performance, flexible database that supports distributed computation, open file formats, and fast ingestion and egress.
Whether you’re looking for an open-source solution for production workloads or need enterprise-grade capabilities, QuestDB is designed to deliver both speed and flexibility.
If you would like to see the recorded version of the talk at Big Data London, you can also find it directly on YouTube.
To learn more, check out our GitHub repository, try the demo, or join the conversation on Slack. The future of fast databases is here, and we’re excited to keep pushing the boundaries.