Tracking data changes (CDC) in QuestDB

QuestDB is the world's fastest-growing time-series database. It offers high ingestion throughput, fast SQL analytics, and cost-saving hardware efficiency. It's open source and integrates with many tools and languages.

Tracking changes in your database is important for applications such as updating machine learning models, pushing data to webhooks, monitoring data ingestion, and much more. QuestDB provides mechanisms to handle these tasks efficiently. In this post, we'll explore how to monitor data changes in QuestDB and leverage these capabilities for advanced use cases.

Why track data changes?

There are a few practical reasons to track data changes:

  1. Real-time integrations: Push data changes to webhooks or other applications in real time, enabling immediate data processing and integrations.
  2. Machine learning: Automatically retrain models when new data reaches a certain threshold, ensuring your models stay current with the latest data.
  3. Continuous table materialization: Continuously update materialized views based on incoming data, ensuring that derived tables remain up-to-date.
  4. Table-level monitoring: Keep track of data ingestion rates, detect anomalies, and ensure your database tables are performing optimally.

Change Data Capture in QuestDB

QuestDB can be used as a target for Change Data Capture (CDC) using tools like Debezium. However, it cannot be used as an origin of CDC out of the box. To track data changes within QuestDB, you can leverage the wal_transactions pseudo-table, which provides metadata about the transactions applied to your tables.

Using wal_transactions

The wal_transactions() function returns metadata about the transactions applied to a table. To query it, run the following:

SELECT * FROM wal_transactions('your_table_name');

This query returns several columns, including:

  • sequencerTxn: The transaction ID.
  • minTimestamp and maxTimestamp: The range of timestamps for the data in the transaction.
  • rowCount: The number of rows affected by the transaction.
  • structureVersion: The version of the table structure at the time of the transaction.
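As a sketch of how you might read this metadata programmatically, the query above can be issued over QuestDB's REST API (the /exec endpoint, on port 9000 by default). The host and table name below are illustrative assumptions:

```python
import json
import urllib.parse
import urllib.request

def wal_transactions_url(host: str, table: str) -> str:
    """Build the REST /exec URL for the wal_transactions() query."""
    query = f"SELECT * FROM wal_transactions('{table}');"
    return f"{host}/exec?" + urllib.parse.urlencode({"query": query})

def fetch_transactions(host: str, table: str) -> list:
    """Fetch transaction metadata and return it as a list of row dicts."""
    with urllib.request.urlopen(wal_transactions_url(host, table)) as resp:
        payload = json.load(resp)
    # /exec returns column metadata and rows separately; zip them together
    names = [col["name"] for col in payload["columns"]]
    return [dict(zip(names, row)) for row in payload["dataset"]]

# Example (requires a running QuestDB instance):
# rows = fetch_transactions("http://localhost:9000", "smart_meters")
# print(rows[-1]["sequencerTxn"])
```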

There are, however, some caveats to note.

Important caveats

  1. Beta feature: To get data in columns such as minTimestamp, maxTimestamp, and rowCount, you need to enable a beta feature. This feature might not be fully reliable and could change in the future. We'll demonstrate how in the next section.

  2. Configuration option: Set the configuration option cairo.default.sequencer.part.txn.count to an appropriate value. For example, setting it to 86400 will keep metadata for every transaction for a whole day if you have one transaction per second.

  3. New tables only: This workaround only works for new tables. For existing tables, the minTimestamp, maxTimestamp, and rowCount columns in the wal_transactions table will remain null.

  4. Manual re-conversion: For tables using this workaround, you might need to re-convert the tables manually in the future if QuestDB changes the table formats.

  5. Data order: When data is inserted in order, the minTimestamp and maxTimestamp boundaries will not overlap between transactions. If data can be out of order, the minTimestamp and maxTimestamp will indicate the range of timestamps for the rows added, but you won't know the exact rows added.
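Caveats 3 and 5 both show up when you combine the per-transaction windows into a single range to re-process: null boundaries (pre-existing tables) must be skipped, and with out-of-order data the windows may overlap, so the union of all windows is the safe range. A minimal sketch, assuming timestamps arrive as same-format ISO strings (which compare correctly lexicographically):

```python
def combined_window(transactions: list):
    """Union of the [minTimestamp, maxTimestamp] windows of several WAL
    transactions, or None if no transaction carries boundaries."""
    windows = [
        (t["minTimestamp"], t["maxTimestamp"])
        for t in transactions
        if t.get("minTimestamp") is not None  # null for pre-existing tables
    ]
    if not windows:
        return None
    return min(w[0] for w in windows), max(w[1] for w in windows)
```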

Enable beta features

To enable the beta feature, you need to update your QuestDB configuration file or set the corresponding environment variable:

cairo.default.sequencer.part.txn.count=86400

Alternatively, set an environment variable:

export QDB_CAIRO_DEFAULT_SEQUENCER_PART_TXN_COUNT=86400

Once the beta feature is enabled, you can use the wal_transactions table to track changes in your QuestDB tables. However, be mindful of the caveats mentioned above and ensure that your tracking logic accounts for these limitations.
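The tracking logic itself reduces to keeping a cursor of the last sequencerTxn you processed and acting only on newer transactions. This is an illustrative sketch, not part of QuestDB or the sample repository:

```python
def poll_once(rows: list, last_seen_txn: int):
    """Given the rows of wal_transactions() as dicts, return the transactions
    newer than last_seen_txn (in order) and the advanced cursor."""
    new = sorted(
        (r for r in rows if r["sequencerTxn"] > last_seen_txn),
        key=lambda r: r["sequencerTxn"],
    )
    if new:
        last_seen_txn = new[-1]["sequencerTxn"]
    return new, last_seen_txn

# A real watcher would call this in a loop, fetching rows from QuestDB and
# persisting the cursor between runs, e.g.:
#
#     while True:
#         new, cursor = poll_once(fetch_rows(), cursor)
#         handle(new)
#         time.sleep(5)
```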

Sample Repository

To help you get started, we've created a sample repository: QuestDB Change Tracker. This repository includes Python scripts that demonstrate how to:

  • Monitor table changes using the wal_transactions table.
  • Aggregate data based on row thresholds.
  • Materialize views dynamically based on data changes.

The scripts are designed to mirror a real-life scenario, but they lack robust error handling. They are not intended for production use, but as a useful starting point to adapt to your use case.

Demonstrating data change tracking in action

Even though QuestDB is a very fast database, some queries over huge amounts of data — potentially joining multiple tables — can take a few seconds, rather than a few milliseconds. While that's adequate when doing interactive data analytics, it is not ideal when exposing dashboards to end users, or when using QuestDB to make near-real-time decisions.

One potential solution for speeding up complex queries is to materialize the results into a table, so subsequent SELECT statements read from the materialized table and return near-instantly. The catch with materialized tables is that you need to refresh them when new data arrives, or the results become stale.

One way to refresh a materialized table is on a schedule, executing an INSERT INTO ... SELECT FROM... statement periodically. But this assumes a steady flow of incoming data; otherwise, you might be materializing too frequently or too late.

A better option might be to monitor your source tables and run the materialization query whenever you reach a threshold of new or updated rows. This demo script does exactly that.
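The core decision can be sketched as a pure function: accumulate the rowCount of unseen transactions per table, and materialize once each monitored table has crossed its threshold. This is one reasonable policy, not necessarily the exact one the script implements:

```python
def should_materialize(new_rows: dict, thresholds: dict) -> bool:
    """True when every monitored table has accumulated at least its
    configured threshold of new rows since the last materialization."""
    return all(
        new_rows.get(table, 0) >= minimum
        for table, minimum in thresholds.items()
    )
```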

The QuestDB team is already working on native materialized tables, which will be available in the coming months. In the meantime, this script serves as an example of what’s possible with the current features.

In this example, we monitor the smart_meters and trades tables, and when the specified thresholds are met, the script materializes the sampled_meters view. It also keeps track of the transactions it has already seen using a tracking table, so it can survive restarts.

Here’s a sample execution from one of the scripts in the repository:

# Monitor changes in the 'smart_meters' and 'trades' tables and materialize the 'sampled_meters' view
python materialize_view.py --table_names smart_meters,trades --thresholds 100,50 --sql_template_path materialize.sql --check_interval 5 --timestamp_columns smart_meters.timestamp,trades.timestamp --tracking_table materialize_tracker --tracking_id meters_and_trades

The output looks like this:

Starting from transaction ID: 308 with structure version: 3 for table smart_meters
Starting from transaction ID: 3728 with structure version: 0 for table trades
Executed query:
INSERT INTO sampled_meters(
timestamp, device_id, mark_model,
first_status, last_status, frequency, energy_consumption, voltage, current, power_factor,
price
)
SELECT smart_meters.timestamp, device_id, mark_model,
first(status), last(status),
avg(frequency), avg(energy_consumption), avg(voltage), avg(current),
avg(power_factor), avg(price)
FROM smart_meters ASOF JOIN trades
WHERE smart_meters.timestamp >= '2024-07-29 14:51:34.144738' AND smart_meters.timestamp <= '2024-07-29 14:52:04.044696' AND trades.timestamp >= '2024-07-29 14:51:37.107452' AND trades.timestamp <= '2024-07-29 14:52:06.804897'
SAMPLE BY 10m;
Executed query:
INSERT INTO sampled_meters(
timestamp, device_id, mark_model,
first_status, last_status, frequency, energy_consumption, voltage, current, power_factor,
price
)
SELECT smart_meters.timestamp, device_id, mark_model,
first(status), last(status),
avg(frequency), avg(energy_consumption), avg(voltage), avg(current),
avg(power_factor), avg(price)
FROM smart_meters ASOF JOIN trades
WHERE smart_meters.timestamp >= '2024-07-29 14:52:04.142463' AND smart_meters.timestamp <= '2024-07-29 14:52:23.047161' AND trades.timestamp >= '2024-07-29 14:52:06.905866' AND trades.timestamp <= '2024-07-29 14:52:36.885151'
SAMPLE BY 10m;

Summary

QuestDB's ability to track data changes through the wal_transactions table opens up various possibilities for real-time data processing and analytics. Whether you need to perform real-time integrations, update machine learning models, continuously materialize views, or monitor table performance, you can leverage these capabilities to build robust solutions.

Come talk to us on Slack or Discourse!
