Tracking data changes (CDC) in QuestDB
Tracking changes in your database is important for applications such as updating machine learning models, pushing data to webhooks, monitoring data ingestion, and much more. QuestDB provides mechanisms to handle these tasks efficiently. In this post, we'll explore how to monitor data changes in QuestDB and leverage these capabilities for advanced use cases.
Why track data changes?
There are a few practical reasons to track data changes:
- Real-time integrations: Push data changes to webhooks or other applications in real time, enabling immediate data processing and integrations.
- Machine learning: Automatically retrain models when new data reaches a certain threshold, ensuring your models stay current with the latest data.
- Continuous table materialization: Continuously update materialized views based on incoming data, ensuring that derived tables remain up-to-date.
- Table-level monitoring: Keep track of data ingestion rates, detect anomalies, and ensure your database tables are performing optimally.
Change Data Capture in QuestDB
QuestDB can be used as a target for Change Data Capture (CDC) using tools like
Debezium. However, it cannot act as a CDC source out of the box. To
track data changes within QuestDB, you can leverage the wal_transactions
pseudo-table, which provides metadata about the transactions applied to your
tables.
Using wal_transactions
The wal_transactions function provides metadata about the transactions applied to a table. To query this function, run the following:
SELECT * FROM wal_transactions('your_table_name');
This query returns several columns, including:
- sequencerTxn: The transaction ID.
- minTimestamp and maxTimestamp: The range of timestamps for the data in the transaction.
- rowCount: The number of rows affected by the transaction.
- structureVersion: The version of the table structure at the time of the transaction.
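These columns are enough for simple table-level monitoring, such as estimating the ingestion rate of a single transaction. Below is a minimal sketch; the function name and the timestamp parsing are our own assumptions, not part of QuestDB's API, so adapt them to whichever client you use to read the query results.

# Minimal sketch: estimate rows per second for one wal_transactions row.
# Assumes minTimestamp/maxTimestamp arrive as ISO-8601 strings; adjust the
# parsing to match your client library's output format.
from datetime import datetime

def ingestion_rate(min_timestamp: str, max_timestamp: str, row_count: int) -> float:
    start = datetime.fromisoformat(min_timestamp)
    end = datetime.fromisoformat(max_timestamp)
    seconds = max((end - start).total_seconds(), 1e-6)  # avoid division by zero
    return row_count / seconds

# e.g. ingestion_rate('2024-07-29T14:51:34.144738', '2024-07-29T14:52:04.044696', 3000)
# -> roughly 100 rows per second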
Do note, however, that there are some caveats.
Important caveats
- Beta feature: To get data in columns such as minTimestamp, maxTimestamp, and rowCount, you need to enable a beta feature. This feature might not be fully reliable and could change in the future. We'll demonstrate how in the next section.
- Configuration option: Set the configuration option cairo.default.sequencer.part.txn.count to an appropriate value. For example, setting it to 86400 will keep metadata for every transaction for a whole day if you have one transaction per second.
- New tables only: This workaround only works for new tables. For existing tables, the minTimestamp, maxTimestamp, and rowCount columns in the wal_transactions table will remain null.
- Manual re-conversion: For tables using this workaround, you might need to re-convert the tables manually in the future if QuestDB changes the table formats.
- Data order: When data is inserted in order, the minTimestamp and maxTimestamp boundaries will not overlap between transactions. If data can be out of order, minTimestamp and maxTimestamp will indicate the range of timestamps for the rows added, but you won't know the exact rows added.
Enable beta features
To enable the beta feature, you need to update your QuestDB configuration file or set the corresponding environment variable:
cairo.default.sequencer.part.txn.count=86400
Alternatively, set an environment variable:
export QDB_CAIRO_DEFAULT_SEQUENCER_PART_TXN_COUNT=86400
Once the beta feature is enabled, you can use the wal_transactions table to track changes in your QuestDB tables. However, be mindful of the caveats mentioned above and ensure that your tracking logic accounts for these limitations.
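As a concrete illustration, here is a minimal polling sketch. It is not taken from the QuestDB documentation or the sample repository: it assumes QuestDB's HTTP /exec endpoint is reachable on localhost:9000, and the table name, polling interval, and in-memory bookkeeping are illustrative choices.

# Minimal sketch: poll wal_transactions() through QuestDB's HTTP /exec endpoint
# and report transactions we have not processed yet.
import time
import requests

QUESTDB_URL = "http://localhost:9000/exec"
TABLE = "your_table_name"

def unseen_transactions(last_seen_txn):
    query = (
        "SELECT sequencerTxn, minTimestamp, maxTimestamp, rowCount "
        f"FROM wal_transactions('{TABLE}') "
        f"WHERE sequencerTxn > {last_seen_txn}"
    )
    response = requests.get(QUESTDB_URL, params={"query": query}, timeout=10)
    response.raise_for_status()
    return response.json().get("dataset", [])

last_seen = 0
while True:
    for txn, min_ts, max_ts, row_count in unseen_transactions(last_seen):
        print(f"txn {txn}: {row_count} rows between {min_ts} and {max_ts}")
        last_seen = max(last_seen, txn)
    time.sleep(5)  # check every few seconds

In practice you would persist last_seen somewhere durable, which is exactly what the sample repository below does with a tracking table.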
Sample repository
To help you get started, we've created a sample repository: QuestDB Change Tracker. This repository includes Python scripts that demonstrate how to:
- Monitor table changes using the wal_transactions table.
- Aggregate data based on row thresholds (see the sketch below).
- Materialize views dynamically based on data changes.
The scripts are designed to mirror a real-life scenario, but they lack robust error handling. They are not intended for production use, but rather as a starting point to adapt to your own use case.
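The row-threshold idea in the second bullet boils down to summing the rowCount values of the transactions you have not yet processed and acting once the total crosses a limit. A tiny sketch of that check, with the function name and threshold chosen for illustration rather than taken from the repository:

# Minimal sketch: decide whether enough new rows have accumulated to act.
# row_counts would come from the rowCount column of unseen wal_transactions rows.
def threshold_reached(row_counts, threshold=100):
    return sum(row_counts) >= threshold

# threshold_reached([40, 35, 30]) -> True: time to run the aggregation query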
Demonstrating data change tracking in action
Even though QuestDB is a very fast database, some queries over huge amounts of data, potentially joining multiple tables, can take a few seconds rather than a few milliseconds. While that's adequate for interactive data analytics, it is not ideal when exposing dashboards to end users, or when using QuestDB to make near real-time decisions.
One potential solution for speeding up complex queries is materializing the results into a table, so subsequent SELECT statements read from the materialized table and return almost instantly. The problem with materialized tables is that you need to refresh them when you have new data, otherwise the results become stale.
One way to refresh a materialized table is to do it on a schedule, executing an INSERT INTO ... SELECT FROM ... statement periodically. But this assumes a steady flow of incoming data; otherwise, you might be materializing too frequently or too late.
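For illustration, the schedule-based approach could look roughly like the sketch below. The table names, the SAMPLE BY query, and the hourly interval are assumptions for the example, not the repository's actual SQL, and the target table is assumed to already exist with matching columns.

# Minimal sketch of the schedule-based approach: rebuild the aggregate every
# hour via QuestDB's HTTP /exec endpoint, whether or not new data has arrived.
import time
import requests

QUESTDB_URL = "http://localhost:9000/exec"

# Hypothetical materialization query: aggregate the last hour of trades into
# 10-minute buckets and append them to a pre-created sampled_trades table.
REFRESH_SQL = """
INSERT INTO sampled_trades
SELECT timestamp, symbol, avg(price) AS avg_price, sum(amount) AS volume
FROM trades
WHERE timestamp > dateadd('h', -1, now())
SAMPLE BY 10m;
"""

while True:
    response = requests.get(QUESTDB_URL, params={"query": REFRESH_SQL}, timeout=30)
    response.raise_for_status()
    time.sleep(3600)  # refresh once per hour, regardless of ingestion rate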
A better option might be to monitor your source tables and, whenever you reach a threshold of new or updated rows, run the materialization query. This demo script does exactly that.
The QuestDB team is already working on native materialized tables, which will be available in the coming months. In the meantime, this script serves as an example of what’s possible with the current features.
In this example, we monitor the smart_meters and trades tables, and when the specified thresholds are met, the script materializes the sampled_meters view. It also keeps track of the transactions it has already seen using a tracking table, so it can survive restarts.
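The exact bookkeeping lives in the repository, but the idea behind the tracking table can be sketched in a few lines. Everything below (table layout, column names, helper functions) is our own illustration rather than the script's actual schema, and it again assumes QuestDB's HTTP /exec endpoint:

# Minimal sketch: persist the last processed sequencerTxn per source table so
# the monitor can resume where it left off after a restart.
import requests

QUESTDB_URL = "http://localhost:9000/exec"

def run(sql):
    response = requests.get(QUESTDB_URL, params={"query": sql}, timeout=10)
    response.raise_for_status()
    return response.json()

def last_processed_txn(tracking_table, tracking_id, source_table):
    result = run(
        f"SELECT max(sequencerTxn) FROM {tracking_table} "
        f"WHERE tracking_id = '{tracking_id}' AND table_name = '{source_table}'"
    )
    rows = result.get("dataset", [])
    return rows[0][0] if rows and rows[0][0] is not None else 0

def record_processed_txn(tracking_table, tracking_id, source_table, txn):
    run(
        f"INSERT INTO {tracking_table} (ts, tracking_id, table_name, sequencerTxn) "
        f"VALUES (now(), '{tracking_id}', '{source_table}', {txn})"
    )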
Here’s a sample execution from one of the scripts in the repository:
# Monitor changes in the 'smart_meters' and 'trades' tables and materialize the 'sampled_meters' view
python materialize_view.py \
  --table_names smart_meters,trades \
  --thresholds 100,50 \
  --sql_template_path materialize.sql \
  --check_interval 5 \
  --timestamp_columns smart_meters.timestamp,trades.timestamp \
  --tracking_table materialize_tracker \
  --tracking_id meters_and_trades
The output looks like this:
Starting from transaction ID: 308 with structure version: 3 for table smart_meters
Starting from transaction ID: 3728 with structure version: 0 for table trades
Executed query:
INSERT INTO sampled_meters(timestamp, device_id, mark_model,
first_status, last_status, frequency, energy_consumption, voltage, current, power_factor,
price)
SELECT smart_meters.timestamp, device_id, mark_model,
first(status), last(status),
avg(frequency), avg(energy_consumption), avg(voltage), avg(current),
avg(power_factor), avg(price)
FROM smart_meters ASOF JOIN trades
WHERE smart_meters.timestamp >= '2024-07-29 14:51:34.144738' AND smart_meters.timestamp <= '2024-07-29 14:52:04.044696' AND trades.timestamp >= '2024-07-29 14:51:37.107452' AND trades.timestamp <= '2024-07-29 14:52:06.804897'
SAMPLE BY 10m;
Executed query:
INSERT INTO sampled_meters(timestamp, device_id, mark_model,
first_status, last_status, frequency, energy_consumption, voltage, current, power_factor,
price)
SELECT smart_meters.timestamp, device_id, mark_model,
first(status), last(status),
avg(frequency), avg(energy_consumption), avg(voltage), avg(current),
avg(power_factor), avg(price)
FROM smart_meters ASOF JOIN trades
WHERE smart_meters.timestamp >= '2024-07-29 14:52:04.142463' AND smart_meters.timestamp <= '2024-07-29 14:52:23.047161' AND trades.timestamp >= '2024-07-29 14:52:06.905866' AND trades.timestamp <= '2024-07-29 14:52:36.885151'
SAMPLE BY 10m;
Summary
QuestDB's ability to track data changes through the wal_transactions table
opens up various possibilities for real-time data processing and analytics.
Whether you need to perform real-time integrations, update machine learning
models, continuously materialize views, or monitor table performance, you can
leverage these capabilities to build robust solutions.