Starting from QuestDB 7.3, there is an option to enable storage-level data deduplication on a table. Data deduplication works on all the data inserted into the table and replaces matching rows with the new versions. Only new rows that do no match existing data will be inserted.
Deduplication can only be enabled for Write-Ahead Log (WAL) tables.
Deduplication in QuestDB makes table inserts idempotent. The primary use case is to allow for re-sending data within a given time range without creating duplicates.
This can be particularly useful in situations where there is an error in sending data, such as when using InfluxDB Line Protocol, and there is no clear indication of how much of the data has already been written. With deduplication enabled, it is safe to re-send data from a fixed period in the past to resume the writing process.
Enabling deduplication on a table has an impact on writing performance,
especially when multiple
UPSERT KEYS are configured. Generally, if the data
have mostly unique timestamps across all the rows, the performance impact of
deduplication is low. Conversely, the most demanding data pattern occurs when
there are many rows with the same timestamp that need to be deduplicated on
For example, in use cases where 10 million devices send CPU metrics every second precisely, deduplicating the data based on the device ID can be expensive. However, in cases where CPU metrics are sent at random and typically have unique timestamps, the cost of deduplication is negligible.
Create a WAL-enabled table with deduplication using
CREATE TABLE syntax.
Enable or disable deduplication at any time for individual tables using the following statements:
UPSERT KEYS ensure that deduplication functions as expected.
UPSERT is an abbreviation for UPDATE or INSERT, which is a common database
concept. It means that the new row UPDATEs the existing row (or multiple rows
in the general case) when the matching criteria are met. Otherwise, the new row
is INSERTed into the table. In QuestDB deduplication, the UPSERT matching
criteria are set by defining a column list in the
UPSERT KEYS clause in the
ALTER table statement.
UPSERT KEYS can be changed at any time. It can contain one or more columns.
However, there are some limitations on the
UPSERT KEYS list:
- The Designated Timestamp column must be
included in the
- Columns of STRING and BINARY types cannot be
used in the
The easiest way to explain the usage of
UPSERT KEYS is through an example:
In this example, the deduplication keys are set to the
ts column, which is the
designated timestamp, and the
ticker column. The intention is to have no more
than one price point per ticker at any given time. Therefore, if the same
price/day combination is sent twice, only the last price is saved.
The following inserts demonstrate the deduplication behavior:
In this case, deduplication overwrites row 1 with row 2 because both
deduplication keys have the same values:
The same behavior applies to the second pair of rows, where row 4 overwrites
As a result, the table contains only two rows:
Regardless of whether the inserts are executed in a single transaction/batch or as individual inserts, the outcome remains unchanged as long as the order of the inserts is maintained.
Deduplication can be disabled using the DEDUP DISABLE SQL statement:
This reverts the table to behave as usual, allowing the following inserts:
These inserts add two more rows to the TICKER_PRICE table:
Deduplication can be enabled again at any time:
Enabling deduplication does not have any effect on the existing data and only applies to newly inserted data. This means that a table with deduplication enabled can still contain duplicate data.
Enabling deduplication does not change the number of rows in the table. After enabling deduplication, the following inserts demonstrate the deduplication behavior:
After these inserts, all rows with
replaced, first by row 1 and then by row 2, and the price is set to 91.16:
It is possible to utilize metadata tables query to verify whether deduplication is enabled for a specific table:
The function table_columns can be used to identify which columns are configured as deduplication UPSERT KEYS: