Headlining our latest release is support for data deduplication and the IPv4 data type. Additionally, we are pleased to offer new SQL syntax for combining two longs into a single UUID, improved tuning for WAL tables, and several fixes and stability improvements.
Let's take a deeper look.
Data deduplication preserves the clarity of data from any source. With deduplication, duplicates are discarded discreetly before they hit the disk. As a result, they neither add to storage costs nor slow down SQL queries. And the best part?
Data deduplication has a negligible impact on ingestion performance!
You can turn it on, and keep it on.
Those who know time-series data know all about the madness of duplicate data. When many different sources blast streams of fluctuating data together, there is bound to be overlap. If unintended, this overlap burns storage resources, and compromises the integrity of sensitive data sets.
Consider a pair of common use cases for deduplication...
A bursting data stream generates 500,000 data points over a brief window of time. But alas, its latest burst was inaccurate because the stream was miscalibrated. To try to fix it, the operator cancelled the burst mid-send. But even though the burst did not complete, 250,000 rows still made it through. Now a range of time is inaccurate.
The operator recalibrates the stream, and the revised data are re-sent over the same time period. It now succeeds in full. Unfortunately, the prior 250,000 inaccurate records are still present. To make matters worse, there are now duplicates that share a timestamp, half correct and half incorrect.
In total, there are 750,000 events instead of the intended 500,000. With deduplication, existing rows whose timestamps match the new insert are overwritten and the remaining sequence streams in as intended. Now the entire burst is in chronological order: accurate, clean and whole at 500,000 data points.
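The arithmetic above can be sketched in a few lines of Python. This is only a toy model of the behaviour described, scaled down to 500 points, and it assumes the timestamp alone identifies a row. The `ingest` helper and the row values are hypothetical names for illustration, not QuestDB internals.

```python
def ingest(table, rows, dedup):
    """Add rows to a table held as a list of (timestamp, value) tuples.

    Without dedup, rows are simply appended. With dedup, a new row whose
    timestamp already exists overwrites the stored row.
    """
    if not dedup:
        return table + rows
    by_ts = dict(table)           # timestamp -> value
    for ts, value in rows:
        by_ts[ts] = value         # matching timestamp: overwrite
    return sorted(by_ts.items())  # keep chronological order


# The cancelled burst: only the first half made it through.
bad_burst = [(ts, "miscalibrated") for ts in range(250)]
# The recalibrated burst, re-sent over the same time period.
good_burst = [(ts, "calibrated") for ts in range(500)]

plain = ingest(ingest([], bad_burst, dedup=False), good_burst, dedup=False)
deduped = ingest(ingest([], bad_burst, dedup=True), good_burst, dedup=True)

print(len(plain))    # 750 rows, 250 of them stale
print(len(deduped))  # 500 rows, all from the corrected burst
```

Without deduplication the stale rows remain beside the corrected ones. With it, the re-send replaces them in place.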
What about outside of timestamps? Picture an organization that measures precise temperatures from various sensors. A sensor takes thousands of readings every second.
The organization analyzes fluctuations in temperature from their baseline, which means that an unbroken sequence of repeated readings is not ideal. It just adds noise to their data. Instead, they are most interested in variation.
When the timestamp column keeps ticking upwards while the temperature repeats, the data is not interesting.
If the temperature changes, then returns to baseline, that is very interesting.
Deduplication provides a cleaner look.
The temperature column folds repetitions together until new values appear.
Deduplication support is a popular request from QuestDB users. And it is easy to understand why! Duplicates muddle data, use unnecessary resources and often require clean-up. Deduplication ensures that any repetitions are handled with grace, even if data originate from multiple sources.
Enable dedup on WAL-enabled tables with the following SQL syntax:
For a deeper explanation of how to deduplicate, see the documentation.
IPv4 addresses are a common data type with their own unique numerical convention. We often hear from teams tracking the activity of users, servers or sensors that managing IP addresses without validity checks and type-specific functions can be frustrating.
Those working with IPs will appreciate IPv4 support, which is more flexible and performant with SQL execution, and much kinder on storage demand.
IPv4 addresses are stored as integers on disk and passed as strings.
Note that IPv4 addresses may not contain netmasks when stored on disk. However, when they are passed as a string argument to a function or operator, such as the subnet inclusion operators <<= and <<, they may include a netmask, or be a subnet with a netmask, as demonstrated below.
IPv4 arguments may appear as:
- an ip address without netmask:
String arguments may appear as:
- an ip address without netmask:
- an ip address with netmask:
- a subnet with netmask:
Altogether, we support the full range of IPv4 addresses.
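To make the integer-on-disk, string-on-the-wire idea concrete, here is a small sketch using Python's standard `ipaddress` module. It illustrates the general IPv4 convention only, not QuestDB's storage code, and the address `192.168.0.1` is an arbitrary example.

```python
import ipaddress

# An address without a netmask fits in a single 32-bit integer.
addr = ipaddress.IPv4Address("192.168.0.1")
as_int = int(addr)
print(as_int)                         # 3232235521
print(len(as_int.to_bytes(4, "big")))  # 4 bytes, versus 11 characters as text

# The integer converts back to the familiar dotted string.
print(str(ipaddress.IPv4Address(as_int)))  # 192.168.0.1

# An address with a netmask carries both the host and its network.
iface = ipaddress.IPv4Interface("192.168.0.1/24")
print(iface.ip, iface.network)        # 192.168.0.1 192.168.0.0/24

# A subnet with a netmask describes the whole range.
net = ipaddress.IPv4Network("192.168.0.0/24")
print(net.num_addresses)              # 256
```

The netmask forms only make sense as arguments, which matches the note above that stored values hold the bare address.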
The following operations are supported:
<: is an ip address less than another?
<=: is an ip address less than or equal to another?
>: is an ip address greater than another?
>=: is an ip address greater than or equal to another?
=: is an ip address equal to another?
!=: is an ip address not equal to another?
<<: is an ip address strictly contained by a subnet?
<<=: is an ip address contained by or equal to a subnet?
&: computes bitwise AND
~: computes bitwise NOT
|: computes bitwise OR
+: adds an offset to an ip address
ipv4 - int: subtracts an offset from an ip address
ipv4 - ipv4: computes the difference between two ip addresses
There are also two additional random functions for testing as well as a netmask function:
rnd_ipv4(): random address generator (String)
rnd_ipv4(string, int): random address generator within a range (String)
netmask(string): returns an ip address' subnet mask
For a full list of operators, see the IPv4 documentation.
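For readers who want a feel for what these operators compute, the sketch below mirrors several of them with Python's standard `ipaddress` module. It is an analogy, not QuestDB SQL. The addresses are arbitrary, and the difference between strict and non-strict containment is not modelled here.

```python
from ipaddress import IPv4Address, IPv4Network

ip = IPv4Address("192.168.1.10")
other = IPv4Address("192.168.1.1")
subnet = IPv4Network("192.168.1.0/24")

# Comparison: addresses order by their integer value.
print(other < ip)                # True

# Containment: is the address inside the subnet?
print(ip in subnet)              # True

# Bitwise operations work on the underlying 32-bit integers.
mask = int(subnet.netmask)
host_bits = ~mask & 0xFFFFFFFF   # NOT, truncated to 32 bits
network_addr = IPv4Address(int(ip) & mask)
broadcast = IPv4Address(int(ip) | host_bits)
print(network_addr)              # 192.168.1.0
print(IPv4Address(host_bits))    # 0.0.0.255
print(broadcast)                 # 192.168.1.255

# Offsets and differences.
print(ip + 5)                    # 192.168.1.15
print(ip - 5)                    # 192.168.1.5
print(int(ip) - int(other))      # 9

# The subnet mask of a network.
print(subnet.netmask)            # 255.255.255.0
```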
The IPv4 data type supports SQL functions, like LATEST BY and many more. For a full list, check out our documentation.
We currently support two ingestion methods: ILP and INSERT SQL. In either case, IPv4 values are provided as a string. When queried, IPv4 values are delivered as strings to both HTTP and PostgreSQL clients.
If you already store IPv4 addresses in QuestDB as strings, then there is no need to modify your existing workflow. After upgrading, your deployment is compatible with the IPv4 data type.
Create a new table with IPv4 support using
What if you need to combine two IDs from different sources into one unique ID?
Each long value has 64 bits. UUID values have 128 bits.
Combine two longs into a single UUID:
This results in a unified UUID:
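The bit arithmetic behind this is simple enough to sketch in Python. This is an illustration of packing two 64-bit values into 128 bits, not QuestDB's implementation. The helper name `longs_to_uuid` is hypothetical, and which long lands in the high or low half is an assumption made for this example.

```python
import uuid

MASK64 = (1 << 64) - 1  # keeps the low 64 bits, so negative longs wrap


def longs_to_uuid(high, low):
    """Pack two signed 64-bit longs into one 128-bit UUID.

    Assumption for this sketch: `high` fills the most significant
    64 bits and `low` fills the least significant 64 bits.
    """
    return uuid.UUID(int=((high & MASK64) << 64) | (low & MASK64))


print(longs_to_uuid(1, 2))    # 00000000-0000-0001-0000-000000000002
print(longs_to_uuid(-1, -1))  # ffffffff-ffff-ffff-ffff-ffffffffffff
```

Because 64 + 64 = 128, every pair of longs maps to exactly one UUID and no information is lost.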
There are no breaking changes in this release.