
#
IntroductionFor most applications dealing with time series data, the value of each data point diminishes over time as the granularity of the dataset loses relevance as it gets stale. For example, when applying a real-time anomaly detection model, more granular data (e.g., data collected at second resolution), would yield better results. However, to train forecasting models afterwards, recording data at such high frequency may not be needed and would be costly in terms of storage and compute.
When I was working for an IoT company, to combat this issue, we stored data in three separate databases. To show the most up to date value, latest updates were pushed to a NoSQL realtime database. Simultaneously, all the data was appended to both a time series database storing up to 3 months of data for quick analysis and to an OLAP database for long-term storage. To stop the time series database from exploding in size, we also ran a nightly job to delete old data. As the size of the data grew exponentially with IoT devices, this design caused operational issues with maintaining three different databases.
QuestDB solves this by providing easy ways to downsample the data and also detach or drop partitions when old data is no longer necessary. This helps to keep all the data in a single database for most operations and move stale data to cheaper storage in line with a mature data retention policy.
To illustrate, letβs revisit the IoT application involving heart rate data. Unfortunately, Google decided to shut down its Cloud IoT Core service, so weβll use randomized data for this demo.
#
Populating heart rate dataLetβs begin by running QuestDB via Docker:
Weβll create the a simple heart-rate data table with a timestamp, heart rate, and sensor ID partitioned by month via the console at localhost:9000:
We now have randomized data from 10,000 sensors over ~2 months time frame (10M data points). Suppose we are continuously appending to this dataset from a data stream, then having such frequent updates will be useful to detect anomalies in heart rate. This could be useful to detect and alert on health issues that could arise.
#
Downsampling the dataHowever, if no anomalies are detected, having a dataset with heart rate collected every second is not useful if we simply want to note general trends over time. Instead we can record the average heart rate in one hour intervals to compact data. For example, if weβre interested in the min, max, and avg heart rate of a specific sensor, sampled every hour, we can invoke:
Once you are happy with the downsampled results, we can store those results into a separate sampled_data table for other data science time to create forecasting models or do further analysis:
This downsampling operation can be done periodically (e.g., daily, monthly) to populate the new table. This way the data science team does not have to import the massive raw dataset and can simply work with sampled data with appropriate resolution.
#
Data retention strategyDownsampling alone, however, does not solve the growing data size. The raw
sensor heart_rate
table will continue to grow in size. In this case, we have
some options in QuestDB to detach or even drop partitions.
Since we partitioned the original dataset by month, we have 3 partitions:
2022β10
, 2022β11
, and 2022β12
. This can be seen under /db/heart_rate/
directories, along with other files holding metadata.
After we have downsampled the data, we probably no longer need data from older months. In this case, we can DETACH this partition to make it unavailable for reads.
Now the 2022β10
partition is renamed to 2022β10.detached
and running queries
in the heart_rate
table returns data from 2022β11
onwards:
We can then compress this data and move it to a cheaper block storage option like S3 or GCS:
If we need to restore this partition for further analysis, we can re-download
the tar file to a new directory named PARTITION-NAME.attachable
under /db/
(or where the rest of the QuestDB data lives) and uncompress the tar file:
With the data in place, simply use the ATTACH command:
We can verify the partition has been attached back by running the count query and seeing 10M records:
Alternatively, if we want to simply delete partitions, we can use the DROP command to do so. Unlike the DETACH command, this operation is irreversible: