SQL Extensions for Time Series Data in QuestDB - Part II

Kovid Rathee

Kovid Rathee

Guest post

This post comes from Kovid Rathee, who, following up on his first tutorial, has put together another tutorial on SQL extensions for time series data in QuestDB.

A screenshot of the QuestDB Web Console with a SAMPLE BY query

Introduction#

This tutorial follows up on our previous one, where we introduced SQL extensions in QuestDB that make time-series analysis easier. Today, you will learn about the SAMPLE BY extension in detail, which will enable you to work with time-series data efficiently because of its simplicity and flexibility.

To get started with this tutorial, you should know that SAMPLE BY is a SQL extension in QuestDB that helps you group or bucket time-series data based on the designated timestamp. This removes the need for lengthy CASE WHEN statements and GROUP BY clauses. Not only that, the SAMPLE BY extension enables you to quickly deal with many other data-related issues, such as missing data, incorrect timezones, and offsets.

This tutorial assumes you have an up-and-running QuestDB instance ready for use. Let's dive straight into it.

Setup#

Import sample data#

Similar to the previous tutorial, we'll use the NYC taxi rides data for February 2018. You can use the following script that utilizes the HTTP REST API to upload data into QuestDB:

curl https://s3-eu-west-1.amazonaws.com/questdb.io/datasets/grafana_tutorial_dataset.tar.gz > grafana_data.tar.gz
tar -xvf grafana_data.tar.gz
curl -F data=@taxi_trips_feb_2018.csv http://localhost:9000/imp
curl -F data=@weather.csv http://localhost:9000/imp

Alternatively, you can use the import functionality in the QuestDB console, as shown in the image below:

Create an ordered timestamp column#

The SAMPLE BY keyword mandates the use of the designated timestamp column to enable further analysis. Therefore, you'll have to elect the pickup_datetime column as the designated timestamp in a new table called taxi_trips with the script below:

CREATE TABLE taxi_trips AS (
SELECT *
FROM 'taxi_trips_feb_2018.csv'
ORDER BY pickup_datetime
) TIMESTAMP(pickup_datetime)
PARTITION BY MONTH;

By converting the pickup_datetime column to timestamp, you are allowing QuestDB to use it as the table's designated timestamp. Using this designated timestamp column, QuestDB is able to index the table to run time-based queries more efficiently. If it all goes well, you should see the following data after running a SELECT * query on the taxi_trips table:

Understanding the basics of SAMPLE BY#

The SAMPLE BY extension allows you to create groups and buckets of data based on time ranges. This is especially valuable for time-series data as you can calculate frequently used aggregates with extreme simplicity. SAMPLE BY offers you the ability to summarize or aggregate data from very fine to very coarse units of time, i.e., from microseconds to months and everything in between (milliseconds, seconds, minutes, hours, and days). You can also derive other units of time, such as a week, fortnight, and year from the ones provided out of the box.

Let's look at some examples to understand how to use SAMPLE BY in different scenarios.

Hourly count of trips#

You can use the SAMPLE BY keyword with the sample unit of h to get an hour-by-hour count of trips for the whole duration of the data set. Running the following query, you'll get results in the console:

SELECT
pickup_datetime,
COUNT() total_trips
FROM
'taxi_trips' SAMPLE BY 1h;

There are two ways you can read your data in the QuestDB console: using the grid, which has a tabular form factor, or using a chart, where you can draw up a line, bar, or an area chart to visualize your data. Here's an example of a bar chart drawn from the above query:

Three-hourly holistic summary of trips#

The SAMPLE BY extension allows you to group data by any arbitrary number of sample units. In the following example, you'll see that the query is calculating a three-hourly summary of trips with multiple aggregate functions:

SELECT
pickup_datetime,
COUNT() total_trips,
SUM(passenger_count) total_passengers,
ROUND(AVG(trip_distance), 2) avg_trip_distance,
ROUND(SUM(fare_amount)) total_fare_amount,
ROUND(SUM(tip_amount)) total_tip_amount,
ROUND(SUM(fare_amount + tip_amount)) total_earnings
FROM
'taxi_trips' SAMPLE BY 3h;

You can view the output of the query in the following grid on the QuestDB console:

Weekly summary of trips#

As mentioned above, although there's no sample unit for a week, a fortnight, or a year, you can derive them simply by utilizing the built-in sample units. If you want to sample the data by a week, use 7d as the sampling time, as shown in the query below:

SELECT
pickup_datetime,
COUNT() total_trips,
SUM(passenger_count) total_passengers,
ROUND(AVG(trip_distance), 2) avg_trip_distance,
ROUND(SUM(fare_amount)) total_fare_amount,
ROUND(SUM(tip_amount)) total_tip_amount,
ROUND(SUM(fare_amount + tip_amount)) total_earnings
FROM
'taxi_trips'
WHERE
pickup_datetime BETWEEN '2018-02-01' AND '2018-02-28' SAMPLE BY 7d;

Dealing with missing data#

If you've worked a fair bit with data, you already know that data isn't always in a pristine state. One of the most common issues, especially with time-series data, is discontinuity, i.e., scenarios where data is missing for specific time periods. You can quickly identify and deal with missing data using the advanced functionality of the SAMPLE BY extension.

QuestDB offers an easy way to generate and fill in missing data with the SAMPLE BY clause. Take the following example: I've deliberately removed data from 4 am to 5 am for the 1st of February 2018. Notice how the FILL keyword, when used in conjunction with the SAMPLE BY extension, can generate a row for the hour starting at 4 am and fill it with data generated from linear interpolation of the 2 surrounding points:

SELECT
pickup_datetime,
COUNT() total_trips,
SUM(passenger_count) total_passengers,
ROUND(AVG(trip_distance), 2) avg_trip_distance,
ROUND(SUM(fare_amount)) total_fare_amount,
ROUND(SUM(tip_amount)) total_tip_amount,
ROUND(SUM(fare_amount + tip_amount)) total_earnings
FROM
'taxi_trips'
WHERE
pickup_datetime NOT BETWEEN '2018-02-01T04:00:00' AND '2018-02-01T04:59:59' SAMPLE BY 1h FILL(LINEAR);

The FILL keyword demands a fillOption from the following:

fillOptionUsage scenarioNotes
NONEWhen you don't want to populate missing data, and leave it as isThis is the default fillOption
NULLWhen you want to generate rows for missing time periods, but leave all the values as NULLs
PREVWhen you want to copy the values of the previous row from the summarized dataThis is useful when you expect the numbers to be similar to the preceding time period
LINEARWhen you want to normalize the missing values, you can take the average of the immediately preceding and following row
CONST or xWhen you want to hardcode values where data is missingFILL (column_1, column_2, column_3, ...)

Here's another example of hardcoding values using the FILL(x) fillOption:

In the example above, we've used an inline WHERE clause to emulate missing data with the help of the NOT BETWEEN keyword. Alternatively, you can create a separate table with missing trips using the same idea, as shown below:

CREATE TABLE 'taxi_trips_missing' AS (
SELECT *
FROM 'taxi_trips'
WHERE
pickup_datetime NOT BETWEEN '2018-02-01T04:00:00' AND '2018-02-01T04:59:59'
);

Working with timezones and offsets#

The SAMPLE BY extension also enables you to change timezones and add or subtract offsets from your timestamp columns to adjust for any issues you might encounter when dealing with different source systems, especially in different geographic areas. It is important to note that, by default, QuestDB aligns its sample calculation based on the FIRST OBSERVATION, as shown in the example below:

SELECT
pickup_datetime,
COUNT() total_trips,
SUM(passenger_count) total_passengers,
ROUND(AVG(trip_distance), 2) avg_trip_distance,
ROUND(SUM(fare_amount)) total_fare_amount,
ROUND(SUM(tip_amount)) total_tip_amount,
ROUND(SUM(fare_amount + tip_amount)) total_earnings
FROM
'taxi_trips'
WHERE
pickup_datetime BETWEEN '2018-02-01T13:35:52' AND '2018-02-28' SAMPLE BY 1d;

Note that now the 1d sample calculation starts at 13:35:52 and ends at 13:35:51 the next day. Apart from the option demonstrated above, there are two other ways to align your sample calculations - to the calendar time zone, and to calendar with offset.

Let's take a look at the other two alignment methods.

Aligning sample calculation to another timezone#

When moving data across systems, pipelines, and warehouses, you can encounter issues with time zones. For the sake of demonstration, let's assume that you're working in New York City, but you've identified that the timestamps of the data set you've loaded into the database are in Australian Eastern Time (instead of New York's EST). Traditionally, this could lead to extra conversion work to ensure that this new data is comparable to the rest of your data in EST.

QuestDB allows you to easily fix this issue by aligning your data to another timezone using the ALIGN TO CALENDAR TIME ZONE option with the SAMPLE BY extension. In the example shown below, you can see how an ALIGN TO CALENDAR TIME ZONE ('AEST') has aligned the pickup_datetime, i.e., the designated timestamp column to the AEST timezone for Melbourne.

SELECT
pickup_datetime,
COUNT() total_trips,
SUM(passenger_count) total_passengers,
ROUND(AVG(trip_distance), 2) avg_trip_distance,
ROUND(SUM(fare_amount)) total_fare_amount,
ROUND(SUM(tip_amount)) total_tip_amount,
ROUND(SUM(fare_amount + tip_amount)) total_earnings
FROM
'taxi_trips' SAMPLE BY 3h ALIGN TO CALENDAR TIME ZONE ('AEST');

Aligning sample calculation with offsets#

Similar to the previous example, you can also align the sample calculation by offsetting the designated timestamp column manually by any hh:mm value between -23:59 to 23:59. In the following example, we're offsetting the sample calculation by -5:30, i.e., negative five hours and thirty minutes:

SELECT
pickup_datetime,
COUNT() total_trips,
SUM(passenger_count) total_passengers,
ROUND(AVG(trip_distance), 2) avg_trip_distance,
ROUND(SUM(fare_amount)) total_fare_amount,
ROUND(SUM(tip_amount)) total_tip_amount,
ROUND(SUM(fare_amount + tip_amount)) total_earnings
FROM
'taxi_trips' SAMPLE BY 3h ALIGN TO CALENDAR WITH OFFSET '-05:30';

Conclusion#

In this tutorial, you learned how to exploit the SAMPLE BY extension in QuestDB to work efficiently with time-series data, especially in aggregated form. In addition, the SAMPLE BY extension also allows you to fix common problems with time-series data attributable to complex data pipelines, disparate source systems in different geographical areas, software bugs, etc. All in all, SQL extensions in QuestDB, like SAMPLE BY, provide a significant advantage when working with time-series data by enabling you to achieve more in fewer lines of SQL.

Join our developer community

QuestDB is open source. Follow us on Twitter, star our GitHub repo, and join our developer community on Slack!

Subscribe to our newsletter

Stay up to date with all things QuestDB