It all begins with a simple query, and a not-so-simple result.

```questdb-sql
SELECT pickup_datetime as t, count
FROM trips
WHERE pickup_datetime IN '2018'
SAMPLE BY 5d
LIMIT 5
```
In this query, we use QuestDB's `SAMPLE BY`[^1] syntax to downsample data from the NYC Taxi dataset. This query will sub-sample the trips with start dates in 2018 into buckets of 5 days, and then take the first 5 rows.
Or will it?
## The bug
Running this query, we get the following curious result:
| t | count |
| --- | --- |
| 2017-12-30T00:00:00.000000Z | 808413 |
| 2018-01-04T00:00:00.000000Z | 1264083 |
| 2018-01-09T00:00:00.000000Z | 1647305 |
| 2018-01-14T00:00:00.000000Z | 1580716 |
| 2018-01-19T00:00:00.000000Z | 1542316 |
Despite filtering the data to be `IN '2018'`[^2], the first bucket starts in 2017. Most people would not expect this result set. With 2017 as our first result, it looks like a bug in the `SAMPLE BY` code. However, let's keep an open mind as we investigate further.
## EXPLAINing things
We obtain the query plan by re-running the query with `EXPLAIN`:
```questdb-sql
EXPLAIN
SELECT pickup_datetime as t, count
FROM trips
WHERE pickup_datetime IN '2018'
SAMPLE BY 5d
LIMIT 5
```
This is the plan:
```
Sort light lo: 5
  keys: [t]
    Async Group By workers: 46
      keys: [t]
      values: [count(*)]
      filter: null
        DataFrame
            Row forward scan
            Interval forward scan on: trips
              intervals: [("2018-01-01T00:00:00.000000Z","2018-12-31T23:59:59.999999Z")]
```
This plan does not contain any `SAMPLE BY` nodes, which might initially be surprising. This is due to an optimisation pass which rewrites `SAMPLE BY` queries (which run single-threaded) into `GROUP BY` queries (which run in parallel).
This optimisation applies specifically to `ALIGN TO CALENDAR` queries, the `SAMPLE BY` default. QuestDB provides two options for sampling data - `ALIGN TO FIRST OBSERVATION` and `ALIGN TO CALENDAR`. The former begins the sampling at the first timestamp in the dataset and counts from there, whereas calendar alignment floors the timestamp to the nearest unit. Therefore, for calendar alignment with a stride of `5d`, we expect each `5d` bucket to start at `00:00:00Z`.
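To see the difference concretely, we can force first-observation alignment on the same query. This is a sketch; the exact first bucket depends on the timestamp of the first matching row in the dataset:

```questdb-sql
-- Buckets are anchored at the first row's timestamp,
-- not floored to a calendar boundary.
SELECT pickup_datetime as t, count
FROM trips
WHERE pickup_datetime IN '2018'
SAMPLE BY 5d ALIGN TO FIRST OBSERVATION
LIMIT 5
```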
Continuing, the above query is rewritten into something similar to[^3]:
```questdb-sql
SELECT timestamp_floor('5d', pickup_datetime) as t,
       count
FROM trips
WHERE pickup_datetime IN '2018'
ORDER BY 1
LIMIT 5
```
In this rewritten query, the `WHERE` clause has been transformed into the correct inclusive interval scan. The compiler generates an Async (parallel) Group By node, keyed by `timestamp_floor`. Last, we apply the `ORDER BY` and `LIMIT` through a limited sort, to ensure the `GROUP BY` result matches an ordered `SAMPLE BY` result.
This interval issue is not solely caused by the parallel `GROUP BY` optimisation, since the same flooring code is used behind the scenes. We can add an offset to drop out of the optimisation and compare the results from a sequential `SAMPLE BY`:
```questdb-sql
SELECT pickup_datetime as t, count
FROM trips
WHERE pickup_datetime IN '2018'
SAMPLE BY 5d ALIGN TO CALENDAR WITH OFFSET '10:00'
LIMIT 5
```
| t | count |
| --- | --- |
| 2017-12-30T10:00:00.000000Z | 808413 |
| 2018-01-04T10:00:00.000000Z | 1264083 |
| 2018-01-09T10:00:00.000000Z | 1647305 |
| 2018-01-14T10:00:00.000000Z | 1580716 |
| 2018-01-19T10:00:00.000000Z | 1542316 |
## The cause
So what is the cause of these incorrect timestamps? We look to the bucket generation for an answer:
```questdb-sql
SELECT timestamp_floor('5d', '2018-01-01')
```
| timestamp_floor |
| --- |
| 2017-12-30T00:00:00.000000Z |
Since this function is isolated, it doesn't know about its wider context as the lower bound of an interval. We'd ideally like the buckets to start at `2018-01-01` and count onwards. Unfortunately, `timestamp_floor` has only one fixed origin: the Unix epoch, i.e. `0`.
We can use some modulo arithmetic to see where the `2017-12-30` originates:

```questdb-sql
SELECT datediff('d', 0, '2018-01-01') % 5 as "2018-01-01",
       datediff('d', 0, '2017-12-30') % 5 as "2017-12-30"
```
| 2018-01-01 | 2017-12-30 |
| --- | --- |
| 2 | 0 |
`2017-12-30` is 17530 days past the Unix epoch, and this is divisible by 5. `2018-01-01` is two days further on, which causes it to be floored to the nearest multiple of `5d`, which is `2017-12-30`.
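We can verify the day count itself, reusing the `datediff`-from-epoch pattern from the query above:

```questdb-sql
-- 2017-12-30 is 17530 days after the Unix epoch, and 17530 % 5 = 0,
-- so it sits exactly on a 5d bucket boundary.
SELECT datediff('d', 0, '2017-12-30') as days_since_epoch
```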
To resolve this issue, we can add a third parameter to `timestamp_floor`: an offset origin which is used as a new base for the flooring calculations. But how do we know what value should be used for this new parameter?
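As a sketch of how such a call might look, using the interval's lower bound as the origin (the three-argument signature shown here is an assumption based on the description above):

```questdb-sql
-- Flooring anchored at '2018-01-01' instead of the epoch: the lower
-- bound now floors to itself, so buckets start where we expect.
SELECT timestamp_floor('5d', '2018-01-01', '2018-01-01')
```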
## A new origin
The obvious place to look for the origin is the `WHERE` clause. In the prior query, we have a clear interval, and could take the lower bound of the interval as our offset parameter.
But this clause is not guaranteed - what if the user wishes to sample their entire dataset?
```questdb-sql
SELECT pickup_datetime as t, count
FROM trips
SAMPLE BY 1M
```
Running this query over the 1.6 billion rows takes just a few seconds. From it, we get a useful breakdown of how many trips were recorded per month. This kind of query is not uncommon, and has no `WHERE` clause. If we use a non-unit stride like `5d`, we can encounter misalignment issues.
To demonstrate this issue, we'll select the same rows as the earlier queries, but without using the `WHERE` clause.
First, we find the bucket offset of the final row relative to `2017-12-30`:
```questdb-sql
SELECT datediff('d', '2017-12-30', pickup_datetime) / 5 as offset
FROM trips LIMIT -1
```
| offset |
| --- |
| 109 |
The last row falls in bucket 109, so the `2017-12-30` bucket is 110 buckets from the end. We jump back and select 5 rows:
```questdb-sql
SELECT pickup_datetime as t, count
FROM trips
SAMPLE BY 5d
LIMIT -110, -105
```
| t | count |
| --- | --- |
| 2017-12-30T00:00:00.000000Z | 1349468 |
| 2018-01-04T00:00:00.000000Z | 1264083 |
| 2018-01-09T00:00:00.000000Z | 1647305 |
| 2018-01-14T00:00:00.000000Z | 1580716 |
| 2018-01-19T00:00:00.000000Z | 1542316 |
As you can see, the problem reoccurs!
But even when a `WHERE` clause is provided, it can be complex. If the clause contains an `OR` condition, then we may not generate an Interval Scan node, and we would have to interpret a complex set of filters to try to infer an appropriate lower bound.
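For instance, a filter like the following leaves no single interval to anchor the buckets to (a hypothetical example; `trip_distance` is assumed to be a column in the dataset):

```questdb-sql
-- Two disjoint conditions joined by OR: no single lower bound
-- can be inferred for the sample buckets.
SELECT pickup_datetime as t, count
FROM trips
WHERE pickup_datetime IN '2018' OR trip_distance > 100
SAMPLE BY 5d
```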
It seems that we are asking too much of the `WHERE` clause, which is designed to filter data before it is sampled, not to handle the intricacies of the output histogram.
Therefore, perhaps our `SAMPLE BY` syntax is too limited, and we need a better way to define the shape of our output data. The best source for this information is the user - let's give them a way to communicate that information!
## The need for new syntax
Our popular open source repository had an open feature request raised by one of our QuestDB champions, newskooler.
This was a request to extend the current filling behaviour of `SAMPLE BY` to allow prefilling and postfilling of datasets. Previously, `SAMPLE BY` would only fill missing values between existing rows in your result set.
Consider the following query:
```questdb-sql
SELECT pickup_datetime as t, max(fare_amount) as max_fare
FROM trips
SAMPLE BY 1s FROM '2009-01-01T00:00:00Z' TO '2009-01-01T00:00:05Z'
FILL(NULL)
LIMIT 5
```
| t | max_fare |
| --- | --- |
| 2009-01-01T00:00:00.000000Z | 5.8 |
| 2009-01-01T00:00:01.000000Z | null |
| 2009-01-01T00:00:02.000000Z | 5.8 |
| 2009-01-01T00:00:03.000000Z | null |
| 2009-01-01T00:00:04.000000Z | 4.6 |
The result starts at the first bucket and proceeds in steps of `1s`. Where there are no trips, the `max_fare` column is filled with a placeholder value, in this case `null`.
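Other placeholders work the same way. For example, a sketch of the same query filling gaps with a constant instead of `null`:

```questdb-sql
-- FILL with a constant: empty buckets report 0 instead of null.
SELECT pickup_datetime as t, max(fare_amount) as max_fare
FROM trips
SAMPLE BY 1s FROM '2009-01-01T00:00:00Z' TO '2009-01-01T00:00:05Z'
FILL(0)
LIMIT 5
```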
In some cases, you might wish to have a fixed time axis, for example, when plotting an annual chart. Let's see how this new syntax simplifies our workflow.
The trips dataset contains no data from `2019-07-01` onwards. We will take a monthly summary, extended with `null`s to cover the whole year. This can be achieved with a `LEFT JOIN`:
```questdb-sql
SELECT fill.t as t, data.max_fare as max_fare FROM
(
    SELECT dateadd('M', (x-1)::INT,
        to_timestamp('2019-01-01', 'yyyy-MM-dd')
    ) as t,
    null AS max_fare
    FROM long_sequence(12)
) AS fill
LEFT JOIN
(
    SELECT pickup_datetime as t,
           max(fare_amount) as max_fare
    FROM trips
    WHERE pickup_datetime IN '2019'
    SAMPLE BY 1M FILL(NULL)
) AS data
ON data.t = fill.t
```
| t | max_fare |
| --- | --- |
| 2019-01-01T00:00:00.000000Z | 760.0 |
| 2019-02-01T00:00:00.000000Z | 750.0 |
| 2019-03-01T00:00:00.000000Z | 800.0 |
| 2019-04-01T00:00:00.000000Z | 907.0 |
| 2019-05-01T00:00:00.000000Z | 983.5 |
| 2019-06-01T00:00:00.000000Z | 780.0 |
| 2019-07-01T00:00:00.000000Z | null |
| 2019-08-01T00:00:00.000000Z | null |
| 2019-09-01T00:00:00.000000Z | null |
| 2019-10-01T00:00:00.000000Z | null |
| 2019-11-01T00:00:00.000000Z | null |
| 2019-12-01T00:00:00.000000Z | null |
The above query is not ergonomic, and the issue proposed an additional `FROM-TO` clause to solve this problem.
Let's try it out:
```questdb-sql
SELECT pickup_datetime as t,
       max(fare_amount) as max_fare
FROM trips
SAMPLE BY 1M FROM '2019-01-01' TO '2020-01-01' FILL(NULL)
```
That's a dramatic improvement!
## The feature
Here's our original query, using the new `FROM-TO` clause:
```questdb-sql
SELECT pickup_datetime as t, count
FROM trips
SAMPLE BY 5d FROM '2018-01-01' TO '2019-01-01'
LIMIT 5
```
| t | count |
| --- | --- |
| 2018-01-01T00:00:00.000000Z | 1226531 |
| 2018-01-06T00:00:00.000000Z | 1468302 |
| 2018-01-11T00:00:00.000000Z | 1604016 |
| 2018-01-16T00:00:00.000000Z | 1677303 |
| 2018-01-21T00:00:00.000000Z | 1563520 |
The `FROM-TO` clause uses a right-open interval, unlike `BETWEEN`, which is inclusive.
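In filter terms, the two behave roughly like this (a sketch of the equivalent `WHERE` clauses):

```questdb-sql
-- FROM '2018-01-01' TO '2019-01-01' is right-open: rows at exactly
-- 2019-01-01T00:00:00 are excluded.
SELECT count FROM trips
WHERE pickup_datetime >= '2018-01-01' AND pickup_datetime < '2019-01-01';

-- BETWEEN is inclusive: rows at both endpoints are kept.
SELECT count FROM trips
WHERE pickup_datetime BETWEEN '2018-01-01' AND '2019-01-01';
```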
With or without a `FILL`, the `FROM` lower bound will be used as the offset origin to orient the buckets.
## Range filling
When using `FILL(NULL)` or `FILL(VALUE)`, any missing buckets within the range will be filled:
```questdb-sql
SELECT pickup_datetime as t, count
FROM trips
SAMPLE BY 2y FROM '2007-01-01' TO '2023-01-01' FILL(NULL)
```
| t | count |
| --- | --- |
| 2007-01-01T00:00:00.000000Z | null |
| 2009-01-01T00:00:00.000000Z | 336976431 |
| 2011-01-01T00:00:00.000000Z | 352405650 |
| 2013-01-01T00:00:00.000000Z | 352587359 |
| 2015-01-01T00:00:00.000000Z | 310389601 |
| 2017-01-01T00:00:00.000000Z | 234948135 |
| 2019-01-01T00:00:00.000000Z | 47292137 |
| 2021-01-01T00:00:00.000000Z | null |
## Prefilling
When `FROM` is used in isolation, data will only be prefilled:
```questdb-sql
SELECT pickup_datetime as t, count
FROM trips
SAMPLE BY 2y FROM '2007-01-01' FILL(NULL)
LIMIT 3
```
| t | count |
| --- | --- |
| 2007-01-01T00:00:00.000000Z | null |
| 2009-01-01T00:00:00.000000Z | 336976431 |
| 2011-01-01T00:00:00.000000Z | 352405650 |
## Postfilling
When `TO` is used in isolation, data will be postfilled. However, keep in mind that without the `FROM` clause, you may encounter alignment issues again:
```questdb-sql
SELECT pickup_datetime as t, count
FROM trips
SAMPLE BY 2y TO '2023-01-01' FILL(NULL)
LIMIT -3
```
| t | count |
| --- | --- |
| 2018-01-01T00:00:00.000000Z | 157992797 |
| 2020-01-01T00:00:00.000000Z | null |
| 2022-01-01T00:00:00.000000Z | null |
## Optimised `WHERE` clause
When your query contains `FROM-TO` and no `WHERE` clause over the designated timestamp, we optimise this and add the `WHERE` clause for you. This ensures that only relevant data is read.
For example, this query:
```questdb-sql
SELECT pickup_datetime as t, count
FROM trips
SAMPLE BY 5d FROM '2018-01-01' TO '2019-01-01'
LIMIT 5
```
gets transformed into this:
```questdb-sql
SELECT pickup_datetime as t, count
FROM trips
WHERE pickup_datetime >= '2018-01-01' AND pickup_datetime < '2019-01-01'
SAMPLE BY 5d FROM '2018-01-01' TO '2019-01-01'
LIMIT 5
```
We run `EXPLAIN` to verify that an `Interval Scan` is generated:
```
Sort light lo: 5
  keys: [t]
    Async Group By workers: 46
      keys: [t]
      values: [count(*)]
      filter: null
        DataFrame
            Row forward scan
            Interval forward scan on: trips
              intervals: [("2018-01-01T00:00:00.000000Z","2018-12-31T23:59:59.999999Z")]
```
The compiler only generates this node when it has a clear interval over the designated timestamp.
If you have an existing `WHERE` clause, but it does not involve the designated timestamp, then we will still add the interval for you.
For example:
```questdb-sql
SELECT pickup_datetime as t, count
FROM trips
WHERE cab_type = 'yellow'
SAMPLE BY 5d FROM '2018-01-01' TO '2019-01-01'
LIMIT 5
```
gets transformed into:
```questdb-sql
SELECT pickup_datetime as t, count
FROM trips
WHERE pickup_datetime >= '2018-01-01'
  AND pickup_datetime < '2019-01-01'
  AND cab_type = 'yellow'
SAMPLE BY 5d FROM '2018-01-01' TO '2019-01-01'
LIMIT 5
```
When explained, this query also contains an `Interval forward scan`:
```
Sort light lo: 5
  keys: [t]
    Async JIT Group By workers: 46
      keys: [t]
      values: [count(*)]
      filter: cab_type='yellow'
        DataFrame
            Row forward scan
            Interval forward scan on: trips
              intervals: [("2018-01-01T00:00:00.000000Z","2018-12-31T23:59:59.999999Z")]
```
## Limitations
There are some limitations for this feature, which we hope to improve in the future.
- Keyed `SAMPLE BY` queries are not supported. These are queries that group by columns other than the designated timestamp, alongside aggregation functions (see the sketch after this list).
- `WITH OFFSET ...` will be disregarded.
- `FILL(PREV)` and `FILL(LINEAR)` are not supported.
- Parallel `SAMPLE BY` with `FROM-TO` does not support bind variables in the `FROM-TO` clause.
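For illustration, a keyed query such as this one (grouping by `cab_type` in addition to the timestamp) falls outside the feature's current scope:

```questdb-sql
-- Keyed SAMPLE BY: the extra cab_type key means FROM-TO is not
-- supported for this query.
SELECT pickup_datetime as t, cab_type, count
FROM trips
SAMPLE BY 5d FROM '2018-01-01' TO '2019-01-01'
```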
## Conclusion
This article contained some insights into our continuing development of the `SAMPLE BY` feature.
If you are interested in QuestDB or time-series data in general, take a look at our other blog posts, and try us out!
[^1]: See the documentation.

[^2]: This syntax is part of QuestDB's time-series extensions, see here.

[^3]: Specifying the `GROUP BY` clause is not necessary, as QuestDB supports implicit `GROUP BY`.