The story of our SAMPLE BY enhancements
It all begins with a simple query, and a not so simple result.
SELECT pickup_datetime as t, count
FROM trips
WHERE pickup_datetime IN '2018'
SAMPLE BY 5d
LIMIT 5
In this query, we use QuestDB's SAMPLE BY[^1] syntax to downsample data from the NYC Taxi dataset. This query will sub-sample the trips with start dates in 2018 into buckets of 5 days, and then take the first 5 rows.
Or will it?
The bug
Running this query, we get the following curious result:
t | count |
---|---|
2017-12-30T00:00:00.000000Z | 808413 |
2018-01-04T00:00:00.000000Z | 1264083 |
2018-01-09T00:00:00.000000Z | 1647305 |
2018-01-14T00:00:00.000000Z | 1580716 |
2018-01-19T00:00:00.000000Z | 1542316 |
Despite filtering the data to be IN '2018'[^2], the first bucket starts in 2017. Most people would not expect this result set. With 2017 as our first result, it looks more like a bug in the SAMPLE BY code. However, let's keep an open mind as we investigate further.
EXPLAINing things
We obtain the query plan by re-running the query with EXPLAIN:
EXPLAIN
SELECT pickup_datetime as t, count
FROM trips
WHERE pickup_datetime IN '2018'
SAMPLE BY 5d
LIMIT 5
This is the plan:
Sort light lo: 5
keys: [t]
Async Group By workers: 46
keys: [t]
values: [count(*)]
filter: null
DataFrame
Row forward scan
Interval forward scan on: trips
intervals: [("2018-01-01T00:00:00.000000Z","2018-12-31T23:59:59.999999Z")]
This plan does not contain any SAMPLE BY nodes, which might initially be surprising. This is due to an optimisation pass that rewrites SAMPLE BY queries (which run single-threaded) into GROUP BY queries (which run in parallel).
This optimisation applies specifically to ALIGN TO CALENDAR queries, the SAMPLE BY default. QuestDB provides two options for sampling data: ALIGN TO FIRST OBSERVATION and ALIGN TO CALENDAR. The former begins the sampling at the first timestamp in the dataset and counts from there, whereas calendar alignment floors the timestamp to the nearest unit. Therefore, for calendar alignment with a stride of 5d, we expect each 5d bucket to start at 00:00:00Z.
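To make the distinction concrete, here is a sketch of the two alignments side by side. The exact bucket boundaries under first-observation alignment depend on the timestamp of the first matching row, so the comments are illustrative:

-- ALIGN TO FIRST OBSERVATION: buckets begin at the timestamp of the
-- first row in the result set, whatever time of day that happens to be.
SELECT pickup_datetime as t, count
FROM trips
WHERE pickup_datetime IN '2018'
SAMPLE BY 5d ALIGN TO FIRST OBSERVATION

-- ALIGN TO CALENDAR (the default): buckets are floored to the calendar,
-- so each bucket starts at 00:00:00Z.
SELECT pickup_datetime as t, count
FROM trips
WHERE pickup_datetime IN '2018'
SAMPLE BY 5d ALIGN TO CALENDAR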
Continuing, the original query is rewritten into something similar to this[^3]:
SELECT timestamp_floor('5d', pickup_datetime) as t, count
FROM trips
WHERE pickup_datetime IN '2018'
ORDER BY 1
LIMIT 5
In this plan, the WHERE clause has been transformed into the correct inclusive interval scan. The compiler generates an Async (parallel) Group By node, keyed by timestamp_floor. Last, we apply the ORDER BY and LIMIT through a limited sort, to ensure the GROUP BY result matches an ordered SAMPLE BY result.
This interval issue is not solely caused by the parallel GROUP BY optimisation, since the same flooring code is used behind the scenes. We can add an offset to drop out of the optimisation and compare the results from a sequential SAMPLE BY:
SELECT pickup_datetime as t, count
FROM trips
WHERE pickup_datetime IN '2018'
SAMPLE BY 5d ALIGN TO CALENDAR WITH OFFSET '10:00'
LIMIT 5
t | count |
---|---|
2017-12-30T10:00:00.000000Z | 808413 |
2018-01-04T10:00:00.000000Z | 1264083 |
2018-01-09T10:00:00.000000Z | 1647305 |
2018-01-14T10:00:00.000000Z | 1580716 |
2018-01-19T10:00:00.000000Z | 1542316 |
The cause
So what is the cause of these incorrect timestamps? We look to the bucket generation for an answer:
SELECT timestamp_floor('5d', '2018-01-01')
timestamp_floor |
---|
2017-12-30T00:00:00.000000Z |
Since this function is isolated, it doesn't know about its wider context as the lower bound of an interval. We'd ideally like the buckets to start at 2018-01-01 and count onwards. Unfortunately, timestamp_floor has only one fixed origin: the Unix epoch, i.e. 0.
We can use some modulo arithmetic to see where the 2017-12-30 originates:
SELECT datediff('d', 0, '2018-01-01') % 5 as "2018-01-01",
       datediff('d', 0, '2017-12-30') % 5 as "2017-12-30"
2018-01-01 | 2017-12-30 |
---|---|
2 | 0 |
2017-12-30 is 17530 days past the Unix epoch, and this is divisible by 5. 2018-01-01 is two days further on, which causes it to be floored to the nearest multiple of 5d, which is 2017-12-30.
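As a quick sanity check, we can compute that distance directly; 17530 is exactly 5 × 3506, so 2017-12-30 sits precisely on a bucket boundary:

SELECT datediff('d', 0, '2017-12-30') as days_since_epoch
-- days_since_epoch = 17530, which is divisible by 5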
To resolve this issue, we can add a third parameter to timestamp_floor: an offset origin which is used as a new base for the flooring calculations.
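Sketching how this could look, assuming the origin is passed as a hypothetical third argument, the earlier bucket-generation example would floor to the bounds we expect:

-- with '2018-01-01' as the origin, flooring no longer snaps back to 2017
SELECT timestamp_floor('5d', '2018-01-01', '2018-01-01')
-- expected result: 2018-01-01T00:00:00.000000Z

-- a timestamp inside the second bucket floors to that bucket's start
SELECT timestamp_floor('5d', '2018-01-07', '2018-01-01')
-- expected result: 2018-01-06T00:00:00.000000Z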
But how do we know what value should be used for this new parameter?
A new origin
The obvious place to look for the origin is the WHERE clause. In the prior query, we have a clear interval, and could take the lower bound of the interval as our offset parameter.
But this clause is not guaranteed - what if the user wishes to sample their entire dataset?
SELECT pickup_datetime as t, count
FROM trips
SAMPLE BY 1M
Running this query over the 1.6 billion rows takes just a few seconds. From it, we get a useful breakdown of how many trips were recorded per month. This kind of query is not uncommon, and has no WHERE clause. If we were to use a non-unit stride like 5d, we can encounter misalignment issues.
To demonstrate this issue, we'll select the same rows as the earlier queries, but without using the WHERE clause.
First, we find the 5d bucket offset of the last row, counting from the first bucket at 2017-12-30:
SELECT datediff('d', '2017-12-30', pickup_datetime) / 5 as offset
FROM trips
LIMIT -1
offset |
---|
109 |
Since the last entry (LIMIT -1) sits in bucket 109, we need to jump back from the end by that many rows and then select 5 rows:
SELECT pickup_datetime as t, count
FROM trips
SAMPLE BY 5d
LIMIT -110, -105
t | count |
---|---|
2017-12-30T00:00:00.000000Z | 1349468 |
2018-01-04T00:00:00.000000Z | 1264083 |
2018-01-09T00:00:00.000000Z | 1647305 |
2018-01-14T00:00:00.000000Z | 1580716 |
2018-01-19T00:00:00.000000Z | 1542316 |
As you can see, the problem recurs!
But even when a WHERE clause is provided, it can be complex. If the clause contains an OR condition, then we may not generate an Interval Scan node, and would have to interpret a complex set of filters to try to infer an appropriate lower bound.
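For instance, a filter along these lines (a hypothetical example) has no single contiguous interval to extract, so the plan would fall back to a full scan with a row filter rather than an Interval Scan:

SELECT pickup_datetime as t, count
FROM trips
WHERE pickup_datetime IN '2018' OR fare_amount > 100
SAMPLE BY 5d
LIMIT 5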
It seems that we are asking too much of the WHERE clause, which is designed to filter data before it is sampled, not to handle the intricacies of the output histogram.
Therefore, perhaps our SAMPLE BY syntax is too limited, and we need a better way to define the shape of our output data. The best source for this information is the user - let's give them a way to communicate it!
The need for new syntax
Our popular open source repository had an open feature request raised by one of our QuestDB champions, newskooler.
This was a request to extend the current filling behaviour of SAMPLE BY to allow prefilling and postfilling of datasets. Previously, SAMPLE BY would only fill missing values between existing rows in your result set.
Consider the following query:
SELECT pickup_datetime as t, max(fare_amount) as max_fare
FROM trips
SAMPLE BY 1s FROM '2009-01-01T00:00:00Z' TO '2009-01-01T00:00:05Z'
FILL(NULL)
LIMIT 5
t | max_fare |
---|---|
2009-01-01T00:00:00.000000Z | 5.8 |
2009-01-01T00:00:01.000000Z | null |
2009-01-01T00:00:02.000000Z | 5.8 |
2009-01-01T00:00:03.000000Z | null |
2009-01-01T00:00:04.000000Z | 4.6 |
The dataset starts at the first row, and calculates buckets in steps of 1s. Where there are no trips, the max_fare column is filled with a placeholder value, in this case, null.
In some cases, you might wish to have a fixed time axis, for example, when plotting an annual chart. Let's see how this new syntax simplifies our workflow.
The trips dataset contains no data from 2019-07-01 onwards. We will take a monthly summary, extended with nulls to the whole year. This can be achieved with a LEFT JOIN:
SELECT fill.t as t, data.max_fare as max_fare FROM
(
    SELECT dateadd('M', (x-1)::INT,
           to_timestamp('2019-01-01', 'yyyy-MM-dd')) as t,
           null AS max_fare
    FROM long_sequence(12)
) AS fill
LEFT JOIN
(
    SELECT pickup_datetime as t,
           max(fare_amount) as max_fare
    FROM trips
    WHERE pickup_datetime IN '2019'
    SAMPLE BY 1M FILL(NULL)
) AS data
ON data.t = fill.t
t | max_fare |
---|---|
2019-01-01T00:00:00.000000Z | 760.0 |
2019-02-01T00:00:00.000000Z | 750.0 |
2019-03-01T00:00:00.000000Z | 800.0 |
2019-04-01T00:00:00.000000Z | 907.0 |
2019-05-01T00:00:00.000000Z | 983.5 |
2019-06-01T00:00:00.000000Z | 780.0 |
2019-07-01T00:00:00.000000Z | null |
2019-08-01T00:00:00.000000Z | null |
2019-09-01T00:00:00.000000Z | null |
2019-10-01T00:00:00.000000Z | null |
2019-11-01T00:00:00.000000Z | null |
2019-12-01T00:00:00.000000Z | null |
The above query is not ergonomic, and the issue proposed an additional FROM-TO clause to solve this problem.
Let's try it out:
SELECT pickup_datetime as t,
       max(fare_amount) as max_fare
FROM trips
SAMPLE BY 1M FROM '2019-01-01' TO '2020-01-01' FILL(NULL)
That's a dramatic improvement!
The feature
Here's our original query, using the new FROM-TO clause:
SELECT pickup_datetime as t, count
FROM trips
SAMPLE BY 5d FROM '2018-01-01' TO '2019-01-01'
LIMIT 5
t | count |
---|---|
2018-01-01T00:00:00.000000Z | 1226531 |
2018-01-06T00:00:00.000000Z | 1468302 |
2018-01-11T00:00:00.000000Z | 1604016 |
2018-01-16T00:00:00.000000Z | 1677303 |
2018-01-21T00:00:00.000000Z | 1563520 |
The FROM-TO clause uses a right-open interval, unlike BETWEEN, which is inclusive at both ends.
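To illustrate, both queries below cover the whole of 2018, but the BETWEEN bounds are both included, while TO '2019-01-01' excludes the upper bound itself:

-- BETWEEN: inclusive at both ends, so the upper bound must be
-- spelled out as the last microsecond of the year
SELECT pickup_datetime as t, count
FROM trips
WHERE pickup_datetime BETWEEN '2018-01-01' AND '2018-12-31T23:59:59.999999Z'
SAMPLE BY 5d

-- FROM-TO: right-open, so the first instant of 2019 serves as
-- the upper bound and is excluded
SELECT pickup_datetime as t, count
FROM trips
SAMPLE BY 5d FROM '2018-01-01' TO '2019-01-01'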
With or without a FILL, the FROM lower bound will be used as the offset origin to orient the buckets.
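For example, shifting the lower bound shifts every bucket with it; in this sketch, the buckets follow from the origin at the 3rd:

SELECT pickup_datetime as t, count
FROM trips
SAMPLE BY 5d FROM '2018-01-03' TO '2019-01-01'
-- buckets begin at 2018-01-03T00:00:00.000000Z,
-- then 2018-01-08, 2018-01-13, and so on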
Range filling
When using FILL(NULL) or FILL(VALUE), any missing buckets within the range will be filled.
SELECT pickup_datetime as t, count
FROM trips
SAMPLE BY 2y FROM '2007-01-01' TO '2023-01-01' FILL(NULL)
t | count |
---|---|
2007-01-01T00:00:00.000000Z | null |
2009-01-01T00:00:00.000000Z | 336976431 |
2011-01-01T00:00:00.000000Z | 352405650 |
2013-01-01T00:00:00.000000Z | 352587359 |
2015-01-01T00:00:00.000000Z | 310389601 |
2017-01-01T00:00:00.000000Z | 234948135 |
2019-01-01T00:00:00.000000Z | 47292137 |
2021-01-01T00:00:00.000000Z | null |
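FILL also accepts a constant in place of NULL. A sketch with 0 as the placeholder, so the empty 2007 and 2021 buckets would read 0 rather than null:

SELECT pickup_datetime as t, count
FROM trips
SAMPLE BY 2y FROM '2007-01-01' TO '2023-01-01' FILL(0)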
Prefilling
When FROM is used in isolation, data will only be prefilled.
SELECT pickup_datetime as t, count
FROM trips
SAMPLE BY 2y FROM '2007-01-01' FILL(NULL)
LIMIT 3
t | count |
---|---|
2007-01-01T00:00:00.000000Z | null |
2009-01-01T00:00:00.000000Z | 336976431 |
2011-01-01T00:00:00.000000Z | 352405650 |
Postfilling
When TO is used in isolation, data will be postfilled. However, keep in mind that without the FROM clause, you may encounter alignment issues again.
SELECT pickup_datetime as t, count
FROM trips
SAMPLE BY 2y TO '2023-01-01' FILL(NULL)
LIMIT -3
t | count |
---|---|
2018-01-01T00:00:00.000000Z | 157992797 |
2020-01-01T00:00:00.000000Z | null |
2022-01-01T00:00:00.000000Z | null |
Optimised WHERE clause
When your query contains FROM-TO, and no WHERE clause over the designated timestamp, we optimise this and add the WHERE clause for you. This ensures that only relevant data is read.
For example, this query:
SELECT pickup_datetime as t, count
FROM trips
SAMPLE BY 5d FROM '2018-01-01' TO '2019-01-01'
LIMIT 5
gets transformed into this:
SELECT pickup_datetime as t, count
FROM trips
WHERE pickup_datetime >= '2018-01-01' AND pickup_datetime < '2019-01-01'
SAMPLE BY 5d FROM '2018-01-01' TO '2019-01-01'
LIMIT 5
We run EXPLAIN to verify that an Interval Scan is generated:
Sort light lo: 5
keys: [t]
Async Group By workers: 46
keys: [t]
values: [count(*)]
filter: null
DataFrame
Row forward scan
Interval forward scan on: trips
intervals: [("2018-01-01T00:00:00.000000Z","2018-12-31T23:59:59.999999Z")]
The compiler only generates this node when it has a clear interval over the designated timestamp.
If you have an existing WHERE clause, but it does not consider the designated timestamp, then we will still add the interval for you.
For example:
SELECT pickup_datetime as t, count
FROM trips
WHERE cab_type = 'yellow'
SAMPLE BY 5d FROM '2018-01-01' TO '2019-01-01'
LIMIT 5
gets transformed into:
SELECT pickup_datetime as t, count
FROM trips
WHERE pickup_datetime >= '2018-01-01'
  AND pickup_datetime < '2019-01-01'
  AND cab_type = 'yellow'
SAMPLE BY 5d FROM '2018-01-01' TO '2019-01-01'
LIMIT 5
When explained, this query also contains an Interval forward scan:
Sort light lo: 5
keys: [t]
Async JIT Group By workers: 46
keys: [t]
values: [count(*)]
filter: cab_type='yellow'
DataFrame
Row forward scan
Interval forward scan on: trips
intervals: [("2018-01-01T00:00:00.000000Z","2018-12-31T23:59:59.999999Z")]
Limitations
There are some limitations for this feature, which we hope to improve in the future.
- Keyed SAMPLE BY queries are not supported. These are queries that have key columns other than the designated timestamp, alongside the aggregation functions (see the sketch after this list).
- WITH OFFSET ... will be disregarded.
- FILL(PREV) and FILL(LINEAR) are not supported.
- Parallel SAMPLE BY with FROM-TO does not support bind variables in the FROM-TO clause.
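For reference, this is the shape of a keyed query that the clause does not yet cover - cab_type here is the extra key column:

SELECT pickup_datetime as t, cab_type, count
FROM trips
SAMPLE BY 5d FROM '2018-01-01' TO '2019-01-01'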
Conclusion
This article contained some insights into our continuing development of the SAMPLE BY feature.
If you are interested in QuestDB or time-series in general, take a look at our other blog posts, and try us out!
Footnotes

[^1]: See the documentation.
[^2]: This syntax is part of QuestDB's time-series extensions, see here.
[^3]: Specifying the GROUP BY clause is not necessary, as QuestDB supports implicit GROUP BY.