ASOF Join — The "Do What I Mean" of the Database World
Dealing with time-series data means dealing with events — pieces of data that say "this thing happened at this moment". Describing "the thing" involves users, merchants, physical sensors, and so on. We denote each of these with an ID, so an event is basically a record full of IDs. To do anything meaningful with it, you must expand the ID into the info you have on that thing/person/organization. You keep this in some tables, so naturally you want a JOIN to bring it all in.
However, this info changes over time, and you have to keep the full history on it. You need the history so that you can pull in the particular info that was valid for an event as of the time it occurred, like this:
It's such an obvious concept, isn't it? But when you go write the SQL for it, you may hit a brick wall. WINDOW, ORDER BY, MAX, LIMIT, subselect, and you're still not getting what you want. Then you turn to Stack Overflow, and find a solution that involves ninja-level wielding of all those tools you were trying out but didn't quite get right. It's ten lines long, and you spend the afternoon figuring it out. Then you spend the next day trying to blend it into your existing query, which already had 10+ lines of heavyweight business logic.
You've probably hit a snag or two like this along your career, and this is what makes the ASOF JOIN so special: it's one of those rare moments where the database will just Do What You Mean, in a single keyword!
IoT example: correlate sensor data with recorded events
To make things specific, let's go with a fun example: an acceleration sensor mounted on SpaceX's Starship booster stage. In this example, we use the "as of" idea a bit differently than the above description. Instead of expanding an ID with time-sensitive metadata, we correlate the events from two tables. This is another common use case.
We have two tables:
accel_sensor
, recording the sensor's readingsevent
, where we note the start of each phase of the rocket launch
We want to combine the two tables so that every sensor data point gets annotated with the phase it belongs to.
The most recent Starship test was "Integrated Flight Test 4", occurring on June
6th, 2024. The booster went through a number of phases, starting with liftoff,
and ending in touchdown. The timestamp of the start of each phase is in the
event
table.
Sample from the accel_sensor
table:
ts | accel |
---|---|
2024-06-06 08:00:00 | 1 |
2024-06-06 08:00:01 | 1.1 |
2024-06-06 08:00:02 | 1.2 |
2024-06-06 08:00:03 | 1.3 |
2024-06-06 08:00:04 | 1.4 |
2024-06-06 08:00:05 | 1.5 |
2024-06-06 08:00:06 | 1.6 |
2024-06-06 08:00:07 | 1.7 |
2024-06-06 08:00:08 | 1.8 |
2024-06-06 08:00:09 | 1.9 |
2024-06-06 08:00:10 | 2.0 |
... | ... |
Full event
table:
ts | event |
---|---|
2024-06-06 08:00:00 | LIFTOFF |
2024-06-06 08:01:11 | MECO |
2024-06-06 08:01:16 | STAGE_SEP |
2024-06-06 08:01:28 | BOOSTBACK |
2024-06-06 08:02:03 | COASTING |
2024-06-06 08:02:07 | REENTRY |
2024-06-06 08:02:15 | LANDING_BURN |
2024-06-06 08:02:46 | TOUCHDOWN |
With these tables in place, we can issue this straightforward query:
SELECT s.ts ts, s.accel accel, e.event phase, e.ts phase_startedFROM accel_sensor sASOF JOIN event e;
And we'll get results like this (date left out for compactness):
ts | accel | phase | phase_started |
---|---|---|---|
... | ... | ... | ... |
08:01:08 | 3.4700 | LIFTOFF | 08:00:00 |
08:01:09 | 3.4800 | LIFTOFF | 08:00:00 |
08:01:10 | 3.4900 | LIFTOFF | 08:00:00 |
08:01:11 | 3.5000 | MECO | 08:01:11 |
08:01:12 | 2.0000 | MECO | 08:01:11 |
08:01:13 | 1.5000 | MECO | 08:01:11 |
08:01:14 | 1.0000 | MECO | 08:01:11 |
08:01:15 | 0.5000 | MECO | 08:01:11 |
08:01:16 | 0.0000 | STAGE_SEP | 08:01:16 |
08:01:17 | -0.2000 | STAGE_SEP | 08:01:16 |
08:01:18 | -0.4000 | STAGE_SEP | 08:01:16 |
... | ... | ... | ... |
The simplicity of that query feels almost magical. Did the database read my mind? I didn't even specify the join columns! It just... worked? Yes, in a time-series database, the timestamp is a first-class citizen and the database already knows where to find it in each table. You don't have to lift a finger!
Finance example: match buy and sell trades
Now let's look into something a bit more complex from the world of finance. We
have a single table trades
which contains trades executed from Coinbase for
BTCUSD.
The trades come from either side: buy
or sell
. We want to match up the buy
side with the sell
side, per trade, so that we can find the most recent in a
given moment. We'll take the difference in price between them to get the price
spread at that point in time.
There are two interesting things here:
- we join a single table with itself, using different subsets of it
- just matching on the most recent timestamp isn't enough — we must match the coin's symbol as well
To address the second point, we'll add a bit more syntax:
buy ASOF JOIN sell ON (symbol)
It's still quite compact — column name is the same on both sides, so we don't have to specify the full JOIN condition.
In order to prepare the two trade sides, we'll use the Common Table Expression (CTE) feature, which allows us to declare named sub-queries before the main one:
WITHbuy AS( SELECT timestamp, symbol, price FROM trades WHERE side = 'buy' ),sell AS( SELECT timestamp, symbol, price FROM trades WHERE side = 'sell' )
The rest is straightforward, we just SELECT what we're interested in:
SELECTbuy.timestamp timestamp,buy.symbol symbol,(buy.price - sell.price) spread
And that's it, here's the full query:
WITHbuy AS( SELECT timestamp, symbol, price FROM trades WHERE side = 'buy' ),sell AS( SELECT timestamp, symbol, price FROM trades WHERE side = 'sell' )SELECTbuy.timestamp timestamp,buy.symbol symbol,(buy.price - sell.price) spreadFROM buy ASOF JOIN sell ON (symbol);
With these buy- and sell-side trades:
timestamp | symbol | buy |
---|---|---|
2024-01-01 00:00:00.834 | BTC-USD | 42323.1 |
2024-01-01 00:00:00.853 | BTC-USD | 42288.59 |
2024-01-01 00:00:00.864 | BTC-USD | 42288.59 |
2024-01-01 00:00:03.889 | BTC-USD | 42297.6 |
2024-01-01 00:00:03.889 | BTC-USD | 42297.61 |
2024-01-01 00:00:03.889 | BTC-USD | 42297.62 |
2024-01-01 00:00:03.929 | BTC-USD | 42298.13 |
2024-01-01 00:00:04.948 | BTC-USD | 42288.75 |
2024-01-01 00:00:04.998 | BTC-USD | 42288.75 |
2024-01-01 00:00:05.578 | BTC-USD | 42283.21 |
timestamp | symbol | sell |
---|---|---|
2024-01-01 00:00:00.273 | BTC-USD | 42288.58 |
2024-01-01 00:00:00.844 | BTC-USD | 42320.0 |
2024-01-01 00:00:00.849 | BTC-USD | 42289.52 |
2024-01-01 00:00:01.175 | BTC-USD | 42293.38 |
2024-01-01 00:00:01.540 | BTC-USD | 42294.52 |
2024-01-01 00:00:01.589 | BTC-USD | 42292.8 |
2024-01-01 00:00:03.123 | BTC-USD | 42292.96 |
2024-01-01 00:00:03.564 | BTC-USD | 42291.41 |
2024-01-01 00:00:03.564 | BTC-USD | 42290.85 |
2024-01-01 00:00:04.538 | BTC-USD | 42288.75 |
2024-01-01 00:00:04.680 | BTC-USD | 42288.09 |
2024-01-01 00:00:05.377 | BTC-USD | 42285.96 |
We can expect this result from our query:
timestamp | symbol | spread |
---|---|---|
2024-01-01 00:00:00.834 | BTC-USD | 34.51 |
2024-01-01 00:00:00.853 | BTC-USD | -0.93 |
2024-01-01 00:00:00.864 | BTC-USD | -0.93 |
2024-01-01 00:00:03.889 | BTC-USD | 6.75 |
2024-01-01 00:00:03.889 | BTC-USD | 6.76 |
2024-01-01 00:00:03.889 | BTC-USD | 6.77 |
2024-01-01 00:00:03.929 | BTC-USD | 7.27 |
2024-01-01 00:00:04.948 | BTC-USD | 0.66 |
2024-01-01 00:00:04.998 | BTC-USD | 0.66 |
2024-01-01 00:00:05.578 | BTC-USD | -1.22 |
Getting creative: match with an event from the future
For our final example, let's do some proper analytics. We want to see the effect
of placing large trades. After seeing a buy
order filled for more than 10 BTC,
where did the price move five minutes later?
Surprisingly, we can use ASOF JOIN to get this with just a little bit of trickery. Instead of using the timestamp as-is, we'll subtract five minutes from it! That will make the event appear as though it occurred five minutes earlier, resulting in a match between a current trade and a trade from five minutes later.
Here we'll again need a bit more syntax. Since our sub-query for the future trades doesn't use the timestamp column directly, we must tell the query engine to use the result of the function as the designated timestamp:
(SELECT dateadd('m', -5, timestamp) timestamp, symbol, priceFROM trades WHERE side = 'buy') TIMESTAMP(timestamp);
For the rest, we just reuse the elements we've already seen in the previous example:
WITHpresent AS( SELECT timestamp, symbol, priceFROM trades WHERE side = 'buy' AND amount > 10 ),future AS( (SELECT dateadd('m', -5, timestamp) timestamp, symbol, priceFROM trades WHERE side = 'buy') timestamp(timestamp) )SELECTpresent.timestamp,present.symbol,((future.price - present.price) / present.price) * 100 price_delta_pctFROM present ASOF JOIN future ON (symbol);
Here's a sample from the results for BTC in 2024:
timestamp | symbol | price_delta_pct |
---|---|---|
2024-01-02 00:57:23.419 | BTC-USD | 0.34 |
2024-01-02 00:57:23.466 | BTC-USD | 0.34 |
2024-01-02 00:57:47.778 | BTC-USD | 0.39 |
2024-01-02 15:00:10.449 | BTC-USD | -0.19 |
2024-01-02 15:00:16.466 | BTC-USD | -0.22 |
2024-01-05 02:00:18.178 | BTC-USD | 0.2 |
2024-01-05 02:00:18.183 | BTC-USD | -0.05 |
2024-01-08 18:00:07.242 | BTC-USD | 0.59 |
2024-01-08 18:00:07.242 | BTC-USD | 0.42 |
2024-01-08 18:00:39.500 | BTC-USD | 0.07 |
The price seems to have moved up most of the time... but not all of the time. The market is unpredictable, who would've guessed!
ASOF JOIN support across popular databases
Let's have a quick overview of the syntax needed to get the ASOF JOIN in a few of the popular databases. We'll use the query from the first example, it's the simplest one and focuses just on the "as of" aspect:
SELECT s.ts ts, s.accel accel, e.event phaseFROM accel_sensor sASOF JOIN event e;
DuckDB
In DuckDB, it's almost the same, you just have to spell out the "designated timestamp" column:
SELECT s.ts ts, s.accel accel, e.event phaseFROM accel_sensor sASOF JOIN event e USING (ts);
Clickhouse
With ClickHouse it's a bit more tricky — it doesn't do pure ASOF joins, the "as of" condition must be added on top of a regular equality condition (the "equijoin"). However, we don't have any such condition, so we'll need to hack our way through a bit. We slap on a dummy boolean "id" column, which is a constant "false" everywhere. That allows us to write it as follows:
SELECT s.ts ts, s.accel accel, e.event phaseFROM accel_sensor sASOF JOIN event e ON s.ts >= e.ts AND s.id = e.id;
TimescaleDB
TimescaleDB is pure PostgreSQL in terms of syntax, it's a plugin that optimizes time-series queries.
PostgreSQL does not feature the ASOF JOIN, so we have to get into that "ninja-level wielding" of anything else the database offers.
Here's one way, using the LATERAL JOIN. This kind of join allows the inner
select to observe the value of s.ts
from the outer select, for each row
selected by the outer select. This in turn allows us to set the filter to
e.ts <= s.ts
.
We also need the "boolean id" hack, just like with ClickHouse. Further, we must
use ORDER BY e.ts DESC
so that we get the most recent event as the first
result, and finally we use LIMIT 1
to take just that top result:
SELECT s.ts ts, s.accel accel, e.event phaseFROM accel_sensor sLEFT JOIN LATERAL (SELECT * from event e WHERE e.ts <= s.ts ORDER BY e.ts DESC LIMIT 1) e on s.id = e.id;
Conclusion
I hope I managed to share a bit of enthusiasm I have for this lovely feature of the ASOF JOIN! Its essence is aligning two things that vary over time, each recorded at independent points in time. It's such a natural concept that it keeps popping up in all kinds of use cases around time-series data.
QuestDB supports a few more variations on this theme: the LT JOIN and the SPLICE JOIN. Check out our documentation to find out more.
Do you have your own cool use case for ASOF JOIN?
Let us know on social media or in our lively Community Forum!