ASOF Join — The "Do What I Mean" of the Database World


Dealing with time-series data means dealing with events — pieces of data that say "this thing happened at this moment". Describing "the thing" involves users, merchants, physical sensors, and so on. We denote each of these with an ID, so an event is basically a record full of IDs. To do anything meaningful with it, you must expand the ID into the info you have on that thing/person/organization. You keep this in some tables, so naturally you want a JOIN to bring it all in.

However, this info changes over time, and you have to keep the full history on it. You need the history so that you can pull in the particular info that was valid for an event as of the time it occurred, like this:

[Figure: ASOF JOIN illustrated. Two vertical timelines with events on them; arrows connect each event on the left with the most recent event on the right.]

It's such an obvious concept, isn't it? But when you go write the SQL for it, you may hit a brick wall. WINDOW, ORDER BY, MAX, LIMIT, subselect, and you're still not getting what you want. Then you turn to Stack Overflow, and find a solution that involves ninja-level wielding of all those tools you were trying out but didn't quite get right. It's ten lines long, and you spend the afternoon figuring it out. Then you spend the next day trying to blend it into your existing query, which already had 10+ lines of heavyweight business logic.

You've probably hit a snag or two like this in your career, and this is what makes the ASOF JOIN so special: it's one of those rare moments where the database will just Do What You Mean, in a single keyword!

IoT example: correlate sensor data with recorded events

To make things specific, let's go with a fun example: an acceleration sensor mounted on SpaceX's Starship booster stage. In this example, we use the "as of" idea a bit differently than the above description. Instead of expanding an ID with time-sensitive metadata, we correlate the events from two tables. This is another common use case.

We have two tables:

  1. accel_sensor, recording the sensor's readings
  2. event, where we note the start of each phase of the rocket launch

We want to combine the two tables so that every sensor data point gets annotated with the phase it belongs to.

The most recent Starship test was "Integrated Flight Test 4", occurring on June 6th, 2024. The booster went through a number of phases, starting with liftoff, and ending in touchdown. The timestamp of the start of each phase is in the event table.

Sample from the accel_sensor table:

ts                   accel
2024-06-06 08:00:00  1
2024-06-06 08:00:01  1.1
2024-06-06 08:00:02  1.2
2024-06-06 08:00:03  1.3
2024-06-06 08:00:04  1.4
2024-06-06 08:00:05  1.5
2024-06-06 08:00:06  1.6
2024-06-06 08:00:07  1.7
2024-06-06 08:00:08  1.8
2024-06-06 08:00:09  1.9
2024-06-06 08:00:10  2.0
...                  ...

Full event table:

ts                   event
2024-06-06 08:00:00  LIFTOFF
2024-06-06 08:01:11  MECO
2024-06-06 08:01:16  STAGE_SEP
2024-06-06 08:01:28  BOOSTBACK
2024-06-06 08:02:03  COASTING
2024-06-06 08:02:07  REENTRY
2024-06-06 08:02:15  LANDING_BURN
2024-06-06 08:02:46  TOUCHDOWN

With these tables in place, we can issue this straightforward query:

SELECT s.ts ts, s.accel accel, e.event phase, e.ts phase_started
FROM accel_sensor s
ASOF JOIN event e;

And we'll get results like this (date left out for compactness):

ts        accel    phase      phase_started
...       ...      ...        ...
08:01:08  3.4700   LIFTOFF    08:00:00
08:01:09  3.4800   LIFTOFF    08:00:00
08:01:10  3.4900   LIFTOFF    08:00:00
08:01:11  3.5000   MECO       08:01:11
08:01:12  2.0000   MECO       08:01:11
08:01:13  1.5000   MECO       08:01:11
08:01:14  1.0000   MECO       08:01:11
08:01:15  0.5000   MECO       08:01:11
08:01:16  0.0000   STAGE_SEP  08:01:16
08:01:17  -0.2000  STAGE_SEP  08:01:16
08:01:18  -0.4000  STAGE_SEP  08:01:16
...       ...      ...        ...

The simplicity of that query feels almost magical. Did the database read my mind? I didn't even specify the join columns! It just... worked? Yes, in a time-series database, the timestamp is a first-class citizen and the database already knows where to find it in each table. You don't have to lift a finger!
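To make the matching rule explicit, here is a minimal Python sketch of what an "as of" match does: for each sensor reading, pick the latest event whose timestamp is at or before the reading. The asof_join helper is a hypothetical name made up for this illustration and says nothing about how QuestDB implements the join internally. The event rows are the ones from the table above, and the readings are a few rows from the result sample.

```python
from bisect import bisect_right

# Event table from the article: (phase start time, phase name), sorted by time.
# Fixed-width HH:MM:SS strings compare in chronological order.
events = [
    ("08:00:00", "LIFTOFF"),
    ("08:01:11", "MECO"),
    ("08:01:16", "STAGE_SEP"),
    ("08:01:28", "BOOSTBACK"),
    ("08:02:03", "COASTING"),
    ("08:02:07", "REENTRY"),
    ("08:02:15", "LANDING_BURN"),
    ("08:02:46", "TOUCHDOWN"),
]

# A few sensor readings taken from the result sample: (time, acceleration).
readings = [
    ("08:01:10", 3.49),
    ("08:01:11", 3.50),
    ("08:01:15", 0.50),
    ("08:01:16", 0.00),
]


def asof_join(left, right):
    """Pair each left row with the latest right row whose timestamp is <= the left one.

    Hypothetical helper for illustration; both inputs must be sorted by timestamp.
    """
    right_ts = [ts for ts, _ in right]
    joined = []
    for ts, value in left:
        i = bisect_right(right_ts, ts)  # number of right rows with timestamp <= ts
        match = right[i - 1] if i > 0 else (None, None)  # no earlier event: leave empty
        joined.append((ts, value, match[1], match[0]))
    return joined


result = asof_join(readings, events)
for row in result:
    print(row)
```

In this sketch the 08:01:10 reading lands in LIFTOFF, the 08:01:11 and 08:01:15 readings land in MECO, and the 08:01:16 reading lands in STAGE_SEP, in line with the result sample above.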

Finance example: match buy and sell trades

Now let's look into something a bit more complex from the world of finance. We have a single table, trades, which contains BTC-USD trades executed on Coinbase.

The trades come from either side: buy or sell. For each buy trade, we want to find the most recent sell trade as of that moment. We'll then take the difference in price between the two to get the price spread at that point in time.

There are two interesting things here:

  1. we join a single table with itself, using different subsets of it
  2. just matching on the most recent timestamp isn't enough — we must match the coin's symbol as well

To address the second point, we'll add a bit more syntax:

buy ASOF JOIN sell ON (symbol)

It's still quite compact: the column name is the same on both sides, so we don't have to spell out the full JOIN condition.

In order to prepare the two trade sides, we'll use the Common Table Expression (CTE) feature, which allows us to declare named sub-queries before the main one:

WITH
buy AS
( SELECT timestamp, symbol, price FROM trades WHERE side = 'buy' ),
sell AS
( SELECT timestamp, symbol, price FROM trades WHERE side = 'sell' )

The rest is straightforward: we just SELECT what we're interested in.

SELECT
buy.timestamp timestamp,
buy.symbol symbol,
(buy.price - sell.price) spread

And that's it, here's the full query:

WITH
buy AS
( SELECT timestamp, symbol, price FROM trades WHERE side = 'buy' ),
sell AS
( SELECT timestamp, symbol, price FROM trades WHERE side = 'sell' )

SELECT
buy.timestamp timestamp,
buy.symbol symbol,
(buy.price - sell.price) spread
FROM buy ASOF JOIN sell ON (symbol);

With these buy- and sell-side trades:

timestamp                symbol   buy
2024-01-01 00:00:00.834  BTC-USD  42323.1
2024-01-01 00:00:00.853  BTC-USD  42288.59
2024-01-01 00:00:00.864  BTC-USD  42288.59
2024-01-01 00:00:03.889  BTC-USD  42297.6
2024-01-01 00:00:03.889  BTC-USD  42297.61
2024-01-01 00:00:03.889  BTC-USD  42297.62
2024-01-01 00:00:03.929  BTC-USD  42298.13
2024-01-01 00:00:04.948  BTC-USD  42288.75
2024-01-01 00:00:04.998  BTC-USD  42288.75
2024-01-01 00:00:05.578  BTC-USD  42283.21

timestamp                symbol   sell
2024-01-01 00:00:00.273  BTC-USD  42288.58
2024-01-01 00:00:00.844  BTC-USD  42320.0
2024-01-01 00:00:00.849  BTC-USD  42289.52
2024-01-01 00:00:01.175  BTC-USD  42293.38
2024-01-01 00:00:01.540  BTC-USD  42294.52
2024-01-01 00:00:01.589  BTC-USD  42292.8
2024-01-01 00:00:03.123  BTC-USD  42292.96
2024-01-01 00:00:03.564  BTC-USD  42291.41
2024-01-01 00:00:03.564  BTC-USD  42290.85
2024-01-01 00:00:04.538  BTC-USD  42288.75
2024-01-01 00:00:04.680  BTC-USD  42288.09
2024-01-01 00:00:05.377  BTC-USD  42285.96

We can expect this result from our query:

timestamp                symbol   spread
2024-01-01 00:00:00.834  BTC-USD  34.51
2024-01-01 00:00:00.853  BTC-USD  -0.93
2024-01-01 00:00:00.864  BTC-USD  -0.93
2024-01-01 00:00:03.889  BTC-USD  6.75
2024-01-01 00:00:03.889  BTC-USD  6.76
2024-01-01 00:00:03.889  BTC-USD  6.77
2024-01-01 00:00:03.929  BTC-USD  7.27
2024-01-01 00:00:04.948  BTC-USD  0.66
2024-01-01 00:00:04.998  BTC-USD  0.66
2024-01-01 00:00:05.578  BTC-USD  -1.22
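If you'd like to see the keyed matching spelled out, here is a rough Python sketch of what ON (symbol) adds: the "most recent row" lookup happens separately within each symbol. It uses a subset of the sample trades above, with the date dropped for brevity. The asof_join_on_key helper and the ETH-USD row are hypothetical, added only for illustration, and the sketch makes no claim about how the query engine does this internally.

```python
from bisect import bisect_right
from collections import defaultdict
from decimal import Decimal

# Subset of the article's sample trades: (timestamp, symbol, price).
buys = [
    ("00:00:00.853", "BTC-USD", Decimal("42288.59")),
    ("00:00:03.889", "BTC-USD", Decimal("42297.6")),
    ("00:00:04.948", "BTC-USD", Decimal("42288.75")),
]
sells = [
    ("00:00:00.273", "BTC-USD", Decimal("42288.58")),
    ("00:00:00.844", "BTC-USD", Decimal("42320.0")),
    ("00:00:00.849", "BTC-USD", Decimal("42289.52")),
    ("00:00:03.564", "BTC-USD", Decimal("42291.41")),
    ("00:00:03.564", "BTC-USD", Decimal("42290.85")),
    ("00:00:03.700", "ETH-USD", Decimal("2300.00")),  # hypothetical row, must be ignored
    ("00:00:04.680", "BTC-USD", Decimal("42288.09")),
]


def asof_join_on_key(left, right):
    """For each left row, find the latest right row with the same symbol
    and a timestamp <= the left row's timestamp.

    Hypothetical helper; `right` is assumed to be sorted by timestamp.
    """
    by_symbol = defaultdict(list)
    for row in right:
        by_symbol[row[1]].append(row)
    out = []
    for ts, symbol, price in left:
        candidates = by_symbol[symbol]
        # Among equal timestamps, bisect_right picks the row that comes last.
        i = bisect_right([r[0] for r in candidates], ts)
        match = candidates[i - 1] if i > 0 else None
        out.append((ts, symbol, price, match))
    return out


spreads = [
    (ts, symbol, price - match[2] if match else None)
    for ts, symbol, price, match in asof_join_on_key(buys, sells)
]
for row in spreads:
    print(row)
```

For these three buy trades the sketch arrives at spreads of -0.93, 6.75, and 0.66, the same values as the corresponding rows in the result table. Decimal is used here only to keep the subtraction exact.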

Getting creative: match with an event from the future

For our final example, let's do some proper analytics. We want to see the effect of placing large trades. After seeing a buy order filled for more than 10 BTC, where did the price move five minutes later?

Surprisingly, we can use ASOF JOIN to get this with just a little bit of trickery. Instead of using the timestamp as-is, we'll subtract five minutes from it! That will make the event appear as though it occurred five minutes earlier, resulting in a match between a current trade and a trade from five minutes later.
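Here is a small Python sketch of why the shift works, using made-up trades rather than real market data. The asof_price helper is hypothetical. The point it illustrates is that an "as of" lookup against timestamps shifted five minutes back gives the same row as an "as of" lookup five minutes ahead against the original timestamps.

```python
from bisect import bisect_right
from datetime import datetime, timedelta

# Hypothetical buy trades: (timestamp, price), sorted by time. Not real market data.
trades = [
    (datetime(2024, 1, 2, 0, 57, 0), 100.0),
    (datetime(2024, 1, 2, 0, 59, 30), 100.2),
    (datetime(2024, 1, 2, 1, 1, 50), 100.5),
    (datetime(2024, 1, 2, 1, 3, 0), 100.9),
]
five_min = timedelta(minutes=5)


def asof_price(rows, ts):
    """Price of the latest row whose timestamp is <= ts, or None if there is none."""
    i = bisect_right([t for t, _ in rows], ts)
    return rows[i - 1][1] if i > 0 else None


# Shift every trade five minutes into the past, like dateadd('m', -5, timestamp).
shifted = [(t - five_min, price) for t, price in trades]

present_ts, present_price = trades[0]

# "As of now" against the shifted rows...
future_price = asof_price(shifted, present_ts)
# ...is the same as "as of five minutes from now" against the original rows.
same_price = asof_price(trades, present_ts + five_min)

price_delta_pct = (future_price - present_price) / present_price * 100
print(future_price, same_price, price_delta_pct)
```

In this sketch the trade at 00:57:00 gets matched with the trade at 01:01:50, the last one at or before 01:02:00.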

Here we'll again need a bit more syntax. Since our sub-query for the future trades doesn't use the timestamp column directly, we must tell the query engine to use the result of the function as the designated timestamp:

(SELECT dateadd('m', -5, timestamp) timestamp, symbol, price
FROM trades WHERE side = 'buy') TIMESTAMP(timestamp)

For the rest, we just reuse the elements we've already seen in the previous example:

WITH
present AS
( SELECT timestamp, symbol, price
FROM trades WHERE side = 'buy' AND amount > 10 ),
future AS
( (SELECT dateadd('m', -5, timestamp) timestamp, symbol, price
FROM trades WHERE side = 'buy') timestamp(timestamp) )

SELECT
present.timestamp,
present.symbol,
((future.price - present.price) / present.price) * 100 price_delta_pct
FROM present ASOF JOIN future ON (symbol);

Here's a sample from the results for BTC in 2024:

timestamp                symbol   price_delta_pct
2024-01-02 00:57:23.419  BTC-USD  0.34
2024-01-02 00:57:23.466  BTC-USD  0.34
2024-01-02 00:57:47.778  BTC-USD  0.39
2024-01-02 15:00:10.449  BTC-USD  -0.19
2024-01-02 15:00:16.466  BTC-USD  -0.22
2024-01-05 02:00:18.178  BTC-USD  0.2
2024-01-05 02:00:18.183  BTC-USD  -0.05
2024-01-08 18:00:07.242  BTC-USD  0.59
2024-01-08 18:00:07.242  BTC-USD  0.42
2024-01-08 18:00:39.500  BTC-USD  0.07

The price seems to have moved up most of the time... but not all of the time. The market is unpredictable, who would've guessed!

ASOF JOIN in other databases

Let's have a quick overview of the syntax needed to get the ASOF JOIN in a few of the popular databases. We'll use the query from the first example, since it's the simplest one and focuses just on the "as of" aspect:

SELECT s.ts ts, s.accel accel, e.event phase
FROM accel_sensor s
ASOF JOIN event e;

DuckDB

In DuckDB, the query is almost the same. You just have to spell out the column that plays the role of the "designated timestamp":

SELECT s.ts ts, s.accel accel, e.event phase
FROM accel_sensor s
ASOF JOIN event e USING (ts);

ClickHouse

With ClickHouse it's a bit trickier. It doesn't do pure ASOF joins: the "as of" condition must be added on top of a regular equality condition (the "equijoin"). We don't have any such condition, so we'll need to hack our way through a bit. We slap a dummy boolean "id" column onto both tables, a constant "false" everywhere. That allows us to write it as follows:

SELECT s.ts ts, s.accel accel, e.event phase
FROM accel_sensor s
ASOF JOIN event e ON s.ts >= e.ts AND s.id = e.id;

TimescaleDB

TimescaleDB is pure PostgreSQL in terms of syntax. It's an extension that optimizes time-series queries.

PostgreSQL does not feature the ASOF join, so we have to get into that "ninja-level wielding" of anything else the database offers.

Here's one way, using the LATERAL JOIN. This kind of join allows the inner select to observe the value of s.ts from the outer select, for each row selected by the outer select. This in turn allows us to set the filter to e.ts <= s.ts.

We also reuse the "boolean id" hack from the ClickHouse example as the join condition (PostgreSQL would also accept a plain ON true here). Further, we must use ORDER BY e.ts DESC so that we get the most recent event as the first result, and finally we use LIMIT 1 to take just that top result:

SELECT s.ts ts, s.accel accel, e.event phase
FROM accel_sensor s
LEFT JOIN LATERAL (
    SELECT * FROM event e WHERE e.ts <= s.ts ORDER BY e.ts DESC LIMIT 1
) e ON s.id = e.id;
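To try the "filter, sort descending, take one" idea without installing anything, here is a small sketch using SQLite through Python's built-in sqlite3 module. SQLite is not one of the databases discussed here. It is used only because it ships with Python, and the query is written as a correlated subquery rather than a LATERAL join. The rows are a few sensor readings and events from the first example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accel_sensor (ts TEXT, accel REAL);
    CREATE TABLE event (ts TEXT, event TEXT);
    INSERT INTO accel_sensor VALUES
        ('2024-06-06 08:01:10', 3.49),
        ('2024-06-06 08:01:11', 3.50),
        ('2024-06-06 08:01:16', 0.00);
    INSERT INTO event VALUES
        ('2024-06-06 08:00:00', 'LIFTOFF'),
        ('2024-06-06 08:01:11', 'MECO'),
        ('2024-06-06 08:01:16', 'STAGE_SEP');
""")

# For each sensor row, the subquery keeps only events at or before s.ts,
# sorts them newest-first, and takes the top one.
rows = conn.execute("""
    SELECT s.ts, s.accel,
           (SELECT e.event FROM event e
            WHERE e.ts <= s.ts
            ORDER BY e.ts DESC LIMIT 1) AS phase
    FROM accel_sensor s
    ORDER BY s.ts
""").fetchall()

for row in rows:
    print(row)
```

The three readings come back annotated with LIFTOFF, MECO, and STAGE_SEP respectively. Timestamps are stored as text here, which sorts correctly only because they all share the same fixed-width format.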

Conclusion

I hope I managed to share a bit of my enthusiasm for this lovely feature, the ASOF JOIN! Its essence is aligning two things that vary over time, each recorded at independent points in time. It's such a natural concept that it keeps popping up in all kinds of use cases around time-series data.

QuestDB supports a few more variations on this theme: the LT JOIN and the SPLICE JOIN. Check out our documentation to find out more.

Do you have your own cool use case for ASOF JOIN?

Let us know on social media or in our lively Community Forum!
