Create a sample database

This guide walks you through creating a sample dataset.

It uses rnd_ functions and basic SQL grammar to generate 'mock' data of specific types.

For most applications, you will import your data using methods like the InfluxDB Line Protocol, CSV imports, or integration with third-party tools such as Telegraf, Kafka, or Prometheus. If your interest lies in data ingestion rather than generation, refer to our ingestion overview. Alternatively, the QuestDB demo instance offers a practical way to explore data creation and manipulation without setting up your dataset.

All that said, in this tutorial you will learn how to:

  1. Create tables
  2. Populate tables with sample data
  3. Run simple and advanced queries
  4. Delete tables

Before we begin...

All commands are run through the Web Console accessible at http://localhost:9000.

You can also run the same SQL via the Postgres endpoint or the REST API.

If QuestDB is not running locally, check out the quick start.

Creating a table

With QuestDB running, the first step is to create a table.

We'll start with one representing financial market data. Then in the insert section, we'll create another pair of tables representing temperature sensors and their readings.

Let's start by creating the ticker_price table:

CREATE TABLE ticker_price (
    ts TIMESTAMP,
    ticker SYMBOL,
    side SYMBOL,
    price DOUBLE
) TIMESTAMP(ts) PARTITION BY DAY WAL
DEDUP UPSERT KEYS(ts, ticker);

This is a basic yet robust table. It uses SYMBOL columns for ticker and side, a DOUBLE for price, and a designated timestamp. It's partitioned by day, is a WAL table, and deduplicates rows on the timestamp and ticker columns. There's lots to unpack in this table! Feel free to learn more about the nuances of each of these features.

We've designed the table to match how we'll query the data. We're focused on the day-to-day flow of the market, hence the daily partitioning. We're also wary of duplicates: for accuracy, deduplication ensures that a row with an identical timestamp and ticker updates the existing row rather than creating a duplicate. Timestamps are essential for time-series analysis.
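
To see the effect of DEDUP, here's a minimal, optional sketch (the values are arbitrary and not part of the walkthrough): two inserts sharing the same ts and ticker leave exactly one row behind, with the later values winning.

Deduplication sketch
INSERT INTO ticker_price VALUES ('2023-01-01T00:00:00.000000Z', 'AAPL', 'BUY', 100.0);
INSERT INTO ticker_price VALUES ('2023-01-01T00:00:00.000000Z', 'AAPL', 'SELL', 101.5);
-- Only one AAPL row remains at this timestamp; the second insert replaced the first.
-- (WAL writes apply asynchronously, so allow a moment before querying.)
SELECT * FROM ticker_price WHERE ticker = 'AAPL';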

Next, we'll proceed to INSERT.

Inserting data

Financial market data

Let's populate our ticker_price table with procedurally-generated data:

Insert as SELECT
INSERT INTO ticker_price
SELECT
    timestamp_sequence('2023-01-01T00:00:00', 60000L * x) ts, -- Ascending timestamps from Jan 1, 2023, with a step that grows by 60ms each row
    rnd_str('AAPL', 'GOOGL', 'MSFT', 'AMZN', 'FB') ticker, -- Random ticker symbols
    rnd_str('BUY', 'SELL') side, -- Random side (BUY or SELL)
    rnd_double() * 100 + 10 price -- Random price between 10.0 and 110.0
FROM long_sequence(10000) x;

Our ticker_price table now contains 10,000 randomly generated trades. The comments indicate how we've structured the random data: a handful of tickers, BUY vs. SELL sides, random prices, and ascending timestamps. The overall number of rows is dictated by long_sequence(10000); we can bump that up if we want.

We've also kept the timestamps purely synthetic, with a simple growing interval; real trades against these companies would arrive far more frequently and irregularly, but this helps keep our basic example basic.
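
If you'd prefer exactly one row per minute, a minimal variant of the insert above (assuming the same ticker_price table) passes timestamp_sequence a constant step of 60,000,000 microseconds, i.e. one minute, rather than a growing one:

Insert as SELECT, one row per minute
INSERT INTO ticker_price
SELECT
    timestamp_sequence('2023-01-01T00:00:00', 60000000L) ts, -- fixed one-minute step (the step is in microseconds)
    rnd_str('AAPL', 'GOOGL', 'MSFT', 'AMZN', 'FB') ticker,
    rnd_str('BUY', 'SELL') side,
    rnd_double() * 100 + 10 price
FROM long_sequence(10000);
-- Note: rows that collide with existing (ts, ticker) pairs are upserted due to DEDUP.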

Now let's look at the table and its data, using the table name as shorthand for SELECT * FROM ticker_price:

'ticker_price';

It will look similar to this, albeit with different randomized values.

| ts | ticker | side | price |
| --- | --- | --- | --- |
| 2023-01-01T00:00:00.000000Z | AAPL | SELL | 91.166295916161 |
| 2023-01-01T00:00:00.060000Z | AMZN | SELL | 68.260944075324 |
| 2023-01-01T00:00:00.180000Z | FB | SELL | 99.29323755252 |
| 2023-01-01T00:00:00.360000Z | MSFT | BUY | 57.891314056011 |
| 2023-01-01T00:00:00.600000Z | AAPL | SELL | 10.249165832651 |

That's some fake market data. What about, say, sensor data?

Sensors and readings

This next example will create and populate two more tables. One table will contain the metadata of our sensors, and the other will contain the actual readings (payload data) from these sensors. In both cases, we will create the table and generate the data at the same time.

This combines the CREATE TABLE and SELECT operations to perform a create-and-insert:

Create table as, readings
CREATE TABLE readings
AS(
    SELECT
        x ID, -- Increasing integer
        timestamp_sequence(to_timestamp('2019-10-17T00:00:00', 'yyyy-MM-ddTHH:mm:ss'), rnd_long(1,10,0) * 100000L) ts, -- Ascending timestamps with random gaps of 0.1s to 1s
        rnd_double(0)*8 + 15 temp, -- Random temperature between 15.0 and 23.0
        rnd_long(0, 10000, 0) sensorId -- Random sensor ID between 0 and 10000
    FROM long_sequence(10000000) x)
TIMESTAMP(ts)
PARTITION BY MONTH DEDUP UPSERT KEYS(ts);

For this table, again note the following key points:

  • TIMESTAMP(ts) elects the ts column as the designated timestamp, enabling partitioning over time.
  • PARTITION BY MONTH creates monthly partitions, so the stored data is effectively sharded by month. You can inspect the resulting partitions with the sketch after this list.
  • DEDUP UPSERT KEYS(ts) deduplicates rows on the ts column.
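
As an optional check, recent QuestDB versions can list a table's partitions so you can confirm how the readings data was laid out (the exact columns returned may vary by version):

List partitions
SHOW PARTITIONS FROM readings;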

The generated data will look like the following:

| ID | ts | temp | sensorId |
| --- | --- | --- | --- |
| 1 | 2019-10-17T00:00:00.000000Z | 19.37373911 | 9160 |
| 2 | 2019-10-17T00:00:00.600000Z | 21.91184617 | 9671 |
| 3 | 2019-10-17T00:00:01.400000Z | 16.58367834 | 8731 |
| 4 | 2019-10-17T00:00:01.500000Z | 16.69308815 | 3447 |
| 5 | 2019-10-17T00:00:01.600000Z | 19.67991569 | 7985 |
| ... | ... | ... | ... |

Nice. Now for our next table, which holds the sensors themselves and their details:

Create table as, sensors
CREATE TABLE sensors
AS(
    SELECT
        x ID, -- Increasing integer
        rnd_str('Eberle', 'Honeywell', 'Omron', 'United Automation', 'RS Pro') make, -- Random manufacturer
        rnd_str('New York', 'Miami', 'Boston', 'Chicago', 'San Francisco') city -- Random city
    FROM long_sequence(10000) x);

Note that we've not included a timestamp in this sensors table. This is one of the rare, demonstrative examples where we skip it, and thus forgo most of the benefits of time-series optimization. Since the paired readings table does have a timestamp, it's helpful to demonstrate the two together.

With these two new tables, and our prior financial market data table, we've got a lot of useful queries we can test.

Running queries

Our financial market data table is a great place to test various aggregate functions, compute prices over time intervals, and perform similar analysis.
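
For example, a sketch along these lines (not part of the original walkthrough) buckets prices per ticker into hourly intervals using SAMPLE BY, which we'll see again below; adjust the interval and aggregates to taste:

Hourly price statistics per ticker
SELECT ts, ticker, min(price), max(price), avg(price)
FROM ticker_price
SAMPLE BY 1h
ALIGN TO CALENDAR;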

However, for now we'll expand on the readings and sensors tables.

First, let's look at readings, using the table name as shorthand for SELECT * FROM readings:

readings;

Let's then select the count of records from readings:

SELECT count() FROM readings;

| count |
| --- |
| 10,000,000 |

And then the average reading:

SELECT avg(temp) FROM readings;

| average |
| --- |
| 18.999217780895 |

We can now use the sensors table alongside the readings table to get more interesting results using a JOIN:

SELECT *
FROM readings
JOIN (
    SELECT ID sensId, make, city
    FROM sensors
) ON readings.sensorId = sensId;

The results should look like the table below:

| ID | ts | temp | sensorId | sensId | make | city |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 2019-10-17T00:00:00.000000Z | 16.472200460982 | 3211 | 3211 | Omron | New York |
| 2 | 2019-10-17T00:00:00.100000Z | 16.598432033599 | 2319 | 2319 | Honeywell | San Francisco |
| 3 | 2019-10-17T00:00:00.100000Z | 20.293681747009 | 8723 | 8723 | Honeywell | New York |
| 4 | 2019-10-17T00:00:00.100000Z | 20.939263119843 | 885 | 885 | RS Pro | San Francisco |
| 5 | 2019-10-17T00:00:00.200000Z | 19.336660059029 | 3200 | 3200 | Honeywell | San Francisco |
| 6 | 2019-10-17T00:00:01.100000Z | 20.946643576954 | 4053 | 4053 | Honeywell | Miami |

Note that the joined rows keep the timestamps from readings; the sensors table contributes no timestamp of its own.

Let's try another type of aggregation:

Aggregation keyed by city
SELECT city, max(temp)
FROM readings
JOIN (
    SELECT ID sensId, city
    FROM sensors
) a
ON readings.sensorId = a.sensId;

The results should look like the table below:

| city | max |
| --- | --- |
| New York | 22.999998786398 |
| San Francisco | 22.999998138348 |
| Miami | 22.99999994818 |
| Chicago | 22.999991705861 |
| Boston | 22.999999233377 |

Back to time: given that readings is partitioned by time, let's see what we can do when we JOIN the two tables and aggregate into hourly buckets:

Aggregation by hourly time buckets
SELECT ts, city, make, avg(temp)
FROM readings timestamp(ts)
JOIN (
    SELECT ID sensId, city, make
    FROM sensors
    WHERE city = 'Miami' AND make = 'Omron'
) a
ON readings.sensorId = a.sensId
WHERE ts IN '2019-10-21;1d' -- the interval from 2019-10-21 through the next day
SAMPLE BY 1h -- aggregate into hourly time buckets
ALIGN TO CALENDAR; -- align ts with the start of each hour (hh:00:00)

The results should look like the table below:

| ts | city | make | average |
| --- | --- | --- | --- |
| 2019-10-21T00:00:00.000000Z | Miami | Omron | 20.004285872098 |
| 2019-10-21T01:00:00.000000Z | Miami | Omron | 16.68436714013 |
| 2019-10-21T02:00:00.000000Z | Miami | Omron | 15.243684089291 |
| 2019-10-21T03:00:00.000000Z | Miami | Omron | 17.193984104315 |
| 2019-10-21T04:00:00.000000Z | Miami | Omron | 20.778686822666 |
| ... | ... | ... | ... |

For more information about these statements, please refer to the SELECT, JOIN and SAMPLE BY pages.

Deleting tables

We can now clean up the demo data using the DROP TABLE statement. Be careful with this statement, as QuestDB cannot recover data that is deleted this way:

DROP TABLE readings;
DROP TABLE sensors;
DROP TABLE ticker_price;