Aggregate functions

This page describes the available functions to assist with performing aggregate calculations.

approx_count_distinct

approx_count_distinct(column_name, precision) - estimates the number of distinct non-null values in IPv4, int, or long columns using the HyperLogLog data structure, which provides an approximation rather than an exact count.

The precision of HyperLogLog can be controlled via the optional precision parameter, typically between 4 and 16. A higher precision leads to more accurate results with increased memory usage. The default is 1.

This function is useful within high cardinality datasets where an exact count is not required. Thus consider it the higher cardinality alternative to count_dinstinct.

Parameters

column_name: The name of the column for which to estimate the count of distinct values.
precision (optional): A number specifying the precision of the HyperLogLog algorithm, which influences the trade-off between accuracy and memory usage. A higher precision gives a more accurate estimate, but consumes more memory. Defaults to 1 (lower accuracy, high efficiency).

Return value

Return value type is long.

Examples

Please note that exact example values will vary as they are approximations derived from the HyperLogLog algorithm.

Estimate count of distinct IPv4 addresses with precision 5

SELECT approx_count_distinct(ip_address, 5) FROM logs;

approx_count_distinct
1234567

Estimate count of distinct user_id (int) values by date

SELECT date, approx_count_distinct(user_id) FROM sessions GROUP BY date;

date	approx_count_distinct
2023-01-01	2358
2023-01-02	2491
...	...

Estimate count of distinct product_id values by region

SELECT region, approx_count_distinct(product_id) FROM sales GROUP BY region;

region	approx_count_distinct
North	1589
South	1432
East	1675
West	1543

Estimate count of distinct order_ids with precision 8

SELECT approx_count_distinct(order_id, 8) FROM orders;

approx_count_distinct
3456789

Estimate count of distinct transaction_ids by store_id

SELECT store_id, approx_count_distinct(transaction_id) FROM transactions GROUP BY store_id;

store_id	approx_count_distinct
1	56789
2	67890
...	...

approx_percentile

approx_percentile(value, percentile, precision) calculates the approximate value for the given non-negative column and percentile using the HdrHistogram algorithm.

Arguments

value is any numeric non-negative value.
percentile is a double value between 0.0 and 1.0, inclusive.
precision is an optional int value between 0 and 5, inclusive. This is the number of significant decimal digits to which the histogram will maintain value resolution and separation. For example, when the input column contains integer values between 0 and 3,600,000,000 and the precision is set to 3, value quantization within the range will be no larger than 1/1,000th (or 0.1%) of any value. In this example, the function tracks and analyzes the counts of observed response times ranging between 1 microsecond and 1 hour in magnitude, while maintaining a value resolution of 1 microsecond up to 1 millisecond, a resolution of 1 millisecond (or better) up to one second, and a resolution of 1 second (or better) up to 1,000 seconds. At its maximum tracked value (1 hour), it would still maintain a resolution of 3.6 seconds (or better).

Return value

Return value type is double.

Examples

Approximate percentile

SELECT approx_percentile(latency, 0.99) FROM request_logs;

approx_percentile
101.5

avg

avg(value) calculates simple average of values ignoring missing data (e.g null values).

Arguments

value is any numeric value.

Return value

Return value type is double.

Examples

Average transaction amount

SELECT avg(amount) FROM transactions;

avg
22.4

Average transaction amount by payment_type

SELECT payment_type, avg(amount) FROM transactions;

payment_type	avg
cash	22.1
card	27.4
null	18.02

corr

corr(arg0, arg1) is a function that measures how closely two sets of numbers move in the same direction. It does this by comparing how much each number in each set differs from the average of its set. This calculation is based on Welford's Algorithm.

If the numbers in both sets tend to be above or below their average values at the same time, the function will return a value close to 1.
If one set of numbers tends to be above its average value when the other set is below its average, the function will return a value close to -1.
If there's no clear pattern, the function will return a value close to 0.

Arguments

arg0 is any numeric value representing the first variable
arg1 is any numeric value representing the second variable

Return value

Return value type is double.

Examples

Correlation between price and quantity

SELECT corr(price, quantity) FROM transactions;

corr
0.89

Correlation between price and quantity grouped by payment type

SELECT payment_type, corr(price, quantity) FROM transactions GROUP BY payment_type;

payment_type	avg
cash	0.85
card	0.92
null	0.78

count

count() or count(*) - counts the number of rows irrespective of underlying data.
count(column_name) - counts the number of non-null values in a given column.

Arguments

count() does not require arguments.
count(column_name) - supports the following data types:
- double
- float
- integer
- character
- short
- byte
- timestamp
- date
- long
- long256
- geohash
- varchar
- string
- symbol

Return value

Return value type is long.

Examples

Count of rows in the transactions table:

SELECT count() FROM transactions;

count
100

Count of rows in the transactions table aggregated by the payment_type value:

SELECT payment_type, count() FROM transactions;

payment_type	count
cash	25
card	70
null	5

Count non-null transaction amounts:

SELECT count(amount) FROM transactions;

count
95

Count non-null transaction amounts by payment_type:

SELECT payment_type, count(amount) FROM transactions;

payment_type	count
cash	24
card	67
null	4

note

null values are aggregated with count(), but not with count(column_name)

count_distinct

count_distinct(column_name) - counts distinct non-null values in varchar, symbol, long256, UUID, IPv4, long, int or string columns.

Return value

Return value type is long.

Examples

Count of distinct sides in the transactions table. Side column can either be BUY or SELL or null.

SELECT count_distinct(side) FROM transactions;

count_distinct
2

Count of distinct counterparties in the transactions table aggregated by payment_type value.

SELECT payment_type, count_distinct(counterparty) FROM transactions;

payment_type	count_distinct
cash	3
card	23
null	5

covar_pop

covar_pop(arg0, arg1) is a function that measures how much two sets of numbers change together. It does this by looking at how much each number in each set differs from the average of its set. It multiplies these differences together, adds them all up, and then divides by the total number of pairs. This gives a measure of the overall trend.

If the numbers in both sets tend to be above or below their average values at the same time, the function will return a positive number.
If one set of numbers tends to be above its average value when the other set is below its average, the function will return a negative number.
The closer the result is to zero, the less relationship there is between the two sets of numbers.

Arguments

arg0 is any numeric value representing the first variable
arg1 is any numeric value representing the second variable.

Return value

Return value type is double.

Examples

Population covariance between price and quantity

SELECT covar_pop(price, quantity) FROM transactions;

covar_pop
15.2

Population covariance between price and quantity grouped by payment type

SELECT payment_type, covar_pop(price, quantity) FROM transactions GROUP BY payment_type;

payment_type	covar_pop
cash	14.8
card	16.2
null	13.5

covar_samp

covar_samp(arg0, arg1) is a function that finds the relationship between two sets of numbers. It does this by looking at how much the numbers vary from the average in each set.

If the numbers in both sets tend to be above or below their average values at the same time, the function will return a positive number.
If one set of numbers tends to be above its average value when the other set is below its average, the function will return a negative number.
The closer the result is to zero, the less relationship there is between the two sets of numbers.

Arguments

arg0 is any numeric value representing the first variable.
arg1 is any numeric value representing the second variable.

Return value

Return value type is double.

Examples

Sample covariance between price and quantity

SELECT covar_samp(price, quantity) FROM transactions;

covar_samp
15.8

Sample covariance between price and quantity grouped by payment type

SELECT payment_type, covar_samp(price, quantity) FROM transactions GROUP BY payment_type;

payment_type	covar_samp
cash	15.4
card	16.8
null	14.1

first/last

first(column_name) - returns the first value of a column.
last(column_name) - returns the last value of a column.

Supported column datatype: double, float, integer, IPv4, character, short, byte, timestamp, date, long, geohash, symbol, varchar and uuid.

If a table has a designated timestamp, then the first row is always the row with the lowest timestamp (oldest) and the last row is always the one with the highest (latest) timestamp. For a table without a designated timestamp column, first returns the first row and last returns the last inserted row, regardless of any timestamp column.

Return value

Return value type is the same as the type of the argument.

Examples

Given a table sensors, which has a designated timestamp column:

device_id	temperature	ts
arduino-01	12	2021-06-02T14:33:19.970258Z
arduino-02	10	2021-06-02T14:33:21.703934Z
arduino-03	18	2021-06-02T14:33:23.707013Z

The following query returns oldest value for the device_id column:

SELECT first(device_id) FROM sensors;

first
arduino-01

The following query returns the latest symbol value for the device_id column:

SELECT last(device_id) FROM sensors;

last
arduino-03

Without selecting a designated timestamp column, the table may be unordered and the query may return different result. Given an unordered table sensors_unordered:

device_id	temperature	ts
arduino-01	12	2021-06-02T14:33:19.970258Z
arduino-03	18	2021-06-02T14:33:23.707013Z
arduino-02	10	2021-06-02T14:33:21.703934Z

The following query returns the first record for the device_id column:

SELECT first(device_id) FROM sensors_unordered;

first
arduino-01

The following query returns the last record for the device_id column:

SELECT last(device_id) FROM sensors_unordered;

last
arduino-02

first_not_null

first_not_null(column_name) - returns the first non-null value of a column.

Supported column datatype: double, float, integer, IPv4, char, short, byte, timestamp, date, long, geohash, symbol, varchar and uuid.

If a table has a designated timestamp, then the first non-null row is always the row with the lowest timestamp (oldest). For a table without a designated timestamp column, first_not_null returns the first non-null row, regardless of any timestamp column.

Return value

Return value type is the same as the type of the argument.

Examples

Given a table sensors, which has a designated timestamp column:

device_id	temperature	ts
null	12	2021-06-02T14:33:19.970258Z
arduino-02	10	2021-06-02T14:33:21.703934Z
arduino-03	18	2021-06-02T14:33:23.707013Z

The following query returns oldest non-null value for the device_id column:

SELECT first_not_null(device_id) FROM sensors;

first_not_null
arduino-02

Without selecting a designated timestamp column, the table may be unordered and the query may return different result. Given an unordered table sensors_unordered:

device_id	temperature	ts
null	12	2021-06-02T14:33:19.970258Z
arduino-03	18	2021-06-02T14:33:23.707013Z
arduino-02	10	2021-06-02T14:33:21.703934Z

The following query returns the first non-null record for the device_id column:

SELECT first_not_null(device_id) FROM sensors_unordered;

first_not_null
arduino-03

last_not_null

last_not_null(column_name) - returns the last non-null value of a column.

Supported column datatype: double, float, integer, IPv4, char, short, byte, timestamp, date, long, geohash, symbol, varchar and uuid.

If a table has a designated timestamp, then the last non-null row is always the row with the highest timestamp (most recent). For a table without a designated timestamp column, last_not_null returns the last non-null row, regardless of any timestamp column.

Return value

Return value type is the same as the type of the argument.

Examples

Given a table sensors, which has a designated timestamp column:

device_id	temperature	ts
null	12	2021-06-02T14:33:19.970258Z
arduino-02	10	2021-06-02T14:33:21.703934Z
arduino-03	18	2021-06-02T14:33:23.707013Z

The following query returns most recent non-null value for the device_id column:

SELECT last_not_null(device_id) FROM sensors;

last_not_null
arduino-03

Without selecting a designated timestamp column, the table may be unordered and the query may return different result. Given an unordered table sensors_unordered:

device_id	temperature	ts
null	12	2021-06-02T14:33:19.970258Z
arduino-03	18	2021-06-02T14:33:23.707013Z
arduino-02	10	2021-06-02T14:33:21.703934Z

The following query returns the last non-null record for the device_id column:

SELECT last_not_null(device_id) FROM sensors_unordered;

last_not_null
arduino-02

haversine_dist_deg

haversine_dist_deg(lat, lon, ts) - calculates the traveled distance for a series of latitude and longitude points.

Arguments

lat is the latitude expressed as degrees in decimal format (double)
lon is the longitude expressed as degrees in decimal format (double)
ts is the timestamp for the data point

Return value

Return value type is double.

Examples

Calculate the aggregate traveled distance for each car_id

SELECT car_id, haversine_dist_deg(lat, lon, k)
  FROM table rides

ksum

ksum(value) - adds values ignoring missing data (e.g null values). Values are added using the

Kahan compensated sum algorithm. This is only beneficial for floating-point values such as float or double.

Arguments

value is any numeric value.

Return value

Return value type is the same as the type of the argument.

Examples

SELECT ksum(a)
FROM (SELECT rnd_double() a FROM long_sequence(100));

ksum
52.79143968514029

max

max(value) - returns the highest value ignoring missing data (e.g null values).

Arguments

value is any numeric or string value

Return value

Return value type is the same as the type of the argument.

Examples

Highest transaction amount

SELECT max(amount) FROM transactions;

max
55.3

Highest transaction amount by payment_type

SELECT payment_type, max(amount) FROM transactions;

payment_type	amount
cash	31.5
card	55.3
null	29.2

min

min(value) - returns the lowest value ignoring missing data (e.g null values).

Arguments

value is any numeric or string value

Return value

Return value type is the same as the type of the argument.

Examples

Lowest transaction amount

SELECT min(amount) FROM transactions;

min
12.5

Lowest transaction amount, by payment_type

SELECT payment_type, min(amount) FROM transactions;

payment_type	min
cash	12.5
card	15.3
null	22.2

nsum

nsum(value) - adds values ignoring missing data (e.g null values). Values are added using the Neumaier sum algorithm. This is only beneficial for floating-point values such as float or double.

Arguments

value is any numeric value.

Return value

Return value type is double.

Examples

SELECT nsum(a)
FROM (SELECT rnd_double() a FROM long_sequence(100));

nsum
49.5442334742831

stddev / stddev_samp

stddev_samp(value) - Calculates the sample standard deviation of a set of values, ignoring missing data (e.g., null values). The sample standard deviation is a measure of the amount of variation or dispersion in a sample of a population. A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range.

stddev is an alias for stddev_samp.

Arguments

value is any numeric value.

Return value

Return value type is double.

Examples

SELECT stddev_samp(x)
FROM (SELECT x FROM long_sequence(100));

stddev_samp
29.011491975882

stddev_pop

stddev_pop(value) - Calculates the population standard deviation of a set of values. The population standard deviation is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range.

Arguments

value is any numeric value.

Return value

Return value type is double.

Examples

SELECT stddev_pop(x)
FROM (SELECT x FROM long_sequence(100));

stddev_samp
28.86607004772212

string_agg

string_agg(value, delimiter) - Concatenates the given string values into a single string with the delimiter used as a value separator.

Arguments

value is a varchar value.
delimiter is a char value.

Return value

Return value type is varchar.

Examples

SELECT string_agg(x::varchar, ',')
FROM (SELECT x FROM long_sequence(5));

string_agg
1,2,3,4,5

sum

sum(value) - adds values ignoring missing data (e.g null values).

Arguments

value is any numeric value.

Return value

Return value type is the same as the type of the argument.

Examples

Sum all quantities in the transactions table

SELECT sum(quantity) FROM transactions;

sum
100

Sum all quantities in the transactions table, aggregated by item

SELECT item, sum(quantity) FROM transactions;

item	count
apple	53
orange	47

Overflow

sum does not perform overflow check. To avoid overflow, you can cast the argument to wider type.

Cast as long to avoid overflow

SELECT sum(cast(a AS LONG)) FROM table;

variance / var_samp

var_samp(value) - Calculates the sample variance of a set of values. The sample variance is a measure of the amount of variation or dispersion of a set of values in a sample from a population. A low variance indicates that the values tend to be very close to the mean, while a high variance indicates that the values are spread out over a wider range.

variance() is an alias for var_samp.

Arguments

value is any numeric value.

Return value

Return value type is double.

Examples

SELECT var_samp(x)
FROM (SELECT x FROM long_sequence(100));

stddev_samp
841.666666666666

var_pop

var_pop(value) - Calculates the population variance of a set of values. The population variance is a measure of the amount of variation or dispersion of a set of values. A low variance indicates that the values tend to be very close to the mean, while a high variance indicates that the values are spread out over a wider range.

Arguments

value is any numeric value.

Return value

Return value type is double.

Examples

SELECT var_pop(x)
FROM (SELECT x FROM long_sequence(100));

stddev_samp
833.25

approx_count_distinct​

Parameters​

Return value​

Examples​

approx_percentile​

Arguments​

Return value​

Examples​

avg​

Arguments​

Return value​

Examples​

corr​

Arguments​

Return value​

Examples​

count​

Arguments​

Return value​

Examples​

count_distinct​

Return value​

Examples​

covar_pop​

Arguments​

Return value​

Examples​

covar_samp​

Arguments​

Return value​

Examples​

first/last​

Return value​

Examples​

first_not_null​

Return value​

Examples​

last_not_null​

Return value​

Examples​

haversine_dist_deg​

Arguments​

Return value​

Examples​

ksum​

Arguments​

Return value​

Examples​

max​

Arguments​

Return value​

Examples​

min​

Arguments​

Return value​

Examples​

nsum​

Arguments​

Return value​

Examples​

stddev / stddev_samp​

Arguments​

Return value​

Examples​

stddev_pop​

Arguments​

Return value​

Examples​

string_agg​

Arguments​

Return value​

Examples​

sum​

Arguments​

Return value​

Examples​

Overflow​

variance / var_samp​

Arguments​

Return value​

approx_count_distinct

Parameters

Return value

Examples

approx_percentile

Arguments

Return value

Examples

avg

Arguments

Return value

Examples

corr

Arguments

Return value

Examples

count

Arguments

Return value

Examples

count_distinct

Return value

Examples

covar_pop

Arguments

Return value

Examples

covar_samp

Arguments

Return value

Examples

first/last

Return value

Examples

first_not_null

Return value

Examples

last_not_null

Return value

Examples

haversine_dist_deg

Arguments

Return value

Examples

ksum

Arguments

Return value

Examples

max

Arguments

Return value

Examples

min

Arguments

Return value

Examples

nsum

Arguments

Return value

Examples

stddev / stddev_samp

Arguments

Return value

Examples

stddev_pop

Arguments

Return value

Examples

string_agg

Arguments

Return value

Examples

sum

Arguments

Return value

Examples

Overflow

variance / var_samp

Arguments

Return value