Aggregate functions
This page describes the available functions to assist with performing aggregate calculations.
approx_count_distinct
approx_count_distinct(column_name, precision)
 estimates the number of
distinct nonnull
values in IPv4
, int
, or long
columns using the
HyperLogLog data structure, which provides an
approximation rather than an exact count.
The precision of HyperLogLog can be controlled via the optional precision
parameter, typically between 4 and 16. A higher precision leads to more accurate
results with increased memory usage. The default is 1.
This function is useful within high cardinality datasets where an exact count is
not required. Thus consider it the higher cardinality alternative to
count_dinstinct
.
Parameters
column_name
: The name of the column for which to estimate the count of distinct values.precision
(optional): A number specifying the precision of the HyperLogLog algorithm, which influences the tradeoff between accuracy and memory usage. A higher precision gives a more accurate estimate, but consumes more memory. Defaults to 1 (lower accuracy, high efficiency).
Return value
Return value type is long
.
Examples
Please note that exact example values will vary as they are approximations derived from the HyperLogLog algorithm.
SELECT approx_count_distinct(ip_address, 5) FROM logs;
approx_count_distinct 

1234567 
SELECT date, approx_count_distinct(user_id) FROM sessions GROUP BY date;
date  approx_count_distinct 

20230101  2358 
20230102  2491 
...  ... 
SELECT region, approx_count_distinct(product_id) FROM sales GROUP BY region;
region  approx_count_distinct 

North  1589 
South  1432 
East  1675 
West  1543 
SELECT approx_count_distinct(order_id, 8) FROM orders;
approx_count_distinct 

3456789 
SELECT store_id, approx_count_distinct(transaction_id) FROM transactions GROUP BY store_id;
store_id  approx_count_distinct 

1  56789 
2  67890 
...  ... 
approx_percentile
approx_percentile(value, percentile, precision)
calculates the approximate
value for the given nonnegative column and percentile using the
HdrHistogram algorithm.
Arguments
value
is any numeric nonnegative value.percentile
is adouble
value between 0.0 and 1.0, inclusive.precision
is an optionalint
value between 0 and 5, inclusive. This is the number of significant decimal digits to which the histogram will maintain value resolution and separation. For example, when the input column contains integer values between 0 and 3,600,000,000 and the precision is set to 3, value quantization within the range will be no larger than 1/1,000th (or 0.1%) of any value. In this example, the function tracks and analyzes the counts of observed response times ranging between 1 microsecond and 1 hour in magnitude, while maintaining a value resolution of 1 microsecond up to 1 millisecond, a resolution of 1 millisecond (or better) up to one second, and a resolution of 1 second (or better) up to 1,000 seconds. At its maximum tracked value (1 hour), it would still maintain a resolution of 3.6 seconds (or better).
Return value
Return value type is double
.
Examples
SELECT approx_percentile(latency, 0.99) FROM request_logs;
approx_percentile 

101.5 
avg
avg(value)
calculates simple average of values ignoring missing data (e.g
null
values).
Arguments
value
is any numeric value.
Return value
Return value type is double
.
Examples
SELECT avg(amount) FROM transactions;
avg 

22.4 
SELECT payment_type, avg(amount) FROM transactions;
payment_type  avg 

cash  22.1 
card  27.4 
null  18.02 
corr
corr(arg0, arg1)
is a function that measures how closely two sets of numbers
move in the same direction. It does this by comparing how much each number in
each set differs from the average of its set. This calculation is based on
Welford's Algorithm.

If the numbers in both sets tend to be above or below their average values at the same time, the function will return a value close to 1.

If one set of numbers tends to be above its average value when the other set is below its average, the function will return a value close to 1.

If there's no clear pattern, the function will return a value close to 0.
Arguments
arg0
is any numeric value representing the first variablearg1
is any numeric value representing the second variable
Return value
Return value type is double
.
Examples
SELECT corr(price, quantity) FROM transactions;
corr 

0.89 
SELECT payment_type, corr(price, quantity) FROM transactions GROUP BY payment_type;
payment_type  avg 

cash  0.85 
card  0.92 
null  0.78 
count
count()
orcount(*)
 counts the number of rows irrespective of underlying data.count(column_name)
 counts the number of nonnull values in a given column.
Arguments
count()
does not require arguments.count(column_name)
 supports the following data types:double
float
integer
character
short
byte
timestamp
date
long
long256
geohash
varchar
string
symbol
Return value
Return value type is long
.
Examples
Count of rows in the transactions
table:
SELECT count() FROM transactions;
count 

100 
Count of rows in the transactions
table aggregated by the payment_type
value:
SELECT payment_type, count() FROM transactions;
payment_type  count 

cash  25 
card  70 
null  5 
Count nonnull transaction amounts:
SELECT count(amount) FROM transactions;
count 

95 
Count nonnull transaction amounts by payment_type
:
SELECT payment_type, count(amount) FROM transactions;
payment_type  count 

cash  24 
card  67 
null  4 
null
values are aggregated with count()
, but not with count(column_name)
count_distinct
count_distinct(column_name)
 counts distinct nonnull
values in varchar
,
symbol
, long256
, UUID
, IPv4
, long
, int
or string
columns.
Return value
Return value type is long
.
Examples
 Count of distinct sides in the transactions table. Side column can either be
BUY
orSELL
ornull
.
SELECT count_distinct(side) FROM transactions;
count_distinct 

2 
 Count of distinct counterparties in the transactions table aggregated by
payment_type
value.
SELECT payment_type, count_distinct(counterparty) FROM transactions;
payment_type  count_distinct 

cash  3 
card  23 
null  5 
covar_pop
covar_pop(arg0, arg1)
is a function that measures how much two sets of numbers
change together. It does this by looking at how much each number in each set
differs from the average of its set. It multiplies these differences together,
adds them all up, and then divides by the total number of pairs. This gives a
measure of the overall trend.

If the numbers in both sets tend to be above or below their average values at the same time, the function will return a positive number.

If one set of numbers tends to be above its average value when the other set is below its average, the function will return a negative number.

The closer the result is to zero, the less relationship there is between the two sets of numbers.
Arguments
arg0
is any numeric value representing the first variablearg1
is any numeric value representing the second variable.
Return value
Return value type is double
.
Examples
SELECT covar_pop(price, quantity) FROM transactions;
covar_pop 

15.2 
SELECT payment_type, covar_pop(price, quantity) FROM transactions GROUP BY payment_type;
payment_type  covar_pop 

cash  14.8 
card  16.2 
null  13.5 
covar_samp
covar_samp(arg0, arg1)
is a function that finds the relationship between two
sets of numbers. It does this by looking at how much the numbers vary from the
average in each set.

If the numbers in both sets tend to be above or below their average values at the same time, the function will return a positive number.

If one set of numbers tends to be above its average value when the other set is below its average, the function will return a negative number.

The closer the result is to zero, the less relationship there is between the two sets of numbers.
Arguments
arg0
is any numeric value representing the first variable.arg1
is any numeric value representing the second variable.
Return value
Return value type is double
.
Examples
SELECT covar_samp(price, quantity) FROM transactions;
covar_samp 

15.8 
SELECT payment_type, covar_samp(price, quantity) FROM transactions GROUP BY payment_type;
payment_type  covar_samp 

cash  15.4 
card  16.8 
null  14.1 
first/last
first(column_name)
 returns the first value of a column.last(column_name)
 returns the last value of a column.
Supported column datatype: double
, float
, integer
, IPv4
, character
,
short
, byte
, timestamp
, date
, long
, geohash
, symbol
, varchar
and uuid
.
If a table has a designated timestamp,
then the first row is always the row with the lowest timestamp (oldest) and the
last row is always the one with the highest (latest) timestamp. For a table
without a designated timestamp column, first
returns the first row and last
returns the last inserted row, regardless of any timestamp column.
Return value
Return value type is the same as the type of the argument.
Examples
Given a table sensors
, which has a designated timestamp column:
device_id  temperature  ts 

arduino01  12  20210602T14:33:19.970258Z 
arduino02  10  20210602T14:33:21.703934Z 
arduino03  18  20210602T14:33:23.707013Z 
The following query returns oldest value for the device_id
column:
SELECT first(device_id) FROM sensors;
first 

arduino01 
The following query returns the latest symbol value for the device_id
column:
SELECT last(device_id) FROM sensors;
last 

arduino03 
Without selecting a designated timestamp column, the table may be unordered and
the query may return different result. Given an unordered table
sensors_unordered
:
device_id  temperature  ts 

arduino01  12  20210602T14:33:19.970258Z 
arduino03  18  20210602T14:33:23.707013Z 
arduino02  10  20210602T14:33:21.703934Z 
The following query returns the first record for the device_id
column:
SELECT first(device_id) FROM sensors_unordered;
first 

arduino01 
The following query returns the last record for the device_id
column:
SELECT last(device_id) FROM sensors_unordered;
last 

arduino02 
first_not_null
first_not_null(column_name)
 returns the first nonnull value of a column.
Supported column datatype: double
, float
, integer
, IPv4
, char
,
short
, byte
, timestamp
, date
, long
, geohash
, symbol
, varchar
and uuid
.
If a table has a designated timestamp, then the first nonnull row is always the
row with the lowest timestamp (oldest). For a table without a designated
timestamp column, first_not_null
returns the first nonnull row, regardless of
any timestamp column.
Return value
Return value type is the same as the type of the argument.
Examples
Given a table sensors
, which has a designated timestamp column:
device_id  temperature  ts 

null  12  20210602T14:33:19.970258Z 
arduino02  10  20210602T14:33:21.703934Z 
arduino03  18  20210602T14:33:23.707013Z 
The following query returns oldest nonnull value for the device_id column:
SELECT first_not_null(device_id) FROM sensors;
first_not_null 

arduino02 
Without selecting a designated timestamp column, the table may be unordered and
the query may return different result. Given an unordered table
sensors_unordered
:
device_id  temperature  ts 

null  12  20210602T14:33:19.970258Z 
arduino03  18  20210602T14:33:23.707013Z 
arduino02  10  20210602T14:33:21.703934Z 
The following query returns the first nonnull record for the device_id column:
SELECT first_not_null(device_id) FROM sensors_unordered;
first_not_null 

arduino03 
last_not_null
last_not_null(column_name)
 returns the last nonnull value of a column.
Supported column datatype: double
, float
, integer
, IPv4
, char
,
short
, byte
, timestamp
, date
, long
, geohash
, symbol
, varchar
and uuid
.
If a table has a designated timestamp, then the last nonnull row is always the
row with the highest timestamp (most recent). For a table without a designated
timestamp column, last_not_null
returns the last nonnull row, regardless of
any timestamp column.
Return value
Return value type is the same as the type of the argument.
Examples
Given a table sensors
, which has a designated timestamp column:
device_id  temperature  ts 

null  12  20210602T14:33:19.970258Z 
arduino02  10  20210602T14:33:21.703934Z 
arduino03  18  20210602T14:33:23.707013Z 
The following query returns most recent nonnull value for the device_id column:
SELECT last_not_null(device_id) FROM sensors;
last_not_null 

arduino03 
Without selecting a designated timestamp column, the table may be unordered and
the query may return different result. Given an unordered table
sensors_unordered
:
device_id  temperature  ts 

null  12  20210602T14:33:19.970258Z 
arduino03  18  20210602T14:33:23.707013Z 
arduino02  10  20210602T14:33:21.703934Z 
The following query returns the last nonnull record for the device_id
column:
SELECT last_not_null(device_id) FROM sensors_unordered;
last_not_null 

arduino02 
haversine_dist_deg
haversine_dist_deg(lat, lon, ts)
 calculates the traveled distance for a
series of latitude and longitude points.
Arguments
lat
is the latitude expressed as degrees in decimal format (double
)lon
is the longitude expressed as degrees in decimal format (double
)ts
is thetimestamp
for the data point
Return value
Return value type is double
.
Examples
SELECT car_id, haversine_dist_deg(lat, lon, k)
FROM table rides
ksum
ksum(value)
 adds values ignoring missing data (e.g null
values). Values
are added using the
Kahan compensated sum algorithm.
This is only beneficial for floatingpoint values such as float
or double
.
Arguments
value
is any numeric value.
Return value
Return value type is the same as the type of the argument.
Examples
SELECT ksum(a)
FROM (SELECT rnd_double() a FROM long_sequence(100));
ksum 

52.79143968514029 
max
max(value)
 returns the highest value ignoring missing data (e.g null
values).
Arguments
value
is any numeric or string value
Return value
Return value type is the same as the type of the argument.
Examples
SELECT max(amount) FROM transactions;
max 

55.3 
SELECT payment_type, max(amount) FROM transactions;
payment_type  amount 

cash  31.5 
card  55.3 
null  29.2 
min
min(value)
 returns the lowest value ignoring missing data (e.g null
values).
Arguments
value
is any numeric or string value
Return value
Return value type is the same as the type of the argument.
Examples
SELECT min(amount) FROM transactions;
min 

12.5 
SELECT payment_type, min(amount) FROM transactions;
payment_type  min 

cash  12.5 
card  15.3 
null  22.2 
nsum
nsum(value)
 adds values ignoring missing data (e.g null
values). Values
are added using the
Neumaier sum algorithm.
This is only beneficial for floatingpoint values such as float
or double
.
Arguments
value
is any numeric value.
Return value
Return value type is double
.
Examples
SELECT nsum(a)
FROM (SELECT rnd_double() a FROM long_sequence(100));
nsum 

49.5442334742831 
stddev / stddev_samp
stddev_samp(value)
 Calculates the sample standard deviation of a set of
values, ignoring missing data (e.g., null values). The sample standard deviation
is a measure of the amount of variation or dispersion in a sample of a
population. A low standard deviation indicates that the values tend to be close
to the mean of the set, while a high standard deviation indicates that the
values are spread out over a wider range.
stddev
is an alias for stddev_samp
.
Arguments
value
is any numeric value.
Return value
Return value type is double
.
Examples
SELECT stddev_samp(x)
FROM (SELECT x FROM long_sequence(100));
stddev_samp 

29.011491975882 
stddev_pop
stddev_pop(value)
 Calculates the population standard deviation of a set of
values. The population standard deviation is a measure of the amount of
variation or dispersion of a set of values. A low standard deviation indicates
that the values tend to be close to the mean of the set, while a high standard
deviation indicates that the values are spread out over a wider range.
Arguments
value
is any numeric value.
Return value
Return value type is double
.
Examples
SELECT stddev_pop(x)
FROM (SELECT x FROM long_sequence(100));
stddev_samp 

28.86607004772212 
string_agg
string_agg(value, delimiter)
 Concatenates the given string values into a
single string with the delimiter used as a value separator.
Arguments
value
is avarchar
value.delimiter
is achar
value.
Return value
Return value type is varchar
.
Examples
SELECT string_agg(x::varchar, ',')
FROM (SELECT x FROM long_sequence(5));
string_agg 

1,2,3,4,5 
sum
sum(value)
 adds values ignoring missing data (e.g null
values).
Arguments
value
is any numeric value.
Return value
Return value type is the same as the type of the argument.
Examples
SELECT sum(quantity) FROM transactions;
sum 

100 
SELECT item, sum(quantity) FROM transactions;
item  count 

apple  53 
orange  47 
Overflow
sum
does not perform overflow check. To avoid overflow, you can cast the
argument to wider type.
SELECT sum(cast(a AS LONG)) FROM table;
variance / var_samp
var_samp(value)
 Calculates the sample variance of a set of values. The
sample variance is a measure of the amount of variation or dispersion of a set
of values in a sample from a population. A low variance indicates that the
values tend to be very close to the mean, while a high variance indicates that
the values are spread out over a wider range.
variance()
is an alias for var_samp
.
Arguments
value
is any numeric value.
Return value
Return value type is double
.
Examples
SELECT var_samp(x)
FROM (SELECT x FROM long_sequence(100));
stddev_samp 

841.666666666666 
var_pop
var_pop(value)
 Calculates the population variance of a set of values. The
population variance is a measure of the amount of variation or dispersion of a
set of values. A low variance indicates that the values tend to be very close to
the mean, while a high variance indicates that the values are spread out over a
wider range.
Arguments
value
is any numeric value.
Return value
Return value type is double
.
Examples
SELECT var_pop(x)
FROM (SELECT x FROM long_sequence(100));
stddev_samp 

833.25 