Aggregate functions
This page describes the available functions to assist with performing aggregate calculations.
#
approx_percentileapprox_percentile(value, percentile, precision)
calculates the approximate
value for the given non-negative column and percentile using the
HdrHistogram algorithm.
#
Argumentsvalue
is any numeric non-negative value.percentile
is adouble
value between 0.0 and 1.0, inclusive.precision
is an optionalint
value between 0 and 5, inclusive. This is the number of significant decimal digits to which the histogram will maintain value resolution and separation. For example, when the input column contains integer values between 0 and 3,600,000,000 and the precision is set to 3, value quantization within the range will be no larger than 1/1,000th (or 0.1%) of any value. In this example, the function tracks and analyzes the counts of observed response times ranging between 1 microsecond and 1 hour in magnitude, while maintaining a value resolution of 1 microsecond up to 1 millisecond, a resolution of 1 millisecond (or better) up to one second, and a resolution of 1 second (or better) up to 1,000 seconds. At its maximum tracked value (1 hour), it would still maintain a resolution of 3.6 seconds (or better).
#
Return valueReturn value type is double
.
#
Examplesapprox_percentile |
---|
101.5 |
#
avgavg(value)
calculates simple average of values ignoring missing data (e.g
null
values).
#
Argumentsvalue
is any numeric value.
#
Return valueReturn value type is double
.
#
Examplesavg |
---|
22.4 |
payment_type | avg |
---|---|
cash | 22.1 |
card | 27.4 |
null | 18.02 |
#
countcount()
orcount(*)
- counts the number of rows irrespective of underlying data.count(column_name)
- counts the number of non-null values in a given column.
#
Argumentscount()
does not require arguments.count(column_name)
- supports the following data types:double
float
integer
character
short
byte
timestamp
date
long
long256
geohash
string
symbol
#
Return valueReturn value type is long
.
#
ExamplesCount of rows in the transactions
table:
count |
---|
100 |
Count of rows in the transactions
table aggregated by the payment_type
value:
payment_type | count |
---|---|
cash | 25 |
card | 70 |
null | 5 |
Count non-null transaction amounts:
count |
---|
95 |
Count non-null transaction amounts by payment_type
:
payment_type | count |
---|---|
cash | 24 |
card | 67 |
null | 4 |
note
null
values are aggregated with count()
, but not with count(column_name)
#
count_distinctcount_distinct(column_name)
- counts distinct values in string
, symbol
,
long256
, long
, or int
columns.
#
Return valueReturn value type is long
.
#
Examples- Count of distinct sides in the transactions table. Side column can either be
BUY
orSELL
ornull
count_distinct |
---|
2 |
- Count of distinct counterparties in the transactions table aggregated by
payment_type
value.
payment_type | count_distinct |
---|---|
cash | 3 |
card | 23 |
null | 5 |
note
null
values are not counted in the count_distinct
function.
#
first/lastfirst(column_name)
- returns the first value of a column.last(column_name)
- returns the last value of a column.
Supported column datatype: double
, float
, integer
, IPv4
, character
,
short
, byte
, timestamp
, date
, long
, geohash
, symbol
, uuid
.
If a table has a designated timestamp,
then the first row is always the row with the lowest timestamp (oldest) and the
last row is always the one with the highest (latest) timestamp. For a table
without a designated timestamp column, first
returns the first row and last
returns the last inserted row, regardless of any timestamp column.
#
Return valueReturn value type is string
.
#
ExamplesGiven a table sensors
, which has a designated timestamp column:
device_id | temperature | ts |
---|---|---|
arduino-01 | 12 | 2021-06-02T14:33:19.970258Z |
arduino-02 | 10 | 2021-06-02T14:33:21.703934Z |
arduino-03 | 18 | 2021-06-02T14:33:23.707013Z |
The following query returns oldest value for the device_id
column:
first |
---|
arduino-01 |
The following query returns the latest symbol value for the device_id
column:
last |
---|
arduino-03 |
Without selecting a designated timestamp column, the table may be unordered and
the query may return different result. Given an unordered table
sensors_unordered
:
device_id | temperature | ts |
---|---|---|
arduino-01 | 12 | 2021-06-02T14:33:19.970258Z |
arduino-03 | 18 | 2021-06-02T14:33:23.707013Z |
arduino-02 | 10 | 2021-06-02T14:33:21.703934Z |
The following query returns the first record for the device_id
column:
first |
---|
arduino-01 |
The following query returns the last record for the device_id
column:
last |
---|
arduino-02 |
#
first_not_nullfirst_not_null(column_name)
- returns the first non-null value of a column.
Supported column datatype: double
, float
, integer
, IPv4
, char
,
short
, byte
, timestamp
, date
, long
, geohash
, symbol
and uuid
.
If a table has a designated timestamp, then the first non-null row is always the
row with the lowest timestamp (oldest). For a table without a designated
timestamp column, first_not_null
returns the first non-null row, regardless of
any timestamp column.
#
Return valueReturn value type is string
.
#
ExamplesGiven a table sensors
, which has a designated timestamp column:
device_id | temperature | ts |
---|---|---|
null | 12 | 2021-06-02T14:33:19.970258Z |
arduino-02 | 10 | 2021-06-02T14:33:21.703934Z |
arduino-03 | 18 | 2021-06-02T14:33:23.707013Z |
The following query returns oldest non-null value for the device_id column:
first_not_null |
---|
arduino-02 |
Without selecting a designated timestamp column, the table may be unordered and
the query may return different result. Given an unordered table
sensors_unordered
:
device_id | temperature | ts |
---|---|---|
null | 12 | 2021-06-02T14:33:19.970258Z |
arduino-03 | 18 | 2021-06-02T14:33:23.707013Z |
arduino-02 | 10 | 2021-06-02T14:33:21.703934Z |
The following query returns the first non-null record for the device_id column:
first_not_null |
---|
arduino-03 |
#
last_not_nulllast_not_null(column_name)
- returns the last non-null value of a column.
Supported column datatype: double
, float
, integer
, IPv4
, char
,
short
, byte
, timestamp
, date
, long
, geohash
, symbol
and uuid
.
If a table has a designated timestamp, then the last non-null row is always the
row with the highest timestamp (most recent). For a table without a designated
timestamp column, last_not_null
returns the last non-null row, regardless of
any timestamp column.
#
Return valueReturn value type is string
.
#
ExamplesGiven a table sensors
, which has a designated timestamp column:
device_id | temperature | ts |
---|---|---|
null | 12 | 2021-06-02T14:33:19.970258Z |
arduino-02 | 10 | 2021-06-02T14:33:21.703934Z |
arduino-03 | 18 | 2021-06-02T14:33:23.707013Z |
The following query returns most recent non-null value for the device_id column:
last_not_null |
---|
arduino-03 |
Without selecting a designated timestamp column, the table may be unordered and
the query may return different result. Given an unordered table
sensors_unordered
:
device_id | temperature | ts |
---|---|---|
null | 12 | 2021-06-02T14:33:19.970258Z |
arduino-03 | 18 | 2021-06-02T14:33:23.707013Z |
arduino-02 | 10 | 2021-06-02T14:33:21.703934Z |
The following query returns the last non-null record for the device_id
column:
last_not_null |
---|
arduino-02 |
#
haversine_dist_deghaversine_dist_deg(lat, lon, ts)
- calculates the traveled distance for a
series of latitude and longitude points.
#
Argumentslat
is the latitude expressed as degrees in decimal format (double
)lon
is the longitude expressed as degrees in decimal format (double
)ts
is thetimestamp
for the data point
#
Return valueReturn value type is double
.
#
Examples#
ksumksum(value)
- adds values ignoring missing data (e.g null
values). Values
are added using the
Kahan compensated sum algorithm.
This is only beneficial for floating-point values such as float
or double
.
#
Argumentsvalue
is any numeric value.
#
Return valueReturn value type is the same as the type of the argument.
#
Examplesksum |
---|
52.79143968514029 |
#
maxmax(value)
- returns the highest value ignoring missing data (e.g null
values).
#
Argumentsvalue
is any numeric or string value
#
Return valueReturn value type is the same as the type of the argument.
#
Examplesmax |
---|
55.3 |
payment_type | amount |
---|---|
cash | 31.5 |
card | 55.3 |
null | 29.2 |
#
minmin(value)
- returns the lowest value ignoring missing data (e.g null
values).
#
Argumentsvalue
is any numeric or string value
#
Return valueReturn value type is the same as the type of the argument.
#
Examplesmin |
---|
12.5 |
payment_type | min |
---|---|
cash | 12.5 |
card | 15.3 |
null | 22.2 |
#
nsumnsum(value)
- adds values ignoring missing data (e.g null
values). Values
are added using the
Neumaier sum algorithm.
This is only beneficial for floating-point values such as float
or double
.
#
Argumentsvalue
is any numeric value.
#
Return valueReturn value type is double
.
#
Examplesnsum |
---|
49.5442334742831 |
#
stddev / stddev_sampstddev_samp(value)
- Calculates the sample standard deviation of a set of
values, ignoring missing data (e.g., null values). The sample standard deviation
is a measure of the amount of variation or dispersion in a sample of a
population. A low standard deviation indicates that the values tend to be close
to the mean of the set, while a high standard deviation indicates that the
values are spread out over a wider range.
stddev
is an alias for stddev_samp
.
#
Argumentsvalue
is any numeric value.
#
Return valueReturn value type is double
.
#
Examplesstddev_samp |
---|
29.011491975882 |
#
stddev_popstddev_pop(value)
- Calculates the population standard deviation of a set of
values. The population standard deviation is a measure of the amount of
variation or dispersion of a set of values. A low standard deviation indicates
that the values tend to be close to the mean of the set, while a high standard
deviation indicates that the values are spread out over a wider range.
#
Argumentsvalue
is any numeric value.
#
Return valueReturn value type is double
.
#
Examplesstddev_samp |
---|
28.86607004772212 |
#
string_aggstring_agg(value, delimiter)
- Concatenates the given string values into a
single string with the delimiter used as a value separator.
#
Argumentsvalue
is astring
value.delimiter
is achar
value.
#
Return valueReturn value type is string
.
#
Examplesstring_agg |
---|
1,2,3,4,5 |
#
sumsum(value)
- adds values ignoring missing data (e.g null
values).
#
Argumentsvalue
is any numeric value.
#
Return valueReturn value type is the same as the type of the argument.
#
Examplessum |
---|
100 |
item | count |
---|---|
apple | 53 |
orange | 47 |
#
Overflowsum
does not perform overflow check. To avoid overflow, you can cast the
argument to wider type.
#
variance / var_sampvar_samp(value)
- Calculates the sample variance of a set of values. The
sample variance is a measure of the amount of variation or dispersion of a set
of values in a sample from a population. A low variance indicates that the
values tend to be very close to the mean, while a high variance indicates that
the values are spread out over a wider range.
variance()
is an alias for var_samp
.
#
Argumentsvalue
is any numeric value.
#
Return valueReturn value type is double
.
#
Examplesstddev_samp |
---|
841.666666666666 |
#
var_popvar_pop(value)
- Calculates the population variance of a set of values. The
population variance is a measure of the amount of variation or dispersion of a
set of values. A low variance indicates that the values tend to be very close to
the mean, while a high variance indicates that the values are spread out over a
wider range.
#
Argumentsvalue
is any numeric value.
#
Return valueReturn value type is double
.
#
Examplesstddev_samp |
---|
833.25 |
#
vwapvwap(price, quantity)
- Calculates the volume-weighted average price (VWAP)
based on the given price and quantity columns. This is a handy replacement for
the sum(price * quantity) / sum(quantity)
expression.
#
Argumentsprice
is any numeric price value.quantity
is any numeric quantity value.
#
Return valueReturn value type is double
.
#
Examplesvwap |
---|
67 |