Extensions
 aggs_for_arrays 1.3.0
 Functions to compute statistics on arrays of numbers
Documentation
 CHANGELOG
 CHANGELOG
README
Contents
aggs_for_arrays
This Postgres extension provides various functions for operating on arrays, for instance taking the histogram of an array of numbers.
These functions are useful because if you have a lot values you want to aggregate,
queries that fetch each value from a separate row can have poor performance.
Storing all the values in a single row as a Postgres array
can drastically improve query performance.
For instance, computing a 1000bucket histogram on one million float values
stored in separate rows took 12 seconds in a simple benchmark,
compared to 27 milliseconds with the array_to_hist
function.
Using arrays in this way is a bit like a poorman's columnstore database: it lets you keep all the values for one attribute in one place. (It's also a bit like how in R and Pandas you often see parallel arrays rather than arrays of objects.) To simplify a little, imagine pulling one million floats off disk in a single 8MB chunk instead of asking the drive for one million separate reads. I wouldn't use this pattern if your arrays get updated a lot, lest you take a hit on writes (At least test first!), but if they are pretty stable then using arrays can greatly speed up reads.
With such an approach you could still use SQL or PLPGSQL, but these functions outperform such code, because the Postgres C API lets you skip a lot of the work for interfacing at those higher levels. For instance, the same benchmark gave 398ms for a SQL solution and 12 seconds for a plpgsql solution. We show further benchmark results below.
Note that despite the name, these functions are not true aggregate functions
(summarizing multiple rows).
Rather they do aggregatelike calculations on a single input array.
If you want actual aggregates that take multiple input arrays,
then you might be looking for my other extension, aggs_for_vecs
.
If this extension takes a columnstore approach to your data, that one takes a rowstore approach.
Installing
This package installs like any Postgres extension. First say:
make && sudo make install
You will need to have pg_config
in your path,
but normally that is already the case.
You can check with which pg_config
.
Then in the database of your choice say:
CREATE EXTENSION aggs_for_arrays;
Functions
The available functions are described below.
In general, these functions accept arrays of any integer or floatingpoint type,
namely SMALLINT
, INTEGER
, BIGINT
, REAL
, or DOUBLE PRECISION
(aka FLOAT
).
The return value will either be the same type (e.g. for a minimum),
a FLOAT (e.g. for a mean),
or an INTEGER
type (e.g. for histogram bucket counts).
If a function can take any numeric type,
its types are shown as T
.
INTEGER[] array_to_hist(values T[], bucket_start T, bucket_width T, bucket_count INTEGER)
Returns the bucket count based on the values and bucket characteristics you request.
INTEGER[] array_to_hist_2d(x_values T[], y_values T[], x_bucket_start T, y_bucket_start T, x_bucket_width T, y_bucket_width T, x_bucket_count INTEGER, y_bucket_count INTEGER)
Returns the bucket count as a 2D array based on the values and bucket characteristics you request.
The data arrays x_values
and y_values
must be the same length.
We compare each array's first element and plot it, then their second element, etc.
If either x_values
or y_values
is NULL
, the whole result is NULL
. If either contains a NULL
, then that position isn't plotted.
FLOAT array_to_mean(values T[])
Returns the mean of all the values in the array.
FLOAT array_to_median(values T[])
Returns the median. Does not require a presorted input. If there are an even number of values, returns the mean of the two middle values.
FLOAT sorted_array_to_median(values T[])
Just like array_to_median
, but assumes values
is already sorted.
FLOAT array_to_mode(values T[])
Returns the mode. Does not require a presorted input. If there are several values tied for most common, returns their mean.
FLOAT sorted_array_to_mode(values T[])
Just like array_to_mode
, but assumes values
is already sorted.
FLOAT array_to_percentile(values T[], percentile FLOAT)
Returns the percentile you request,
where percentile
is a number from 0 to 1 inclusive.
Asking for 0 will always give the minimum,
1 for maximum, and 0.5 the median.
If you ask for a percentile that lands between two data points,
we return a linear interpolation between them.
FLOAT sorted_array_to_percentile(values T[], percentile FLOAT)
Just like array_to_percentile
, but assumes values
is already sorted.
FLOAT[] array_to_percentiles(values T[], percentiles FLOAT[])
Just like array_to_percentile
,
but you can pass several percentiles
and get the result for each in a single call.
FLOAT[] sorted_array_to_percentiles(values T[], percentiles FLOAT[])
Just like array_to_percentiles
, but assumes values
is already sorted.
T array_to_max(values T[])
Returns the greatest value in the array.
T array_to_min(values T[])
Returns the least value in the array.
T[] array_to_min_max(values T[])
Returns a tuple with the min in position 1 and the max in position 2.
FLOAT array_to_skewness(values T[])
Computes the skewness of the given values.
FLOAT array_to_kurtosis(values T[])
Computes the kurtosis of the given values.
Benchmarks
Assume you have two tables:
CREATE TABLE samples (
id INTEGER PRIMARY KEY,
measurement_id INTEGER NOT NULL,
value FLOAT NOT NULL
);
CREATE TABLE sample_groups {
id INTEGER PRIMARY KEY,
measurement_id INTEGER NOT NULL,
values FLOAT[] NOT NULL
};
These tables store the same information,
but samples
stores each sample in a separate row,
and sample_groups
stores a whole group in just one row.
You can run bench.sh
to test the performance of various approaches:
 SQL on
samples
.  SQL on
sample_groups
.  PLPGSQL on
sample_groups
.  The
aggs_for_arrays
function onsample_groups
.
The sorted_array_to_*
methods use sorted_samples
and sorted_sample_groups
instead.
 function  SQL rowbased  SQL arraybased  PLPGSQL arraybased  aggs_for_arrays

:::::
 array_to_hist
 12218.1 ms  398.235 ms  12310.800 ms  26.936 ms 
 array_to_mean
 10630.0 ms  121.677 ms  390.983 ms  25.226 ms 
 array_to_median
 33587.0 ms  1163.070 ms  1258.160 ms  47.996 ms 
 sorted_array_to_median
 23239.5 ms  30.107 ms  41.225 ms  14.835 ms 
 array_to_mode
 13724.1 ms  1505.310 ms  1552.610 ms  201.943 ms 
 sorted_array_to_mode
 13195.2 ms  1474.130 ms  1577.770 ms  45.171 ms 
 array_to_percentile
 24218.2 ms  2591.240 ms  1698.570 ms  179.916 ms 
 sorted_array_to_percentile
 24305.5 ms  2102.520 ms  1204.140 ms  21.947 ms 
 array_to_percentiles
 32367.0 ms  10735.300 ms  3608.800 ms  188.752 ms 
 sorted_array_to_percentiles
 32294.3 ms  10153.300 ms  3120.830 ms  22.227 ms 
 array_to_max
 10613.2 ms  115.094 ms  398.791 ms  17.321 ms 
 array_to_min
 10600.5 ms  113.859 ms  400.926 ms  17.204 ms 
 array_to_min_max
 10727.9 ms  169.226 ms  824.539 ms  23.922 ms 
 array_to_skewness
 22267.2 ms  802.463 ms  1077.630 ms  120.925 ms 
 array_to_kurtosis
 22253.1 ms  806.296 ms  1075.960 ms  112.210 ms 