count_distinct

This Release
count_distinct 1.2.1
Date
Status
Stable
Other Releases
Abstract
Aggregate for computing number of distinct values using a hash table.
Description
The regular COUNT(DISTINCT ...) always performs a sort, which results in bad performance if there's a lot of duplicate values. This extension implements custom count_distinct aggregate function that uses a hash table to achieve the same purpose. So far there are only INT and BIGINT aggregates, but extending it to other data types should not be difficult.
Released By
tomasv
License
The (three-clause) BSD License
Resources
Special Files
Tags

Extensions

count_distinct 1.2.1

README

COUNT_DISTINCT aggregate

This extension provides a hash-based alternative to COUNT(DISTINCT ...) which for large amounts of data often ends in sorting and bad performance.

Functions

There's a single polymorphic aggregate function, handling all fixed length data types passed by value (i.e. up to 8B values on 64-bit machines):

  • count_distinct(p_value anyelement)

Extending the same approach to other data types (varlena or passed by reference) should be rather straight-forward and I'll do that eventually. But it's important to be very careful about memory consumption, as the hash-based approach keeps everything in RAM).

Performance

So, what's wrong with plain COUNT(DISTINCT ...). Let's use this table for some tests

CREATE TABLE test_table (id INT, val INT);

INSERT INTO test_table
     SELECT mod(i, 1000), (1000 * random())::int
       FROM generate_series(1,10000000) s(i);

ANALYZE test_table;

Now, let's try this query

SELECT id, COUNT(DISTINCT val) FROM test_table GROUP BY 1

which is executed like this

GroupAggregate  (cost=1443649.74..1518660.10 rows=1000 width=8)
  ->  Sort  (cost=1443649.74..1468649.86 rows=10000048 width=8)
        Sort Key: id
        ->  Seq Scan on test_table  (cost=0.00..144248.48 rows=...
(4 rows)

On my machine, it takes between 11.5 and 12 seconds, no matter what, and about ~90% of the time is spent on the sort. So let's see if we can do that without the sort faster using the COUNT_DISTINCT() aggregate:

SELECT id, COUNT_DISTINCT(val) FROM test_table GROUP BY 1

which results in an explain plan like this:

HashAggregate  (cost=194248.72..194261.22 rows=1000 width=8)
  ->  Seq Scan on test_table  (cost=0.00..144248.48 rows=10000048 ...
(2 rows)

This aggregate function takes ~4.1 seconds and produces exactly the same results (but unsorted).

Issues

The current implementation works only with fixed-length values passed by value (i.e. limited by the pointer size), but it should be rather simple to extend this to other data types. One way to overcome this limitation is hashing the value into a 32/64-bit integers, and then passing these hash values to count_distinct (see https://github.com/tvondra/pghashlib for a good library of hash functions). However be careful as this effectively turns count_distinct into an estimator.

If an estimator is sufficient for you, maybe postgresql-hll or one of the estimators at distinct_estimators would be a better solution for you?

With the previous implementation (based on hash tables), memory consumption was a big problem. For example when counting 80M unique 32-bit integers, it was common to see more than 5GB of RAM allocated (which is way more than the 320MB necessary for the values, and ~1.6GB when including some hash table related overhead (buckets, pointers, ...). This was mostly due to clashing with MemoryContext internals, etc.

With the new implementation significantly improves this, and the memory consumption is a fraction (usually less than 10-20% of what it used to be).

Still, it may happen that you run out of memory. It's not very likely because for large number of groups planner will switch to GroupAggregate (effectively keeping a single group in memory), but it's possible.

Sadly, that is not something the extension could handle internally in a reasonable way. The only actual solution is to implement this into HashAggregate itself (some people are working on this, but don't hold your breath - it won't happen before 9.5).

So in short - if you're dealing with a lot of distinct values, you need a lot of RAM in the machine.

Installation

Installing this is very simple, especially if you're using pgxn client. All you need to do is this:

$ pgxn install count_distinct
$ pgxn load -d mydb count_distinct

and you're done. You may also install the extension manually:

$ make install
$ psql dbname -c "CREATE EXTENSION count_distinct"

And if you're on an older version (pre-9.1), you have to run the SQL script manually

$ psql dbname < `pg_config --sharedir`/contrib/count_distinct--1.0.0.sql

That's all.

License

This software is distributed under the terms of BSD 2-clause license. See LICENSE or http://www.opensource.org/licenses/bsd-license.php for more details.