datasketches 1.0.0

This Release

datasketches 1.0.0

Date

2019-01-26

Status

Stable

Latest Stable

datasketches 1.7.0 — 2023-12-14

Other Releases

Abstract

approximate algorithms for big data analysis

Released By

saydakov

License

PostgreSQL

Resources

Special Files

Extensions

cpc_sketch 1.0.0: CPC sketch for approximate distinct counting
kll_float_sketch 1.0.0: KLL quantiles sketch for approximating distributions of float values (quanitles, ranks, histograms)

README

Examples

Distinct counting with CPC sketch

Suppose 100 million random integer values uniformly distributed in the range from 1 to 100M have been generated and inserted into a table

Exact count distinct:

$ time psql test -c "select count(distinct id) from random_ints_100m;"
  count
----------
 63208457
(1 row)

real    1m59.060s

Approximate count distinct:

$ time psql test -c "select cpc_sketch_distinct(id) from random_ints_100m;"
 cpc_sketch_distinct 
---------------------
    63423695.9451363
(1 row)

real    0m20.680s

Note that the above one-off distinct count is just to show the basic usage. Most importantly, the sketch can be used as an "additive" distinct count metric in a data cube.

Merging sketches:

create table cpc_sketch_test(sketch cpc_sketch);
insert into cpc_sketch_test select cpc_sketch_build(1);
insert into cpc_sketch_test select cpc_sketch_build(2);
insert into cpc_sketch_test select cpc_sketch_build(3);
select cpc_sketch_get_estimate(cpc_sketch_merge(sketch)) from cpc_sketch_test;
cpc_sketch_get_estimate
-------------------------
        3.00024414612919

Estimating quanitles and ranks with KLL sketch

Table "normal" has 1 million values from the normal distribution with mean=0 and stddev=1. We can build a sketch, which represents the distribution (create table kll_float_sketch_test(sketch kll_float_sketch)):

$ psql test -c "insert into kll_float_sketch_test select kll_float_sketch_build(value) from normal";
INSERT 0 1

We expect the value with rank 0.5 (median) to be approximately 0:

$ psql test -c "select kll_float_sketch_get_quantile(sketch, 0.5) from kll_float_sketch_test";
 kll_float_sketch_get_quantile 
-------------------------------
                    0.00648344

In reverse: we expect the rank of value 0 (true median) to be approximately 0.5:

$ psql test -c "select kll_float_sketch_get_rank(sketch, 0) from kll_float_sketch_test";
 kll_float_sketch_get_rank 
---------------------------
                  0.496289

Note that the normal distribution was used just to show the basic usage. The sketch does not make any assumptions about the distribution.

PGXN

PostgreSQL Extension Network

datasketches 1.0.0

Extensions

README

Contents

Examples

Distinct counting with CPC sketch

Estimating quanitles and ranks with KLL sketch