Python Wrapper for Datasketches
Installation
The easiest way to install the Python wrapper is to run:
pip install git+https://github.com/apache/incubator-datasketches-cpp.git
If you prefer to download the source first, be sure to clone the repo with --recursive
to ensure you get the Python binding library (pybind11):
git clone --recursive https://github.com/apache/incubator-datasketches-cpp.git
cd incubator-datasketches-cpp
pip install .
In the event you do not have pip installed, you can invoke the setup script directly by replacing the last line above with python3 setup.py install
Usage
Once the library is installed, loading it in Python is simple: from datasketches import *
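As a minimal sketch of what usage can look like after the import, the example below feeds values into a kll_floats_sketch and asks for an approximate median. The method names (update, get_quantile, get_n) and the k argument are assumptions based on the C++ API that the wrapper mirrors, and they may differ between releases. The helper approximate_median is hypothetical. The import is guarded so the snippet still loads on a machine where the package is not installed.

```python
try:
    from datasketches import kll_floats_sketch
except ImportError:
    # Package not installed; approximate_median() will raise if called.
    kll_floats_sketch = None


def approximate_median(values, k=200):
    """Return (median_estimate, item_count) for an iterable of numbers.

    k is assumed to be the KLL size/accuracy parameter, as in the C++ API.
    """
    if kll_floats_sketch is None:
        raise RuntimeError("datasketches is not installed")
    sketch = kll_floats_sketch(k)
    for v in values:
        sketch.update(float(v))
    # get_quantile takes a normalized rank in [0, 1]; 0.5 asks for the median.
    return sketch.get_quantile(0.5), sketch.get_n()


if kll_floats_sketch is not None:
    median, n = approximate_median(range(101))
    print(n, median)
```

Importing only the classes you need, as above, keeps the namespace cleaner than a wildcard import, although both forms load the same module.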
Available Sketch Classes
- KLL
    - kll_ints_sketch
    - kll_floats_sketch
- Frequent Items
    - frequent_strings_sketch
    - Error types are frequent_items_error_type.{NO_FALSE_NEGATIVES | NO_FALSE_POSITIVES}
- Theta
    - update_theta_sketch
    - compact_theta_sketch (cannot be instantiated directly)
    - theta_union
    - theta_intersection
    - theta_a_not_b
- HLL
    - hll_sketch
    - hll_union
    - Target HLL types are tgt_hll_type.{HLL_4 | HLL_6 | HLL_8}
- CPC
    - cpc_sketch
    - cpc_union
Known Differences from C++
The Python API largely mirrors the C++ API, with a few minor exceptions. The primary known difference is that Python on modern platforms does not support unsigned integer values or numeric values narrower than 64 bits. As a result, you may not be able to produce sketches from within Python that are identical to those produced with Java and C++. Loading sketches that were serialized from another language will work as expected.
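To make the width issue concrete, the snippet below uses only the standard-library struct module to show how the same value has different byte representations depending on the fixed-width type chosen. It does not reproduce the library's internal hashing or storage. It only illustrates why a value supplied as a 32-bit type in C++ or Java need not match the 64-bit value Python hands to the wrapper.

```python
import struct

value = 42

# The same integer encoded as an unsigned 32-bit and a signed 64-bit value.
as_u32 = struct.pack("<I", value)
as_i64 = struct.pack("<q", value)
print(len(as_u32), len(as_i64))  # 4 8

# The same decimal literal encoded as a 32-bit and a 64-bit float.
as_f32 = struct.pack("<f", 0.1)
as_f64 = struct.pack("<d", 0.1)

# Widening the 32-bit float back to a Python float does not recover 0.1,
# so the two encodings represent different numbers.
f32_round_trip = struct.unpack("<f", as_f32)[0]
f64_round_trip = struct.unpack("<d", as_f64)[0]
print(f32_round_trip == 0.1, f64_round_trip == 0.1)  # False True
```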
We have also removed the reliance on a builder class for theta sketches, since Python allows named arguments to the constructor rather than strictly positional ones.
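A rough sketch of what constructing a theta sketch with a named argument might look like follows. The keyword name lg_k and the methods update and get_estimate are assumptions carried over from the C++ builder and sketch API, so check them against the installed version. The helper count_distinct is hypothetical, and the import is guarded so the snippet loads even without the package.

```python
try:
    from datasketches import update_theta_sketch
except ImportError:
    # Package not installed; count_distinct() will raise if called.
    update_theta_sketch = None


def count_distinct(items, lg_k=12):
    """Estimate the number of distinct items using a theta sketch.

    lg_k is assumed to be log2 of the nominal number of retained entries,
    passed by name instead of through a builder object.
    """
    if update_theta_sketch is None:
        raise RuntimeError("datasketches is not installed")
    sketch = update_theta_sketch(lg_k=lg_k)
    for item in items:
        sketch.update(item)
    return sketch.get_estimate()


if update_theta_sketch is not None:
    # 5000 updates, but only 1000 distinct integer values.
    print(count_distinct(i % 1000 for i in range(5000)))
```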