Contents
- In-Memory Columnar Store (IMCS)
- Introduction
- Overview
- Functions
- General columnar store functions
- Generated data manipulation functions
- Generated data access functions for single timeseries
- Generated data access functions for multiple timeseries (identified by timeseries ID)
- Timeseries constructors
- Binary operations
- Unary operations
- Mathematical functions
- Date/time functions
- Binary scalar functions
- Timeseries transformation functions
- Grand aggregates
- Group-by aggregates
- Grid aggregates
- Window (moving) aggregates
- Hash aggregates (group-by using hash function)
- Cumulative aggregates
- Sort functions
- Special functions
- Operators
- Projection issues
- Implementation details
- Installation and tuning
- Performance comparison
- License
In-Memory Columnar Store (IMCS)
- Introduction
- Overview
- Functions
- General columnar store functions
- Generated data manipulation functions
- Generated data access functions for single timeseries
- Generated data access functions for multiple timeseries (identified by timeseries ID)
- Timeseries constructors
- Binary operations
- Unary operations
- Mathematical functions
- Date/time functions
- Binary scalar functions
- Timeseries transformation functions
- Grand aggregates
- Group-by aggregates
- Grid aggregates
- Window (moving) aggregates
- Hash aggregates (group-by using hash function)
- Cumulative aggregates
- Sort functions
- Special functions
- Operators
- Projection issues
- Implementation details
- Installation and tuning
- Performance comparison
- License
Introduction
Columnar store or vertical representation of data allows to achieve better performance in comparison with classical horizontal representation due to three factors:
- Reducing size of fetched data: only columns involved in query are accessed.
- Vector operations. Applying an operator to set of values (tile) makes it possible to minimize interpretation cost. Also SIMD instructions of modern processors accelerate execution of vector operations.
- Compression of data. Certainly compression can also be used for all the records, but independent compression of each column can give much better results without significant extra CPU overhead. For example such simple compression algorithm like RLE (run-length-encoding) allows not only to reduce used space, but also minimize number of performed operations.
Overview
As it is clear from the abbreviation (IMCS: In-Memory Columnar Store) this plugin adds to PostgreSQL in-memory columnar store. So vertical representation of data is complementary to standard horizontal representation. Data is imported in PostgreSQL database in usual way and is stored in normal table. Then columns from this table are fetched and stored in shared memory. IMCS provides a lot of timeseries functions which can be used for data analysis. Operations with timeseries are performed in vector mode allowing to reach maximal possible speed of such operations. Also IMCS makes it possible to parallelize execution of some queries (for example calculation of aggregates) and utilize all CPU cores. All this three factors: in-memory location of data, vector operations, parallel query execution, makes it possible to increase speed of some queries more than 100 times comparing with standard PostgreSQL queries.
To make access to timeseries as convenient as possible, IMCS provides generator of access functions. You should specify name of source table or view (from which data will be imported), name of timestamp field (this is a main key by which timeseries elements are accessed) and optionally timeseries identifier. The last one needs some explanations. In some cases all data from the table should be placed in a single timeseries. For example assume that we collect data about phone calls (date, duration, caller, callee,...). It will be a single timeseries. But for example in trading systems there are separate data (ticks) associated with each symbol. So we have separate timeseries for ABB, GOOG, IBM, YHOO,... In this case securities identifier (symbol) can be considered as identifier of timeseries.
IMCS supports the following element type for timeseries: "char", int2, int4, date, int8, time, timestamp, float4, float8, bpchar
.
All timeseries elements should have the same size, so only fixed size character types are supported: for example char(10)
, but not varchar
. Also IMCS is not able to represent NULL values. It is not enforced that fields of the source table were declared as NOT NULL
, but attempt to insert NULL value in timeseries will cause error (or optionally NULL can be substituted with zero). Please use default values instead of NULLs.
Given all this information IMCS generates corresponding types and functions for loading/appending/accessing this timeseries.
Assume that we have table Quote
. After calling cs_create('Quote', 'Day', 'Symbol')
we will get Quote_load()
function for loading data from table in memory,
Quote_get(symbol char(10), low date, high date)
function for fetching/slicing corresponding timeseries and triggers which will keep track updates in Quote
table and propagate this changes to timeseries.
There are two ways of synchronizing original table and timeseries:
- Automatic: using triggers. In this case all inserts/deletes in original table are immediately reflected in timeseries.
- Manual: using explicit invocation of load/append/delete methods.
load()
is significantly more efficient than propagation of updates using triggers. Mostly because of slowness of PL/pgSQL.
Also please notice, that been stored in shared memory, timeseries have to be reloaded after restart of
the server. Unfortunately PostgreSQL doesn't support database level triggers (like after startup on database
in Oracle).
IMCS provides two alternatives: use autoload mode or manually load data. In case of using autoload mode, data will be automatically loaded from table to columnar store on demand when it is first accessed by any query. Please notice that for large tables loading data can take substantial amount of time and so increase execution time of the query initiated this load (it can confuse an user which expects this query to complete very fast).
Fortunately database servers are not restarted frequently...
When data is loaded from the table, records are sorted by timestamp and inserted in ascending order. You can append data to existed timeseries, but timestamps of inserted elements should be greater than already loaded. When timeseries is populated using insert trigger it is necessary to enforce that the data is inserted in the table in timestamp ascending order. Otherwise out-of-order error will be reported while inserting element in timeseries.
TABLE_get
functions returns row of type TABLE_timeseries
(this type is also generated by IMCS) which has the same columns as original table, but type of this columns is timeseries
. So it is possible to refer to this timeseries as to any other columns and apply timeseries functions to them. For example query:
select cs_max(Close) from Quote_get('IBM');returns maximal close price for IBM. IMCS provide standard operators for timeseries type, allowing to write queries with more complex expressions in standard way:
select cs_avg(High - (Open + Close)/2) from Quote_get('IBM');Result of the query above is scalar value (because of used grand aggregate). But most of timeseries functions take timeseries as input and return also timeseries. For example result of the query below is timeseries:
select cs_filter(Open When you print result of execution of this query at the screen (for example by running this query in psql), it will be represented as large string literal in braces: 'date:{01/01/2010, 01/02/2010,...}' Certainly it is not convenient for really large timeseries and may even cause memory exhaustion. Alternatively it is possible to change vertical representation back to horizontal representation usingTABLE_project
orcs_project
functions. Then produced tuples can be accessed in normal way using all SQL stuff. For example it is possible to sort them or perform more grouping/filtering.Functions
General columnar store functions
General columnar store functions are used to generate table-specific API functions, get information about columnar store and perform cleanup.
Function | Description |
---|---|
function cs_create(table_name text, timestamp_id text, timeseries_id text default null, autoupdate bool default false) returns void |
This function is used to generate all API functions, types and triggers for the specified table or view table_name .
These can be latter removed using table_name_drop function. timestamp_id is name of timestamp field by which timeseries elements are sorted in ascending order, allowing to efficiently extract time slices. timeseries_id is optional field identifying timeseries. For example for quotes it can be a symbol name. If this field is specified, then separate timeseries will be maintained for each symbol. If autoupdate parameter is true, then IMCS will create triggers which automatically update timeseries when new data is added/deleted to/from the source table. Alternatively it is possible to explicitly load/append/delete data to timeseries. Please notice that explicit bulk update/delete is significantly more efficient than row-level updated performed by trigger. If columnar store interface for a table was generated with autoupdate=false , then triggers are still generated but are disabled. You can enable them later using alter table TABLE enable trigger user command. As far as views cannot have row-level BEFORE or AFTER triggers in PostgreSQL, IMCS doesn't generate them if table_name is a view. |
function cs_delete_all() returns bigint |
Deletes all timeseries in columnar store. This function can be used for most efficient cleanup of columnar store. Please notice that PostgreSQL doesn't allow to free shared memory, so it still be in use. But it can be reused in subsequent allocation requests of columnar store. This function returns total number of removed elements (in all timeseries) |
function cs_used_memory() returns bigint |
Returns amount of memory used by columnar store. |
function cs_profile(reset bool default false) setof cs_profile_item |
Returns number of calls of each IMCS command. If parameter is true, then all counters
are reset after execution of this call. |
Generated data manipulation functions
Generated functions for loading/storing/deleting timeseries.
Function | Description |
---|---|
function TABLE_drop() returns void |
Deletes all generated functions and types for table TABLE. |
function TABLE_load(already_sorted bool default false, filter text default null) returns bigint |
Populates timeseries with data from PostgreSQL table. If already_sorted parameter is true, then it is assumed that
records in the table are stored in proper (timestamp ascending) order. Otherwise IMCS will add "order by" clause
to select statement. Please notice that PostgreSQL vacuuming can change original order of the records. So disable vacuuming for the
table if you want to preserve insert order. Optional filter parameter allows to specify additional selection criteria for table records. It allows to include in timeseries only some subset of the table.
Particularly it can be used to append existed timeseries with new data.
This function returns number of inserted timeseries elements. If filter is not specified then this function loads data from the table only if timeseries are not yet initialized. If filter is not null, then this functions always tries to load data, assuming that programmer has specified proper filter condition allowing to avoid duplicates and preserve proper timeseries order. If filter is null and timeseries are already initialized, then this function does nothing and immediately returns zero. |
function TABLE_append(start_from TIMESTAMP_TYPE) returns bigint |
Appends to timeseries records from the source table starting from start_from timestamp (inclusive).
Use this function if on-update trigger is disabled (autoupdate=false in parameters of cs_create ).
Please also notice that this function is implemented in PL/pgSQL and so it is significantly slower than TABLE_load with the same filter condition.
This function returns number of added timeseries elements. |
function TABLE_truncate() returns void |
Truncates all timeseries for this table. This is most efficient way to delete vertical representation for the specific table.
If you need to delete all data in columnar store, better use cs_delete_all() function.
|
Generated data access functions for single timeseries
Functions generated for accessing single timeseries (timeseries having no identifier).
Function | Description |
---|---|
function TABLE_first() returns TIMESTAMP_TYPE |
Returns oldest timestamp. |
function TABLE_last() returns TIMESTAMP_TYPE |
Returns most recent timestamp. |
function TABLE_count() returns bigint |
Returns number of elements in timeseries. |
function TABLE_get(low TIMESTAMP_TYPE default null, high TIMESTAMP_TYPE default null) returns TABLE_timeseries |
Returns vertical representation of the whole table or its time slice. Returned record contains the same columns as record of the original table, but them have timeseries type instead of original scalar types. These columns can be used in timeseries functions (cs_*). If high or low parameters are not null, then them specify correspondingly upper/lower inclusive boundary for timestamp value. If some or both parameters are omitted, then corresponding boundary is open. |
function TABLE_span(from_pos bigint default 0, till_pos bigint default 9223372036854775807) returns TABLE_timeseries |
Returns vertical representation of the whole table or its horisonal slice. Returned record contains the same columns as record of the original table, but them have timeseries type instead of original scalar types. These columns can be used in timeseries functions (cs_*).
Parameter from_pos specifies start position in timeseries (inclusive) and parameter till_pos specifies end position (inclusive). If till_pos parameter is missed, then subsequence spans till end of timeseries. Values of both from_pos and till_pos parameters can be negative. In this case position is calculated from end of timeseries, i.e. TABLE_span(from_pos:=-1) extracts last element of the timeseries. |
function TABLE_delete(low TIMESTAMP_TYPE, high TIMESTAMP_TYPE) returns bigint |
Deletes timeseries elements belonging to the specified interval. If high or low parameters are not null, then them specify correspondingly upper/lower inclusive boundary for timestamp value. If some or both parameters are nulls, then corresponding boundary is open. This function returns number of deleted elements. |
function TABLE_delete(till TIMESTAMP_TYPE default null) returns bigint |
Deletes timeseries elements from the beginning till specified timestamp till (inclusive) or delete all elements if this parameter is null/omitted.
This function is equivalent to TABLE_delete(null, till) . IMCS provides separate function for it because it is intended to be the
most frequent case of deleting elements from timeseries: it correponds to shifting data window when new elements are appended and
deteriorated are thrown away. This function returns number of deleted elements. |
function TABLE_project(input TABLE_timeseries input, positions timeseries default null,disable_caching bool default false) returns setof TABLE |
Makes horizontal projection of timeseries. Optional positions parameter specifies positions of selected timeseries elements.
If parameter positions is omitted, then all timeseries elements are transformed to horizontal representation.
So this function is opposite to TABLE_get() : get transforms horizontal representation to vertical and project does backward transformation. It is possible to use this function only if number of columns returned by TABLE_get()
and element types of corresponding timeseries are not changed. For example it is possible to run query like this:
select (Quote_project(abb.*,cs_top_max_pos(Close, 10))).* from Quote_get('ABB', date('01-Jan-2010'), date('31-Mar-2010')) abb; select (Quote_project(abb.*)).* from (select Symbol,Day,cs_maxof(Open,Close),High,Low,cs_minof(Open,Close),Volume from Quote_get('ABB')) abb;but not select (Quote_project(abb.*)).* from (select Symbol,cs_maxof(Open,Close) from Quote_get('ABB')) abb;In the last case it is possible to use cs_project() function:
select cs_project(abb.*) from (select Symbol,cs_maxof(Open,Close) from Quote_get('ABB')) abb;Please notice that we can not use ().* clause in this case because cs_project returns anonymous row.
But in PostgreSQL 9.3 we can use lateral join:
select p.* from (select Symbol,cs_maxof(Open,Close) from Quote_get('ABB')) abb, cs_project(abb.*) p(symbol char(10), max real);Please find more information about projection of timeseries, problems with (...).* construction in PostgreSQL and
purpose of disable_caching parameter in section Projection issues.
|
Generated data access functions for multiple timeseries (identified by timeseries ID)
Functions generated for accessing multiple timeseries (source table contains identifier of timeseries, for example 'Symbol').
Function | Description |
---|---|
function TABLE_first(id TIMESERIES_ID_TYPE) returns TIMESTAMP_TYPE |
Returns oldest timestamp. |
function TABLE_last(id TIMESERIES_ID_TYPE) returns TIMESTAMP_TYPE |
Returns most recent timestamp. |
function TABLE_count(id TIMESERIES_ID_TYPE) returns bigint |
Returns number of elements in timeseries. |
function TABLE_get(id TIMESERIES_ID_TYPE, TIMESTAMP_TYPE low default null, TIMESTAMP_TYPE high default null) returns TABLE_timeseries |
Returns timeseries with specified identifier for the corresponding table or its time slice. Returned record contains the same columns as record of original table, but them have timeseries type instead of original scalar types. These columns can be used in timeseries functions (cs_*). If high or low parameters are not null, then them specify correspondingly upper/lower inclusive boundary for timestamp value. If some or both parameters are omitted, then corresponding boundary is open. |
function TABLE_get(id TIMESERIES_ID_TYPE[], TIMESTAMP_TYPE low default null, TIMESTAMP_TYPE high default null) returns setof TABLE_timeseries |
Does the same as function described above but for array of timeseries identifiers. For each timeseries identifier this function returns TABLE_timeseries record, so output will contain as much rows as there are identifiers. |
function TABLE_span(id TIMESERIES_ID_TYPE, from_pos bigint default 0, till_pos bigint default 9223372036854775807) returns TABLE_timeseries |
Returns timeseries with specified identifier for the corresponding table or its horisonal slice. Returned record contains the same columns as record of the original table, but them have timeseries type instead of original scalar types. These columns can be used in timeseries functions (cs_*).
Parameter from_pos specifies start position in timeseries (inclusive) and parameter till_pos specifies end position (inclusive). If till_pos parameter is missed, then subsequence spans till end of timeseries. Values of both from_pos and till_pos parameters can be negative. In this case position is calculated from end of timeseries, i.e. TABLE_span(id,from_pos:=-1) extracts last element of the timeseries. |
function TABLE_span(id TIMESERIES_ID_TYPE[], from_pos bigint default 0, till_pos bigint default 9223372036854775807) returns setof TABLE_timeseries |
Does the same as function described above but for array of timeseries identifiers. For each timeseries identifier this function returns TABLE_timeseries record, so output will contain as much rows as there are identifiers. |
function TABLE_concat(id TIMESERIES_ID_TYPE[], TIMESTAMP_TYPE low default null, TIMESTAMP_TYPE high default null) returns TABLE_timeseries |
Concatenates slices of timeseries for the specified identifiers. Returned record contains the same columns as record of original table, but them have timeseries type instead of original scalar types. Each such timeseries is concatenation of slices of timeseries for all specified identifiers. These columns can be used in timeseries functions (cs_*). If high or low parameters are not null, then them specify correspondingly upper/lower inclusive boundary for timestamp value. If some or both parameters are omitted, then corresponding boundary is open. |
function TABLE_delete(id TIMESERIES_ID_TYPE, low TIMESTAMP_TYPE, high TIMESTAMP_TYPE) returns bigint |
Deletes timeseries elements belonging to the specified interval. If high or low parameters are not null, then them specify correspondingly upper/lower inclusive boundary for timestamp value. If some or both parameters are nulls, then corresponding boundary is open. This function returns number of deleted elements. |
function TABLE_delete(id TIMESERIES_ID_TYPE, till TIMESTAMP_TYPE default null) returns bigint |
Deletes timeseries elements from the beginning till specified timestamp till (inclusive) or delete all elements if this parameter is null/omitted.
This function is equivalent to TABLE_delete(id, null, till) . IMCS provides separate function for it because it is intended to be the
most frequent case of deleting elements from timeseries: it correponds to shifting data window when new elements are appended and
deteriorated are thrown away. This function returns number of deleted elements. |
function TABLE_project(input TABLE_timeseries input, positions timeseries default null,disable_caching bool default false) returns setof TABLE |
Makes horizontal projection of timeseries. Optional positions parameter specifies positions of timeseries elements.
If skipped, then all timeseries elements are transformed to horizontal representation.
So this function is opposite to TABLE_get: get transforms horizontal representation to vertical and project does backward transformation. See comments to project function in the previous section. |
Timeseries constructors
Functions constructing constant timeseries (timeseries of repeated value) or timseries created by parsing string literal.
Function | Description |
---|---|
function cs_parse(str text, elem_type cs_elem_type, elem_size integer default 0) returns timeseries |
Creates timeseries from string, for example '{1,2,3,4,5}'. Type of timeseries is specified by elem_type parameter. For timeseries of characters it is also necessary to specify size of timeseries element - elem_size .
Please notice that PostgreSQL allows implicit cast from string to the target type using this type input function, but in this case information about timeseries element type and size should be encoded in the string: 'int4:{1,2,3,4,5}' . |
function cs_const(val float8, elem_type cs_elem_type) returns timeseries |
Creates timeseries of numeric (integer or floating point) elements. Type of timeseries is specified by elem_type parameter.
Should be one of: 'char', 'int2', 'int4', 'int8', 'float4', 'float8' . |
function cs_const(val timestamp, elem_type cs_elem_type) returns timeseries |
Creates timeseries of date/time elements. Type of timeseries is specified by elem_type parameter.
Should be one of: 'date', 'time', 'timestamp' . |
function cs_const(val text, elem_size integer) returns timeseries |
Creates timeseries of character type. Size of timeseries element is specified by elem_size parameter.
|
function cs_const(val text) returns timeseries |
Creates timeseries of character type. Size of timeseries element is equal to the length of val .
|
Binary operations
Binary operations with timeseries. These functions take two timeseries arguments and return result timeseries. IMCS tries to automatically adjust types of input arguments (for example if one timeseries has "int8" element type and another - "float8", then first one will be converted to "float8").
Function | Description |
---|---|
function cs_add(timeseries,timeseries) returns timeseries |
Adds elements of two timeseries |
function cs_sub(timeseries,timeseries) returns timeseries |
Subtracts elements of two timeseries |
function cs_mul(timeseries,timeseries) returns timeseries |
Multiplies elements of two timeseries |
function cs_div(timeseries,timeseries) returns timeseries |
Divides elements of two timeseries |
function cs_pow(timeseries,timeseries) returns timeseries |
Raises element of first timeseries to power specified by element of second timeseries |
function cs_and(timeseries,timeseries) returns timeseries |
Bitwise AND of elements of two integer or boolean timeseries |
function cs_or(timeseries,timeseries) returns timeseries |
Bitwise OR of elements of two integer or boolean timeseries |
function cs_xor(timeseries,timeseries) returns timeseries |
Bitwise XOR of elements of two integer or boolean timeseries |
function cs_eq(timeseries,timeseries) returns timeseries |
Checks if element of first timeseries is equal to element of second timeseries |
function cs_ne(timeseries,timeseries) returns timeseries |
Checks if element of first timeseries is not equal to element of second timeseries |
function cs_gt(timeseries,timeseries) returns timeseries |
Checks if element of first timeseries is greater than element of second timeseries |
function cs_ge(timeseries,timeseries) returns timeseries |
Checks if element of first timeseries is greater or equal than element of second timeseries |
function cs_lt(timeseries,timeseries) returns timeseries |
Checks if element of first timeseries is less than element of second timeseries |
function cs_le(timeseries,timeseries) returns timeseries |
Checks if element of first timeseries is less or equal than element of second timeseries |
function cs_maxof(timeseries,timeseries) returns timeseries |
Maximum of two elements |
function cs_minof(timeseries,timeseries) returns timeseries |
Minimum of two elements |
function cs_like(timeseries,pattern text) returns timeseries |
Finds elements of character timeseries matching specified pattern (case sensitive). Rules of matching are the same as for PostgreSQL LIKE predicate. |
function cs_ilike(timeseries,pattern text) returns timeseries |
Finds elements of character timeseries matching specified pattern (ignore case). Rules of matching are the same as for PostgreSQL ILIKE predicate. |
function cs_cat(timeseries,timeseries) returns timeseries |
Concatenates elements of two timeseries. Input timeseries can have any element type, result is always timeseries of characters which element size
is equal to sum of element sizes of concatenated timeseries. For example cs_cat('bpchar1:{a,b,c}', 'bpchar1:{x,y,z}') = 'bpchar2:{ax,by,cz}' .
In case of concatenation of character strings which actual length is smaller than fixed element size, result will contains filler character ('\0').
So if element size of concatenated timeseries in the above example is 3, then result will be E'{a\\000\\000x\\000\\000,b\\000\\000y\\000\000,c\\000\\000z\\000\000}' . If you prefer to get '{ax,by,cz}' , then please use cs_add instead of cs_cat .
Function cs_cat is intended to be used for concatenation of group-by keys (character or numeric) for aggregation.
|
function cs_concat(head timeseries,tail timeseries) returns timeseries |
Concatenates two timeseries. Result of this function is timeseries containing elements both of head and tail timeseries.
For example cs_concat('int4:{1,2,3}','int4:{4,5,6}') = '{int4:1,2,3,4,5,6}' .
Parameters head or tail may be null. In this case cs_concat returns just not-null timeseries.
|
Unary operations
Unary operations with timeseries. These functions take single timeseries and return timeseries as result.
Function | Description |
---|---|
function cs_neg(timeseries) returns timeseries |
Negates timeseries elements |
function cs_not(timeseries) returns timeseries |
Logical NOT of boolean timeseries elements |
function cs_bit_not(timeseries) returns timeseries |
Bitwise NOT of integer timeseries elements |
function cs_abs(timeseries) returns timeseries |
Absolute value of timeseries element |
function cs_norm(timeseries) returns timeseries |
Normalizes timeseries elements (divides each element by square root of sum of all elements) |
Mathematical functions
Calculation of mathematical functions for all timeseries elements.
Function | Description |
---|---|
function cs_sin(timeseries) returns timeseries |
Sine function |
function cs_cos(timeseries) returns timeseries |
Cosine function |
function cs_tan(timeseries) returns timeseries |
Tangent function |
function cs_exp(timeseries) returns timeseries |
Exponent function |
function cs_asin(timeseries) returns timeseries |
Arcsine function |
function cs_acos(timeseries) returns timeseries |
Arccosine function |
function cs_atan(timeseries) returns timeseries |
Arctangent function |
function cs_sqrt(timeseries) returns timeseries |
Square root function |
function cs_log(timeseries) returns timeseries |
Natural logarithm function |
function cs_ceil(timeseries) returns timeseries |
Rounds timeseries element to the smallest integer greater or equal than the element value |
function cs_floor(timeseries) returns timeseries |
Rounds timeseries element to the largest integer less or equal than the element value |
function cs_isnan(timeseries) returns timeseries |
Checks if floating point timeseries element is NaN |
Date/time functions
Extracts components of date/time type. These functions are mostly needed in group-by operations to calculate aggregates for various intervals (days, weeks, months, quarters, years...).
Function | Description |
---|---|
function cs_year(timeseries) returns timeseries |
Extracts year from date/timestamp |
function cs_month(timeseries) returns timeseries |
Extracts month (1..12) from date/timestamp |
function cs_mday(timeseries) returns timeseries |
Extracts month day (1..31) from date/timestamp |
function cs_wday(timeseries) returns timeseries |
Extracts week day (0..6 starting from Sunday) from date/timestamp |
function cs_week(timeseries) returns timeseries |
Extracts week number since start of epoch from date/timestamp |
function cs_quarter(timeseries) returns timeseries |
Extracts quarter (1..4) from date/timestamp |
function cs_hour(timeseries) returns timeseries |
Extracts hour (0..23) from time/timestamp |
function cs_minute(timeseries) returns timeseries |
Extracts minute (0..59) from time/timestamp |
function cs_second(timeseries) returns timeseries |
Extracts second (0..59) from time/timestamp |
Binary scalar functions
Functions of this group take two timeseries arguments and calculate single scalar value as result. IMCS tries to automatically adjust types of input arguments (for example if one timeseries has "int8" element type and another - "float8", then first one will be converted to "float8").
Execution of these functions can be parallelized.
Function | Description |
---|---|
function cs_wsum(timeseries,timeseries) returns float8 |
Weighted sum of timeseries elements |
function cs_wavg(a timeseries,b timeseries) returns float8 |
Weighted average of timeseries elements: sum(a*b)/sum(a) |
function cs_corr(a timeseries,b timeseries) returns float8 |
Correlation of two timeseries |
function cs_cov(a timeseries,b timeseries) returns float8 |
Covariation of two timeseries |
Timeseries transformation functions
Functions performing various transformations of input timeseries.
Function | Description |
---|---|
function cs_cast(input timeseries, elem_type cs_elem_type) returns timeseries |
Casts timeseries elements to the specified type defined in cs_elem_type enum: ('char', 'int2', 'int4', 'date', 'int8', 'time', 'timestamp', 'float4', 'float8', 'bpchar'). Explicit casts are rarely needed, in most cases IMCS performs implicit type conversion. |
function cs_to_TYPE_array(timeseries) returns TYPE[] |
Converts timeseries to array. TYPE should be one of "char", int2, int4, date, int8, time, timestamp, float4, float8, bpchar and should match element type of the converted timeseries. Please notice that array is constructed in memory and large timeseries can cause memory overflow. |
function cs_from_array(anyarray, elem_size integer default 0) returns timeseries |
Converts array to timeseries. This function creates timeseries iterator for the input array, allowing to apply to it any timeseries functions. Type of the result timeseries element is the same as type of the array element.
Optional elem_size parameter is needed only for text array, it should specify maximal size of array element. |
function cs_thin(timeseries, origin integer, step integer) returns timeseries |
Leaves only each step -th element of timeseries starting from origin . |
function cs_limit(timeseries, from_pos bigint default 0, till_pos bigint default 9223372036854775807) returns timeseries |
Extracts subsequence from timeseries. Parameter from_pos specifies start position of subsequence (inclusive) and parameter till_pos specifies end position (inclusive). If till_pos parameter is missed, then subsequence spans till end of timeseries. Values of both from_pos and till_pos parameters can be negative. In this case position is calculated from end of timeseries, i.e. cs_limit(s, from_pos:=-1) extracts last element of the timeseries. |
function cs_head(timeseries, n bigint) returns timeseries |
Extracts n first elements of timeseries. This function is equivalent to cs_limit(0, n-1) . |
function cs_tail(timeseries, n bigint) returns timeseries |
Extracts n last elements of timeseries. This function is equivalent to cs_limit(-n) . |
function cs_cut_head(timeseries, n bigint) returns timeseries |
Extracts all except first n elements of timeseries. This function is equivalent to cs_limit(n) . |
function cs_cut_tail(timeseries, n bigint) returns timeseries |
Extracts all except last n elements of timeseries. This function is equivalent to cs_limit(0,-n-1) . |
function cs_call(input timeseries, func oid) returns timeseries |
Calls specified function for all elements of input timeseries.
To specify function cast function name to regproc or function prototype (name and argument types) to recprocedure :
select cs_call(Close, 'sin'::regproc) from Quote_get('IBM'); select cs_call(Close, 'sin(float)'::regprocedure) from Quote_get('IBM');Please notice that sin is taken here only as example. There is special cs_sin in IMCS API.
But you can specify here name of any function, including plpgsql function:
create function mul2(x real) returns real as $$ begin return x*2; end; $$ language plpgsql strict immutable; select cs_call(Close, 'mult2'::regproc) from Quote_get('IBM'); |
function cs_union(left timeseries, right timeseries) |
Unions two sorted timeseries (usually timestamps). For example cs_union('int8:{1,5,7,8}', 'int8:{2,3,5,6}') = 'int8:{1,2,3,5,5,6,7,8}' |
function cs_iif(cond timeseries, then_ts timeseries, else_ts timeseries) returns timeseries |
Chooses one of two alternatives: if element of cond boolean timeseries is true, then use element of then_ts timeseries, otherwise use element of else_ts timeseries. All timeseries are traversed with the same speed: if we take element from then_ts timeseries, then corresponding element of else_ts timeseries is skipped.
For example cs_iif('char:{1,0,1}', 'float4:{1.0,2.0,3.0}', 'float4:{0.1,0.2,0.3}') = 'float4:{1.0,0.2,3.0}' |
function cs_if(cond timeseries, then_ts timeseries, else_ts timeseries) returns timeseries |
Conditional computation: if element of cond boolean timeseries is true, then take next element of then_ts timeseries, otherwise use element of else_ts timeseries. Unlike cs_iff then/else timeseries are accessed only on demand, so number of elements fetched from this timeseries depends on condition.
For example cs_if('char:{1,0,1}', 'float4:{1.0,2.0,3.0}', 'float4:{0.1,0.2,0.3}') = 'float4:{1.0,0.1,2.0}' |
function cs_filter(cond timeseries, input timeseries) returns timeseries |
Leaves only those elements from timeseries input for which condition cond is true.
For example cs_filter('char:{1,0,1}', 'float4:{1.0,2.0,3.0}') = 'float4:{1.0,3.0}' |
function cs_filter_pos(cond timeseries) returns timeseries |
Returns positions of timeseries elements for which condition cond is true.
For example cs_filter_pos('char:{1,0,1}') = 'int8:{0,2}' |
function cs_unique(timeseries) returns timeseries |
Removes subsequent duplicate values. To eliminate all duplicates in timeseries it should be sorted prior applying cs_unique . For example cs_unique('int4:{1,1,2,2,2,1,3}') = 'int4:{1,2,1,3}' |
function cs_reverse(timeseries) returns timeseries |
Reverses order of timeseries elements .For example cs_reverse('int4:{1,2,3}') = 'int4:{3,2,1}' |
function cs_diff(input timeseries) returns timeseries |
Calculates difference between pairs of subsequent timeseries elements: result[0] = input[0], result[i] = input[i] - input[i-1] . For example cs_diff('int8:{1,3,2,5}) = 'int8:{1,2,-1,3} |
function cs_diff0(input timeseries) returns timeseries |
Calculates difference between pairs of subsequent timeseries elements: result[0] = 0, result[i] = input[i] - input[i-1] . For example cs_diff0('int8:{1,3,2,5}) = 'int8:{0,2,-1,3} |
function cs_project(anyelement, positions timeseries default null, disable_caching book default false) returns setof record |
Transforms vertical representation (all timeseries elements or just elements on specified positions) to horizontal representation. This is more generic version of TABLE_project which can be applied to arbitrary set of columns.
But as far as result row is anonymous, it is not possible to unnest it using PostgreSQL ().* clause.
In PostgreSQL 9.3 it is possible to use cs_project in FROM list (lateral join) providing alias with
description of returned columns.
Concerning optional disable_caching parameter please read section Projection issues.
|
function cs_project_agg(anyelement, positions timeseries default null, disable_caching book default false) returns setof cs_agg_result |
This is specialized version of cs_project for transposing result of hash_agg_* functions.
They return two timeseries: the first one with values of aggregate and the second one with values of group-by key.
cs_project_agg transforms this result to set of cs_agg_result rows, consisting of two columns: (agg_val float8, group_by bytea) . In case of combining several keys for group-by key, it can be splitted back into separate values using cs_cut or cs_as functions.
Concerning optional disable_caching parameter please read section Projection issues.
|
function cs_map(input timeseries, positions timeseries) returns timeseries |
Extracts from first timeseries elements with positions specified in the second timeseries.
Example of cs_map usage: cs_map('float8:{3.14,0.1,-10}', 'int8:{1,2}')='float8:{-0.1,10}' |
Grand aggregates
Functions calculating grand aggregates: aggregation of all timeseries elements.
Execution of these functions can be parallelized (except cs_median
).
Function | Description |
---|---|
function cs_count(timeseries) returns bigint |
Counts number of timeseries elements. |
function cs_empty(timeseries) returns bool |
Checks if timeseries contains no elements. This function is usually more efficient than cs_count() %lt;%gt; 0 except cases when filter is applied to large timeseries and relatively small number of elements fits filter condition (unlike cs_empty , cs_count can be executed in parallel) |
function cs_approxdc(timeseries) returns bigint |
Approximates number of different timeseries elements. |
function cs_max(timeseries) returns float8 |
Maximal value of timeseries elements. |
function cs_min(timeseries) returns float8 |
Minimum value of timeseries elements |
function cs_avg(timeseries) returns float8 |
Average value of timeseries elements. |
function cs_sum(timeseries) returns float8 |
Sum of timeseries elements. |
function cs_prd(timeseries) returns float8 |
Product of timeseries elements. |
function cs_var(timeseries) returns float8 |
Variation of timeseries elements. |
function cs_dev(timeseries) returns float8 |
Standard deviation of timeseries elements. |
function cs_median(timeseries) returns float8 |
Median element of timeseries. |
Group-by aggregates
Functions calculating aggregates for each group.
Groups are identified by sequence of elements with the same value in group-by
timeseries.
It is not mandatory to sort this timeseries. But you should realize that sequences of the same value in different parts of the timeseries will form different groups. For example, there are four groups in timeseries '{1, 1, 2, 1, 1, 1, 2, 2,}': ('{1, 1}', '{2}', '{1, 1, 1}', '{2, 2}')
. If you want to perform aggregation for all timeseries element with the same value, then use cs_hash_*
functions instead.
Function | Description |
---|---|
function cs_group_count(timeseries) returns timeseries |
Returns number of elements in each group (sequences of repeated values) |
function cs_group_apporaxdc(input timeseries, group_by timeseries) returns timeseries |
Approximates number of distinct values in each group. group_by timeseries identifies groups: sequences of repeated values. |
function cs_group_max(input timeseries, group_by timeseries) returns timeseries |
Maximal value of each group. group_by timeseries identifies groups: sequences of repeated values. |
function cs_group_min(input timeseries, group_by timeseries) returns timeseries |
Minimal value of each group. group_by timeseries identifies groups: sequences of repeated values. |
function cs_group_avg(input timeseries, group_by timeseries) returns timeseries |
Average value of each group. group_by timeseries identifies groups: sequences of repeated values. |
function cs_group_var(input timeseries, group_by timeseries) returns timeseries |
Variation of each group. group_by timeseries identifies groups: sequences of repeated values. |
function cs_group_dev(input timeseries, group_by timeseries) returns timeseries |
Standard deviation of each group. group_by timeseries identifies groups: sequences of repeated values. |
function cs_group_first(input timeseries, group_by timeseries) returns timeseries |
First element of each group. group_by timeseries identifies groups: sequences of repeated values. |
function cs_group_last(input timeseries, group_by timeseries) returns timeseries |
Last element of each group. group_by timeseries identifies groups: sequences of repeated values. |
Grid aggregates
Splitting timeseries into intervals intervaland calculating aggregate for each interval (grid cell).
Function | Description |
---|---|
function cs_grid_max(input timeseries, step integer) returns timeseries |
Maximal value of each interval. Parameter step specifies size of interval. |
function cs_grid_min(input timeseries, step integer) returns timeseries |
Minimal value of each interval. Parameter step specifies size of interval. |
function cs_grid_avg(input timeseries, step integer) returns timeseries |
Average value of each interval. Parameter step specifies size of interval. |
function cs_grid_sum(input timeseries, step integer) returns timeseries |
Sum of each interval. Parameter step specifies size of interval. |
function cs_grid_var(input timeseries, step integer) returns timeseries |
Variation of each interval. Parameter step specifies size of interval. |
function cs_grid_dev(input timeseries, step integer) returns timeseries |
Standard deviation of each interval. Parameter step specifies size of interval. |
Window (moving) aggregates
Aggregation is done for window - N subsequent elements of timeseries where N is window size. At each step window is moved at one position forward. So result timeseries has the same number of elements as input timeseries. First N-1 elements of result are calculated for windows smaller than N. You can use cs_limit(cs_window_AGG(input, N), N-1)
to skip these elements.
Function | Description |
---|---|
function cs_window_max(input timeseries, window_size integer) returns timeseries |
Maximal value of each window. Parameter window_size specifies size of window. |
function cs_window_min(input timeseries, window_size integer) returns timeseries |
Minimal value of each window. Parameter window_size specifies size of window. |
function cs_window_avg(input timeseries, window_size integer) returns timeseries |
Average value of each window. Parameter window_size specifies size of window. |
function cs_window_sum(input timeseries, window_size integer) returns timeseries |
Sum of each window. Parameter window_size specifies size of window. |
function cs_window_var(input timeseries, window_size integer) returns timeseries |
Variation of each window. Parameter window_size specifies size of window. |
function cs_window_dev(input timeseries, window_size integer) returns timeseries |
Standard deviation of each window. Parameter window_size specifies size of window. |
Hash aggregates (group-by using hash function)
Aggregation with group-by. These function perform grouping and aggregation similar to SQL. All elements of group_by
sequence with the same value forms single group. It is done using hash function, so cs_hash_*
aggregates require additional memory for building hash table. These functions contain two out parameters: return two timeseries.
The first one contains calculated aggregates. The second one contains corresponding group keys.
If it is necessary to perform grouping by more than one key, it is possible to use cs_cat
(or ||
SQL operator) to concatenate several columns. Later it is possible to use cs_cut
or cs_as
functions to split concatenated value back into components.
Execution of these functions can be parallelized.
Function | Description |
---|---|
function cs_hash_count(group_by timeseries, out count timeseries, out groups timeseries) returns record |
Counts number of elements having the same value. This function has two output parameters: count timeseries contains counters and groups timeseries contains corresponding group key values. So result of cs_hash_count('float4:{1,3,1,4,2,2}') will be ('int8:{2,2,1,1}', 'float4:{1,2,3,4}') . |
function cs_hash_dup_count(input timeseries, group_by timeseries, out count timeseries, out groups timeseries, min_occurrences integer default 1) returns record |
Counts number of duplicates for each group. Groups are identified by group_by timeseries.
This function has two output parameters: count timeseries contains counters and groups timeseries contains corresponding group key values. Parameter min_occurrences specifies minimal number of occurrences of element in each group. With default value 1 of min_occurrences this function calculates number of distinct values. |
function cs_hash_max(input timeseries, group_by timeseries, out max timeseries, out groups timeseries) returns record |
Calculates maximal value for each group. Groups are identified by group_by timeseries.
This function has two output parameters: max timeseries contains calculated maximums and groups timeseries contains corresponding group key values. |
function cs_hash_min(input timeseries, group_by timeseries, out max timeseries, out groups timeseries) returns record |
Calculates minimal value for each group. Groups are identified by group_by timeseries.
This function has two output parameters: min timeseries contains calculated minimums and groups timeseries contains corresponding group key values. |
function cs_hash_avg(input timeseries, group_by timeseries, out avg timeseries, out groups timeseries) returns record |
Calculates average value for each group. Groups are identified by group_by timeseries.
This function has two output parameters: avg timeseries contains calculated averages and groups timeseries contains corresponding group key values. |
function cs_hash_sum(input timeseries, group_by timeseries, out avg timeseries, out groups timeseries) returns record |
Calculates sum of elements for each group. Groups are identified by group_by timeseries.
This function has two output parameters: sum timeseries contains calculated sums and groups timeseries contains corresponding group key values. |
Cumulative aggregates
Aggregates calculated for all preceding elements of timeseries. Result timeseries has the same number of elements as input timeseries.
Function | Description |
---|---|
function cs_cum_max(timeseries) returns float8 |
Cumulative maximal value of timeseries elements |
function cs_cum_min(timeseries) returns timeseries |
Cumulative minimum value of timeseries elements |
function cs_cum_avg(timeseries) returns timeseries |
Cumulative average value of timeseries elements |
function cs_cum_sum(timeseries) returns timeseries |
Cumulative sum of timeseries elements |
function cs_prd(timeseries) returns timeseries |
Cumulative product of timeseries elements |
function cs_var(timeseries) returns timeseries |
Cumulative variation of timeseries elements |
function cs_dev(timeseries) returns timeseries |
Cumulative standard deviation of timeseries elements |
Sort functions
Top functions find out top-N values of timeseries. N can not be larger than imcs.tile_size
(default value 128).
Execution of cs_top_*
functions can be parallelized. Please notice that calculation of TOP-n
is much faster than full sort.
Function | Description |
---|---|
function cs_top_max(timeseries, top integer) returns timeseries |
Returns top maximal elements in timeseries in descent order.
For example cs_top_max('float4:{1.1,0.1,2.2,0.2}', 2)='float4:{2.2,1.1}' |
function cs_top_min(timeseries, top integer) returns timeseries |
Returns top minimum elements in timeseries in ascent order.
For example cs_top_min('float4:{1.1,0.1,2.2,0.2}', 2)='float4:{0.1,0.2}' |
function cs_top_max_pos(timeseries, top integer) returns timeseries |
Returns positions of top maximal elements in timeseries in descent order.
For example cs_top_max_pos'float4:{1.1,0.1,2.2,0.2}', 2)='int8:{2,0}' |
function cs_top_min_pos(timeseries, top integer) returns timeseries |
Returns positions of top minimum elements in timeseries in ascent order.
For example cs_top_min_pos('float4:{1.1,0.1,2.2,0.2}', 2)='int8:{1,3}' |
function cs_sort(timeseries, sort_order cs_sort_order default 'asc') returns timeseries |
Sorts specified timeseries of scalar element type.
For example cs_sort('float4:{1.1,0.1,2.2,0.2}')='float4:{0.1,0.2,1.1,2.2}' |
function cs_sort_pos(timeseries, sort_order cs_sort_order default 'asc') returns timeseries |
Returns positions of timeseries scalar elements in specified order.
For example cs_sort_pos('float4:{1.1,0.1,2.2,0.2}')='int8:{1,3,0,2}' |
function cs_rank(timeseries, sort_order cs_sort_order default 'asc') returns timeseries |
Returns rank of scalar timeseries elements.
For example cs_rank('float4:{1.1,0.1,2.2,0.2,0.1}')='int8:{4,1,5,3,1}' |
function cs_dense_rank(timeseries, sort_order cs_sort_order default 'asc') returns timeseries |
Returns dense rank of scalar timeseries elements.
For example cs_rank('float4:{1.1,0.1,2.2,0.2,0.1}')='int8:{3,1,4,2,1}' |
function cs_quantile(timeseries, q_num integer) returns timeseries |
Calculates q-quantiles for timeseries of scalar element type.
The quantiles are the data values marking the boundaries between consecutive subsets.
This functions returns timeseries with q_num+1 values of the same type as input timeseries
For example cs_quantile('float4:{10,3,0,3,4,5,9,11,7,3,3}', 2)='float4:{0,4,11}' |
Special functions
This group contains functions performing quite complex precessing of timeseries.
Function | Description | ||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
function cs_histogram(input timeseries, min float8, max float8, n_intervals integer) returns timeseries |
Builds histogram for the input timeseries. Minimal (inclusive) and maximal (exclusive) values for input timeseries should be specified as well as number of interval (histogram columns). Values outside specified range [min_value,max_value) are ignored. Number of intervals should not be greater than tile size. Execution of these functions can be parallelized. |
||||||||||||||||||||||
function cs_cross(input timeseries, first_cross_direction integer) returns timeseries |
Finds positions in input timeseries where it crosses zero, if first_cross_direction is positive then starts with first cross over, if negative then starts with cross below, if zero it doesn't matter (first cross can be over or below)
For example cs_cross('float4:{1,2,-1,0.5,0.6,0.0,0.1,0.3,-5}', 0)='int8:{2,3,7}' |
||||||||||||||||||||||
function cs_extrema(input timeseries, first_extremum integer) returns timeseries |
Finds positions of extrema (local minimum and maximums) in input timeseries, if first_extremum is positive then starts with first local maximum, if negative starts with local minimum, if zero it doesn't matter.
For example cs_extrema('float4:{1,2,3,2,1,0,0,1,1,2,4,0}', 0)='int8:{2,6,10}' |
||||||||||||||||||||||
function cs_stretch(ts1 timeseries, ts2 timeseries, vals timeseries, filler float8 defaults 1.0) returns timeseries |
Stretches vals timeseries to the length of first timeseries. Repeats elements of vals timeseries while corresponding timestamp (timeseries ts2 ) is larger than timestamp from ts1 . For exampe cs_stretch('int4:{1,2,3,4,5}', 'int4:{2,4}', 'float4:{1.1,2.2}', 1.0) = 'float4:{1.1,2.2,2.2,1.0,1.0}' . This function can be used to calculate split adjusted price. We need to revert timeseries of splits, calculate cumulative product, stretch and multiply it on price. |
||||||||||||||||||||||
function cs_stretch0(ts1 timeseries, ts2 timeseries, vals timeseries, filler float8 defaults 0.0) returns timeseries |
Injects missed elements in vals timeseries (associated with ts2 ) so that corresponding timestamps of ts1 and ts2 are matched. For example cs_stretch0('int4:{1,2,3,5}', 'int4:{2,3,4}', 'float4:{1.1,1.2,1.3}', 0.0) = 'float4:{0.0,1.1,1.2,1.3,0.0}' . This function may be useful if we need to perform operations with trading data for different symbols
and this data can contains some holes (no trading for particular symbol for this date) |
||||||||||||||||||||||
function cs_asof_join(ts1 timeseries, ts2 timeseries, vals timeseries) returns timeseries |
Gets values from third timeseries corresponding to the timestamp from ts2 closest to the timestamp from ts1 . For example cs_asof_join('int4:{4,9}', 'int4:{1,3,6,10}', 'float4:{0.1,0.3,0.6,1.0}') = 'float4:{0.3,1.0}' . |
||||||||||||||||||||||
function cs_asof_join_pos(ts1 timeseries, ts2 timeseries) returns timeseries |
Gets positions of elements in sorted timeseries ts2 closest to the elements in sorted timeseriests1 . For example cs_asof_join_pos('int4:{4,9}', 'int4:{1,3,6,10}') = 'int8:{1,3}' . |
||||||||||||||||||||||
function cs_join(ts1 timeseries, ts2 timeseries, vals timeseries) returns timeseries |
Gets elements from vals timeseries which corresponds to elements of sorted timeseries ts2 matching elements of sorted timeseries ts1 . For example cs_join_pos('int4:{0,2,3,8,10}', 'int4:{1,3,6,10}', 'float4:{0.1,0.3,0.6,1.0}') = 'int8:{0.3,1.0}' . |
||||||||||||||||||||||
function cs_join_pos(ts1 timeseries, ts2 timeseries) returns timeseries |
Gets positions of elements in sorted timeseries ts2 matching elements in sorted timeseriests1 . For example cs_join_pos('int4:{0,2,3,8,10}', 'int4:{1,3,6,10}') = 'int8:{1,3}' . |
||||||||||||||||||||||
function cs_cut(str bytea, format cstring) returns record |
Splits binary string into components. This function is reverse to cs_cat which may be needed to construct
combined group-by key for aggregate functions. format string describes types of component. Type is specified by one letter followed by field length. Below is list of supported types:
'i4f4c10' corresponds to a row with one integer, one float and one character component with length 10. |
||||||||||||||||||||||
function cs_as(str bytea, type_name cstring) returns record |
Yet another function splitting binary string into components. This function is reverse to cs_cat which may be needed to construct combined group-by key for aggregate functions. Parameter type_name specifies composite type which components will be fetched from the input string. Below is example of using this function:
create type char16 as (body char(16)); select agg_val,cs_as(group_by,'char16') from (select (cs_project_agg(cs_hash_sum(volenquired,fxvenue))).* from DbItem_get()) agg;In this example group-by consists of just one key of char(16) type.
It is also possible to print it without cs_as :
select agg_val,encode(btrim(group_by, E'\\000'::bytea), 'escape') from (select (cs_project_agg(cs_hash_sum(volenquired,fxvenue))).* from DbItem_get()) agg;Please notice that cs_as functions returns type record and PostgreSQL doesn't allow to convert it to composite type.
So you can not write:
select agg_val,cs_as(group_by,'char16')::char16 from (select (cs_project_agg(cs_hash_sum(volenquired,fxvenue))).* from DbItem_get()) agg;And as far as PostgreSQL has no information about columns, you can not use (cs_as(...)).* clause to extract columns of the row.
But you can create function returning proper type and bind it to cs_as C implementation:
create function to_char16(body bytea, type_name cstring default 'char16') returns char16 as '$libdir/imcs', 'cs_as' language C stable strict; select agg_val,(to_char16(group_by)).* from (select (cs_project_agg(cs_hash_sum(volenquired,fxvenue))).* from DbItem_get()) agg;But please be careful: nobody will check that this function really returns declared type. |
Operators
IMCS provides standard SQL operators for timeseries
type plus some specific operators for timeseries processing.
The following table contains mapping between operators and corresponding timeseries functions:
Operator | IMCS function | Description |
---|---|---|
a + b | cs_add | Adds elements of two timeseries |
a - b | cs_sub | Subtracts elements of two timeseries |
a * b | cs_mul | Multiplies elements of two timeseries |
a / b | cs_div | Divides elements of two timeseries |
a % b | cs_mod | Divides by modulo elements of two timeseries |
a ^ b | cs_pow | Raises to power |
a ~ b | cs_corr | Correlation of two timeseries |
a ? b | cs_filter | Filters elements of right timeseries according to condition specified by left timeseries |
a ? | cs_filter_pos | Returns position of elements for which condition is true |
a +* b | cs_wsum | Weighted sum of two timeseries |
a // b | cs_wavg | Weighted average of two timeseries, for example Volume//Price is volume weighted average price (VWAP) |
a & b | cs_and | Bitwise AND (can be also used for boolean timeseries) |
a | b | cs_or | Bitwise OR (can be also used for boolean timeseries) |
a # b | cs_xor | Bitwise XOR (can be also used for boolean timeseries) |
a || b | cs_cat | Concatenates elements of two timeseries |
a = b | cs_eq | Checks if element if left timeseries is equal to element of right timeseries |
str ~~ pattern | cs_ilike | Finds elements of character timeseries matching specified pattern |
a <> b | cs_ne | Checks if element if left timeseries is not equal to element of right timeseries |
a < b | cs_lt | Checks if element if left timeseries is less than element of right timeseries |
a <= b | cs_le | Checks if element if left timeseries is less than or equal to element of right timeseries |
a > b | cs_gt | Checks if element if left timeseries is greater than element of right timeseries |
a >= b | cs_ge | Checks if element if left timeseries is greater than or equal to element of right timeseries |
a -> b | cs_asof_join_pos | Finds positions of elements in second sorted timeseries closest to the elements of first sorted timeseries |
a <-> b | cs_join_pos | Finds positions of elements in second sorted timeseries equal to the elements of first sorted timeseries |
a << n | cs_cut_head | Skips first n elements of timeseries |
a >> n | cs_cut_tail | Skips last n elements of timeseries |
-a | cs_neg | Negates elements of timeseries |
!a | cs_not | Logical NOT |
~a | cs_bit_not | Bitwise NOT |
@a | cs_abs | Absolute values of timeseries elements |
Please notice that operators & | #
in PostgreSQL have precedence different from precedence of standard SQL AND OR
operators. Please always use parenthesis.
Binary operators accept not only timeseries OP timeseries operands. Also it is possible to pass as right parameter one of
- Numeric constant (integer or floating point):
select Price*2.0 from Quote_get()
- Date, time or timestamp:
select Day=date('11-Nov-2013') from Quote_get()
- String literal (1):
select Close*'{2.0,2.1,2.2}'::text from Quote_get('11-Nov-2013', '13-Nov-2013')
- String literal (2):
select Close*'float4:{2.0,2.1,2.2}' from Quote_get('11-Nov-2013', '13-Nov-2013')
In first two cases constant timeseries (timeseries containing the same value) is implicitly created for right operand using cs_const
function.
In third case timeseries is created from string literal using cs_parse
function.
And in the last case conversion to timeseries
type is implicitly made by PostgreSQL using input function of
this type.
Projection issues
There are several functions in IMCS API returning a row or a set of rows: cs_hash_*, cs_project*, cs_as
.
PostgreSQL provides two ways of decomposition of compound type into columns:
select (foo()).*
select * from foo()
cs_hash_sum
function has two output parameters: sum
and groups
.
Output parameters are actually returned in PostgreSQL as anonymous row. So if we write:
select (cs_project_agg(cs_hash_sum(Close,Day%7))).* from Quote_get('IBM');then PostgreSQL will call function
cs_hash_sum
twice. It means that aggregation will be performed twice:
we will have to do double amount of work. I failed to find a way to make PostgreSQL to avoid these redundant calls.
But this problem is solvable.
First of all it is possible to avoid (...).*
construction and access composite type attributes explicitly:
select (q.p).agg_val,(q.p).group_by from (select cs_project_agg(cs_hash_sum(Close,Day % 7)) p from Quote_get('IBM')) q;In this case PostgreSQL generally will not perform redundant calls. To guarantee that multiple evaluation won't be performed you can use the OFFSET 0 hack or abuse PostgreSQL's failure to optimise across CTE boundaries:
select (q.p).agg_val,(q.p).group_by from (select cs_project_agg(cs_hash_sum(Close,Day % 7)) p from Quote_get('IBM') offset 0) q;But IMCS also tries to provide workaround for
(cs_project(...)).*
construction: cs_project
and cs_project_agg
functions can cache their results, avoiding redundant calculations. Unfortunately there are some
restrictions. For example it is not possible to use cs_project
more than once in one query. You can disable such caching for the particular invocation by setting
disable_caching
optional parameter to false
. Or completely disable caching by setting imcs.project_caching
configuration parameter to true.
And concerning case 2) calling function in FROM list: it is possible if function doesn't depend on other data sources at the same query layer.
For example in IMCS cs_hash_sum
accepts timeseries arguments which are provided by Quote_get
.
So we can write:
select (p).* from Quote_get('IBM') q,cs_project_agg(cs_hash_sum(q.Close,q.Day%7)) p;But it works only in PostgreSQL 9.3 which supports lateral joins. A lateral join enables a subquery in the FROM part of a SELECT to reference columns from preceding items in the FROM list. Also function calls in PostgreSQL 9.3 can now directly reference columns from preceding FROM items, even without the LATERAL keyword. This is why query above correctly works with PostgreSQL 9.3 and higher (function is called only once) and generates
function expression in FROM cannot refer to other relations of same query level
error in previous PostgreSQL versions.
Also using projection function in FROM list allows to specify alias and describe columns:
select (s).* from Quote_get('IBM') q,cs_project(cs_hash_sum(q.Close,q.Day%7)) s(sum float8,group_by integer);Conclusion:
- Avoid using of
(cs_hash_AGG(...)).*
construction to eliminate redundant calculations. - Better use IMCS with PostgreSQL 9.3 and build queries using lateral joins.
Implementation details
Timeseries are stored in shared memory as B-Tree pages. This B-Tree provides fast access to timeseries element by position
(for all types) or by value (only for timestamp). There is separate B-Tree for each timeseries. PostgreSQL hash is used to locate
timeseries by identifier. Hash key includes name of the source table, name of the corresponding field and optionally identifier of timeseries. For example for Quote
table identifier of timeseries may be 'quote-close-IBM'
.
Size of B-Tree pages is determined by "imcs.page_size"
configuration parameter. Default value is 4kb.
IMCS uses RW (read-write) lock to synchronize access to columnar store. It means that multiple read-only queries can be performed concurrently,
but adding or removing timeseries elements is possible only in exclusive mode. Lock is set when timeseries is accessed first time. If imcs.serializable
configuration parameter is true (default), then lock is hold till the end of transaction. Such locking policy provides serializable isolation level for timeseries.
If imcs.serializable
is false, then lock is released at the end of query execution. It corresponds to "read committed" isolation level.
Right now IMCS doesn't provide any compression of timeseries elements. It can be done in the future. But IMCS is first of all oriented on financial data (trading systems). And here duplicates are not so often. Even if value of some stock option is quite durable (variation about few cents per year), small fluctuations of this option during a days normally occur. Our experiments show that RLE encoding cause only degrade of performance in case of standard queries on securities data.
Most of cs_*
functions are not actually performing any executions. Instead of it, they are constructing pipe of operators (or expression tree). A unit of exchanging data between operators is a tile (vector). So operators are performing vector operations to reduce interpretation cost. Size of a tile should be large enough to minimize overhead of organizing work of the pipe.
But it should fit in L1 CPU cache to keep processing speed high. Default size of the tile is 128.
IMCS is able to execute some operations in parallel. Now it is done for grand and hash aggregates, top-N functions (all operators where size of output is smaller than size of input). IMCS maintains pool of threads. Number of threads in the pool can be specified using "imcs.n_threads"
configuration parameter. By default (zero value of this parameters), number of threads is detected automatically based on number of CPUs (cores) in the system.
IMCS clones expression subtree and splits into segments timeseries accessed in the leave nodes of this tree (timeseries stored in columnar store). Number of segments corresponds to number of threads. So each thread is processing its own part of timeseries.
Then results are merged using operator-specific merge function. Merge requires synchronization, so only one thread can perform merge at each moment of time.
Please notice that PostgreSQL is not able to parallelize execution of SQL query. Certainly it is possible to manually split query into several subqueries and execute them concurrently. But it is not trivial and not convenient. The fact that IMCS can overcome this limitation is very important for OLAP queries.
Installation and tuning
As far as IMCS is using PostgreSQL shared memory, it should be loaded via shared_preload_libraries.
Please add '$libdir/imcs'
to shared_preload_libraries
in postgresql.conf
file:
shared_preload_libraries = '$libdir/imcs' # (change requires restart)Size of shared memory used by IMCS can be specified using
imcs.shmem_size
parameter.
At most systems maximal size of System V shared memory is limited by quite small constant. So you may also need to alter system
configuration (please refer to OS manual about how to do it). PostgreSQL 9.3 uses mmap
instead of System V shared memory,
so there should be no problem with system quotas. But there is yet another limitation in Linux: it is not able to create shared memory segment larger
than 256Gb with standard 4Kb pages. And now servers with 1Tb memory is not something very exotic. To utilize all available memory in this case it is possible to create multiple shared memory segments. But PostgreSQL is not able to do it. Another solution is to increase page size. Linux supports
huge pages. Unfortunately PostgreSQL still doesn't provide any way of using huge pages: you need to patch PostgreSQL source: add MAP_HUGETLB
to PG_MMAP_FLAGS
define in sysv_shmem.c:
#define PG_MMAP_FLAGS (MAP_HUGETLB|MAP_SHARED|MAP_ANONYMOUS|MAP_HASSEMAPHORE)IMCS distributive contains smarter patch
sysv_shmem.patch
for PostgreSQL 9.3.1 which sets MAP_HUGETLB flag only if size of shared memory segment is larger than 256Gb and only if MAP_HUGETLB is defined (since Linux 2.6.32).
Below is list of all IMCS configuration parameters:
Parameter name | Description | Default value | Recommendations |
---|---|---|---|
imcs.shmem_size | Size of shared memory (Mb) used by columnar store. | 8*1024 (8Gb) | Make it large enough to fit all data requiring vertical representation. It can not be increased without restart of the server. |
imcs.n_timeseries | Estimation for number of timeseries | 10000 | This value is needed for PostgreSQL hash implementation. Too small value may cause large number of collisions. |
imcs.n_threads | Number of threads in thread pool for concurrent execution of a query | 0 - autodetect number of CPUs | Usually number of threads should be equal to number of physical execution units in the system. Please notice that in case of using hyperthreading number of reported CPUs is twice large than real number of cores. Set this parameter to 1 to disable concurrent execution |
imcs.page_size | Size of B-Tree page size in bytes | 4096 | As far as B-Tree is stored in memory, it is not so critical to use large pages. But small page may increase per-element storage overhead. |
imcs.tile_size | Size of tile or vector that is used to organize vector operations | 128 | The larger tile is, the less influence of interpretation overhead. But best performance can be achieved only if tile fits in CPU L1 cache. Please notice that some operators have two or more parameters, so more than one tile can be calculated at each stage of operator's pipe processing. Also memory may be needed for other purposes, so to reduce probability of cache misses, keep this value reasonably small. |
imcs.substitute_nulls | Substitutes NULLs with 0 while loading data in columnar store | false | By default attempt to insert NULL value will cause an error. When value of this option is set to true , IMCS doesn't report an error and stores zero instead of NULL. |
imcs.autoload | Automatically loads data in columnar store when it is accessed first time by any query | true | Loading data from large table can take substantial amount of time and so increase execution time of the query initiated this load. It can confuse an user which expects this query to complete very fast. In such case explicit load of data after server restart can be more desirable (it can be completed before receiving any user's query). |
imcs.serializable | Hold lock till the end of transaction | true | Such locking policy provides serializable isolation level for columnar store. If this parameter is set to false, then lock is released at the end of query execution. It corresponds to "read committed" isolation level. |
imcs.trace | Trace IMCS commands | false | Sends information about executed IMCS command to client and PostgreSQL server log (NOTICE log level). |
imcs.output_string_limit | Limit for length of timeseries string representation | 1024 | Trying to print result of query returning larger timeseries can cause memory overflow or at least produce a lot of screens of hardly readable text. Setting this limit allows to restrict size of printed timeseries: only part of timeseries elements will be printed and then "..." indicates that timeseries was truncated. Setting this parameter to 0 disables this limitation. |
imcs.project_caching | Cache cs_project results to avoid redundant calculations in (cs_project(...)).* expression. | true | Caching can cause incorrect behavior in some cases: when cs_project is used twice in the same query. In this case disable it: everything should work correctly, may be only with some performance penalty in case of using (cs_project(...)).* construction. Also it is possible to disable caching for each particular cs_project invocation by assigning false to optional disable_caching parameter. Please read more in section Projection issues. |
So steps of using IMCS extension are the following:
- Change PostgreSQL configuration file
postgresql.conf
by adding IMCS to list of preloaded libraries and specifying maximal size of IMCS storage. - Install IMCS extension (read PostgreSQL documentation about installation of extensions).
- Create extension using
create extension imcs
command (you need to have superuser permissions for it). - Generate interface functions using
cs_create
function. - If data is already present in the database, load it in columnar store using
TABLE_load()
function. - You should either enable autoload (
imcs.autoload
configuration property), either manually callTABLE_load()
each time you restart the server.
Performance comparison
Consider the following definition of Quote
table:
create table Quote (Symbol char(10), Day date, Open real, High real, Low real, Close real, Volume integer);Let's populate it with NYSE data for ten years (about 6 million records).
\copy Quote from 'NYSE_2003-2013.csv' with csv header;This load is completed at my system in 2.5 minutes. Now lets create vertical representation of this table:
create extension imcs; select cs_create('Quote', 'Day', 'Symbol'); select Quote_load();Loading data in columnar store takes at my computer just 15 seconds. If we call
cs_create
prior to loading data in Quote table, then
time of importing data from CSV file will increase from 2.5 minutes to 6.5 minutes. It is
because of using trigger to propagate inserts to in-memory columnar store.
Now let's calculate volume-weighted price for IBM for the period from 2010 till 2013:
select cs_wavg(Volume, Close) as VWAP from Quote_get('IBM', date('01-Jan-2010'), date('01-Jan-2013'));Query execution time is 10 milliseconds. Now do the same thing with standard SQL:
select sum(Close*Volume)/sum(Volume) from Quote where Symbol='IBM' and Day between date('01-Jan-2010') and date('01-Jan-2013');It takes 750 milliseconds.
Now let calculate VWAP for all symbols. To simplify it we will first create table containing information about all symbols. Actually this data is usually available and contains much more information than just symbol name. But here we need just symbol name:
create table Securities (Symbol char(10)); insert into Securities select distinct Symbol from Quote; create view SecurityQuotes as select (Quote_get(Symbol)).* from Securities;Now we are ready to execute query:
select Symbol,cs_sum(Close*Volume) / cs_sum(Volume) as VWAP from SecurityQuotes;Time is about 500 milliseconds. Now do the same using standard SQL:
select Symbol,sum(Close*Volume)/sum(Volume) as VWAP from Quote group by Symbol;Result is returned after 2243 milliseconds.
Now let's test filter queries with projection back to horizontal representation. The following query finds all dates for 'ABB' symbol when close price was more than 1% large than open price for the particular quarter:
select (Quote_project(abb.*,cs_filter_pos(Close>Open*1.01))).* from Quote_get('ABB', '01-Jan-2010', '31-Mar-2010') abb;It returns 14 results in 12 milliseconds. Now do the same using SQL:
select * from Quote where Symbol='ABB' and Day between date('01-Jan-2010') and date('31-Mar-2010') and Close>Open*1.01;The same result in 640 milliseconds.
Actually timeseries in this example are not long enough: size of timeseries for each symbol is about 2608 elements. Let's now investigate use case with single large timeseries:
select Quote_drop(); select cs_create('Quote', 'Day'); select Quote_load();Load is completed twice faster than in case of Quote table: 7.5 seconds. Now let's execute VWAP for this timeseries:
select Volume//Close as VWAP from Quote_get();Query is completed at my system in 10 milliseconds. The same query using SQL:
select sum(Close*Volume)/sum(Volume) as VWAP from Quote;One second (one thousand milliseconds). So IMCS query is 100 times faster.
Now perform filter query for large timeseries:
select cs_count((Close>Open*1.1)?) from Quote_get(); select count(*) from Quote where Close>Open*1.1;Ratio of the query execution times is once again 100: 6.274 msec vs. 768.251 msec. Now consider real use case wth one timeseries and large enough records. There are about 10 million records with ~40 columns. Database size is about 5Gb. Queries perform groupping by various combinations of fields and cacluate aggregates for some characterestics. For example:
select trader,desk,office,sum(score*volenquired)/sum(volenquired) from DbItem group by trader,desk,office;Execution of this query takes 320 seconds. IMCS analog:
select agg_val,cs_cut(group_by,'c22c30c10') from (select (cs_project_agg(ss1.*)).* from (select (s1).sum/(s2).sum,(s1).groups from DbItem_get() q, cs_hash_sum(q.score*q.volenquired, q.trader||q.desk||q.office) s1, cs_hash_sum(q.volenquired, q.trader||q.desk||q.office) s2) ss1) ss2;takes ... 144 milliseconds. The ratio is more than 2 thousands times. But this is the result with default PostgreSQL parameters (only "shared_buffers" was increased to hold all database in memory). If we also increase "work_mem" from default 1Mb to 1Gb, then times of the query is reduced to 33 seconds for first execution and 7 seconds for subsequent executions. So summarizing these results: IMCS provides about 5-10 times increase of performance for relatively small timeseries (thousands elements) and 100 times faster speed for large timeseries (millions elements) on standard desktop with quad core processor. For SMP server with larger number of cores this ratio is expected to be even higher.
License
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the Software), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHOR OF THIS SOFTWARE BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Please send any feedbacks, complaints, bug reports, change requests to the author Konstantin Knizhnik. Latest version of this software can be obtained from the site http://www.garret.ru.