---
layout: default
title: colcompress
nav_order: 3
---

# colcompress AM
{: .no_toc }

Column-oriented compressed storage with vectorized execution, parallel scan, and MergeTree-like ordering.
{: .fs-6 .fw-300 }

## Table of contents
{: .text-delta }

1. TOC
{:toc}
## How It Works
Data is stored column by column on disk. Each column is split into stripes (default 150,000 rows each), and each stripe is divided into chunk groups (default 10,000 rows). Every chunk records the minimum and maximum value for zone-map pruning.
```text
Table file
├── Stripe 1 (rows 1 – 150,000)
│   ├── Chunk group 0 (rows 1 – 10,000)
│   │   ├── Column A [min, max, compressed values…]
│   │   ├── Column B [min, max, compressed values…]
│   │   └── …
│   └── Chunk group 1 (rows 10,001 – 20,000) …
└── Stripe 2 (rows 150,001 – 300,000) …
```
A scan only reads the columns referenced by the query, skipping all others. For a wide table where a query touches 3 of 20 columns, I/O is reduced to roughly 15% of a heap scan.
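The stripe/chunk layout above can be sketched in a few lines. This is an illustrative model only, not the real on-disk format (which also compresses each chunk): it splits one column's values into stripes and chunk groups, recording the per-chunk min/max that zone-map pruning later tests against WHERE predicates.

```python
STRIPE_ROWS = 150_000   # default stripe_row_limit
CHUNK_ROWS = 10_000     # default chunk_group_row_limit

def build_column_layout(values):
    stripes = []
    for s in range(0, len(values), STRIPE_ROWS):
        stripe = values[s:s + STRIPE_ROWS]
        chunks = [
            {"min": min(chunk), "max": max(chunk), "rows": len(chunk)}
            for chunk in (stripe[c:c + CHUNK_ROWS]
                          for c in range(0, len(stripe), CHUNK_ROWS))
        ]
        stripes.append(chunks)
    return stripes

layout = build_column_layout(list(range(200_000)))
# 200,000 rows → stripe 1 (15 chunk groups) + stripe 2 (5 chunk groups)
```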
## Vectorized GROUP BY Aggregation

storage_engine v2.0 introduces `StorageEngineVectorGroupAgg` — a custom aggregate executor node that transparently replaces HashAggregate and GroupAggregate for GROUP BY queries over colcompress tables. No SQL changes are required; the planner hook intercepts eligible plans and substitutes the vectorized path automatically.
```sql
-- These queries automatically use the vectorized path:
SELECT event_type, COUNT(*), SUM(amount), AVG(price)
FROM events
GROUP BY event_type;

SELECT country_code, MIN(score), MAX(score), COUNT(*)
FROM events
GROUP BY country_code
ORDER BY COUNT(*) DESC;
```
Supported aggregates: `COUNT(*)`, `COUNT(col)`, `SUM`, `MIN`, `MAX`, `AVG`

Supported types: `int2`, `int4`, `int8`, `float4`, `float8`, `numeric`, `money`, `engine.uint8`

Up to 4 GROUP BY keys are supported. Parallel partial mode runs inside parallel workers via `AGGSPLIT_INITIAL_SERIAL`, feeding the native finalize step.
```sql
-- Control vectorized GROUP BY
SET storage_engine.enable_vectorized_groupagg = on; -- default: on

-- Debug: log when a plan falls back to native aggregation
SET storage_engine.debug_vectorized_groupagg_fallback = on;
```
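The partial/finalize split mentioned above can be illustrated with a toy model: each parallel worker builds per-group partial states (count, sum) from its own stripes; the leader merges the partials and finalizes AVG. Function names here are illustrative, not the extension's internals.

```python
from collections import defaultdict

def partial_agg(keys, amounts):
    state = defaultdict(lambda: [0, 0.0])    # per group: [count, sum]
    for k, a in zip(keys, amounts):
        state[k][0] += 1
        state[k][1] += a
    return state

def merge_and_finalize(partials):
    merged = defaultdict(lambda: [0, 0.0])
    for part in partials:
        for k, (c, s) in part.items():
            merged[k][0] += c
            merged[k][1] += s
    # COUNT(*), SUM(amount), AVG(amount) per group
    return {k: (c, s, s / c) for k, (c, s) in merged.items()}

w1 = partial_agg(["buy", "click"], [9.0, 1.0])   # worker 1's stripes
w2 = partial_agg(["click"], [3.0])               # worker 2's stripes
totals = merge_and_finalize([w1, w2])
# totals["click"] == (2, 4.0, 2.0)
```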
## Vectorized Filter Evaluation
WHERE clauses are evaluated in column-oriented batches of up to 10,000 values per chunk, eliminating per-row interpreter overhead.
| Category | Supported Types |
|---|---|
| Comparisons (`=`, `<>`, `<`, `<=`, `>`, `>=`) | int2, int4, int8, float4, float8, date, timestamp, timestamptz, char, text, varchar, bool, oid |
| Aggregates (`count`, `sum`, `avg`, `min`, `max`) | int2, int4, int8, float8, numeric, date, money, engine.uint8 |

```sql
SET storage_engine.enable_vectorization = on; -- default: on
```
## Zone-Map Pruning (Two Layers)

colcompress implements two layers of min/max pruning using statistics stored per chunk in `engine.chunk`.
### Stripe-level (coarse)
Before reading any data, the scan aggregates min/max across all chunks of each stripe and tests against the query’s WHERE predicates. Any stripe whose range is provably disjoint is skipped entirely — no I/O, no decompression.
```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*) FROM events WHERE ts > '2025-01-01';
-- ...
-- Engine Stripes Removed by Pruning: 41
-- Engine Stripes Read: 12
```
### Chunk-level (fine)
Within each surviving stripe, individual chunk groups are tested. Those whose range cannot satisfy the predicate are skipped.
Pruning effectiveness scales with data sortedness. Use `orderby` + `colcompress_merge()` to maximize it.
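The two pruning layers can be modeled for a predicate `col > k`: a range whose max is at most `k` is provably disjoint from the predicate and is skipped without I/O. A sketch under that assumption:

```python
def prune(stripes, k):
    surviving = []
    for chunks in stripes:                         # chunks: list of (min, max)
        stripe_max = max(c[1] for c in chunks)
        if stripe_max <= k:                        # layer 1: skip whole stripe
            continue
        surviving.append([c for c in chunks if c[1] > k])  # layer 2: per chunk
    return surviving

stripes = [[(0, 99), (100, 199)],        # entirely below k → pruned
           [(200, 299), (300, 399)]]
kept = prune(stripes, 250)
# kept == [[(200, 299), (300, 399)]] — (200, 299) may still contain 251..299
```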
## MergeTree-Like Ordering
Inspired by ClickHouse’s MergeTree engine, colcompress supports a global sort key per table.
```sql
-- Set sort key
SELECT engine.alter_colcompress_table_set(
  'events'::regclass,
  orderby => 'ts ASC, user_id ASC'
);

-- After bulk load, compact into globally sorted stripes
SELECT engine.colcompress_merge('events');

-- Now date-range queries skip almost all stripes
SELECT COUNT(*), SUM(amount)
FROM events
WHERE ts BETWEEN '2024-01-01' AND '2024-01-31';
-- Stripes Removed by Pruning: 11
-- Stripes Read: 1
```
`colcompress_merge` internally:

1. Copies all live rows to a temporary heap table
2. Truncates the target table
3. Re-inserts rows in `orderby` order, writing fresh globally-ordered stripes
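The three steps above amount to a sort-and-rewrite. A toy version, with the stripe size shrunk to 4 rows for illustration:

```python
STRIPE_ROWS = 4   # tiny stripe for demonstration; real default is 150,000

def colcompress_merge(rows, key):
    live = sorted(rows, key=key)   # steps 1 + 3: copy live rows, re-insert sorted
    # re-chunk the globally ordered rows into fresh stripes (step 2 implied:
    # the old, unordered stripes are discarded)
    return [live[i:i + STRIPE_ROWS] for i in range(0, len(live), STRIPE_ROWS)]

rows = [(3, "c"), (1, "a"), (2, "b"), (5, "e"), (4, "d")]
stripes = colcompress_merge(rows, key=lambda r: r[0])
# stripes == [[(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')], [(5, 'e')]]
```

After the merge, each stripe's min/max ranges are disjoint along the sort key, which is exactly what makes zone-map pruning effective.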
## Parallel Scan
colcompress implements the full PostgreSQL parallel Table AM protocol using Dynamic Shared Memory (DSM). The coordinator pre-loads stripe IDs into DSM; each worker atomically claims the next stripe, decompresses it, and runs its own vectorized evaluation pipeline independently.
```sql
SET storage_engine.enable_parallel_execution = on; -- default: on
SET storage_engine.min_parallel_processes = 8;     -- minimum workers (default: 8)

-- Standard PostgreSQL parallel knobs also apply
SET max_parallel_workers_per_gather = 8;
```
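The claim protocol can be sketched with threads standing in for parallel workers and a shared counter standing in for the DSM slot index (in CPython, `next()` on `itertools.count` is atomic, which makes it a convenient stand-in):

```python
import itertools
import threading

stripe_ids = list(range(10))
next_slot = itertools.count()
claimed = {w: [] for w in range(3)}

def worker(wid):
    while True:
        i = next(next_slot)                  # atomic claim of the next stripe
        if i >= len(stripe_ids):
            return
        claimed[wid].append(stripe_ids[i])   # decompress + scan would go here

threads = [threading.Thread(target=worker, args=(w,)) for w in claimed]
for t in threads:
    t.start()
for t in threads:
    t.join()
# every stripe is claimed exactly once across all workers
```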
{: .warning }
Stripe-level pruning is a sequential-scan optimization. In parallel mode, each worker reads its assigned stripes independently without a global pruning pass. For date-range workloads that depend on pruning, run without parallelism or use a GIN / B-tree index on a non-colcompress table.
## Index-Backed Scan
An optional index scan path allows B-tree and other indexes to drive lookups into a colcompress table, decompressing only the matched rows.
```sql
-- Enable per-session (for point lookups on analytical tables)
SET storage_engine.enable_engine_index_scan = on;

-- Enable permanently for a specific table (document stores)
SELECT engine.alter_colcompress_table_set('documents'::regclass, index_scan => true);
```
| Workload | Recommendation |
|---|---|
| Analytics (GROUP BY, range scans, full scan) | Keep off — sequential + pruning wins |
| Document repository (XML, PDF, JSON blobs; fetched by PK) | Set on — columnar compression + point-lookup speed |
{: .important }
Do not create a B-tree index on the `orderby` column of an analytical table. The planner will prefer the index, bypassing stripe pruning. For occasional point lookups, use `SET storage_engine.enable_engine_index_scan = on` at session level instead.
## DELETE and UPDATE

colcompress fully supports DELETE and UPDATE via a row mask stored in `engine.row_mask`. Each deleted row is marked as a bit in a per-chunk-group bitmask; the scan engine skips masked rows without rewriting the stripe. UPDATE is implemented as delete-then-insert.
```sql
SET storage_engine.enable_dml = on; -- default: on

DELETE FROM events WHERE ts < now() - interval '1 year';
UPDATE events SET amount = amount * 1.1 WHERE event_type = 'purchase';
```
Deleted rows are reclaimed during VACUUM, which rewrites affected stripes and clears the row mask.
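The row mask is, conceptually, one bit per row in a chunk group; the scan consults the bitmask instead of rewriting compressed data. A minimal sketch of that structure:

```python
class RowMask:
    def __init__(self, nrows):
        self.bits = bytearray((nrows + 7) // 8)   # one bit per row

    def delete(self, row):
        self.bits[row // 8] |= 1 << (row % 8)

    def is_deleted(self, row):
        return bool(self.bits[row // 8] & (1 << (row % 8)))

mask = RowMask(10_000)        # one chunk group
mask.delete(42)               # DELETE marks a bit, stripe stays untouched
visible = [r for r in (41, 42, 43) if not mask.is_deleted(r)]
# visible == [41, 43]; an UPDATE would additionally insert the new row version
```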
## ON CONFLICT / Upserts
Standard INSERT … ON CONFLICT is fully supported:
```sql
INSERT INTO events (ts, user_id, event_type, amount)
VALUES (now(), 42, 'purchase', 99.90)
ON CONFLICT (user_id, event_type) DO UPDATE
SET amount = EXCLUDED.amount,
    ts = EXCLUDED.ts;
```
Requires a unique index on the conflict target column(s).
## Column Cache
The AM maintains an in-memory column cache that stores decompressed column chunks across executor iterations. Useful for nested loops, repeated plan nodes, or self-joins where the same stripe region is accessed multiple times.
```sql
SET storage_engine.enable_column_cache = on; -- default: off
SET storage_engine.column_cache_size = 200;  -- MB (default: 200)
```
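A cache like this can be modeled as a bounded LRU map keyed by (stripe, chunk group, column), so repeated accesses skip decompression. The key shape and eviction policy here are illustrative assumptions, not the extension's documented internals:

```python
from collections import OrderedDict

class ColumnCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key, decompress):
        if key in self.entries:
            self.entries.move_to_end(key)     # cache hit: refresh recency
            return self.entries[key]
        value = decompress()                  # miss: do the expensive work
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return value

cache = ColumnCache(capacity=2)
misses = []
load = lambda: misses.append(1) or [1, 2, 3]
cache.get(("stripe1", 0, "amount"), load)
cache.get(("stripe1", 0, "amount"), load)
# len(misses) == 1 — the second access was served from the cache
```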
## engine.uint8 — Unsigned 64-bit Integer
storage_engine ships a native unsigned 64-bit integer type designed for columns that carry values in the full [0, 2⁶⁴−1] range, such as ClickBench’s WatchID and UserID columns.
```sql
CREATE TABLE hits (
  WatchID   engine.uint8,
  UserID    engine.uint8,
  EventTime timestamptz
) USING colcompress;

SET search_path TO engine, public;
SELECT min(WatchID), max(WatchID), sum(WatchID) FROM hits;
```
- Storage: 8 bytes — identical layout to `bigint`, zero overhead
- Operators: full unsigned semantics (`18446744073709551615 > 1` is true)
- Btree + hash opclasses: ORDER BY, GROUP BY, DISTINCT, index scans
- Casts: `uint8 ↔ bigint`, `uint8 ↔ numeric`, `uint8 ↔ text`
- Aggregates (`min`, `max`, `sum`) in the `engine` schema — `sum` returns `numeric` to accommodate values > `INT64_MAX`
- Vectorized: `engine.vmin`, `engine.vmax`, `engine.vsum`, dispatched automatically by the planner
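The point of the dedicated type is easy to demonstrate: the same 8-byte pattern compares differently near the top of the range depending on whether it is read as signed int64 (`bigint`) or unsigned uint64. A quick check:

```python
import struct

def as_signed(u):
    # reinterpret a uint64 bit pattern as int64, the way bigint would see it
    return struct.unpack("<q", struct.pack("<Q", u))[0]

big = 18446744073709551615    # 2**64 - 1
print(big > 1)                # True — correct unsigned comparison
print(as_signed(big))         # -1   — the identical bits read as bigint
```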
## Per-Table Options
```sql
SELECT engine.alter_colcompress_table_set(
  'events'::regclass,
  stripe_row_limit      => 150000, -- rows per stripe
  chunk_group_row_limit => 10000,  -- rows per chunk group
  compression           => 'zstd',
  compression_level     => 9,
  orderby               => 'ts ASC, user_id ASC',
  index_scan            => false
);

-- Reset individual options to system defaults
SELECT engine.alter_colcompress_table_reset(
  'events'::regclass,
  compression       => true,
  compression_level => true
);

-- Inspect
SELECT * FROM engine.colcompress_options WHERE table_name = 'events';

-- Stripe-level view
SELECT * FROM engine.colcompress_stripes WHERE table_name = 'events' LIMIT 5;
```
## Compression Options
| Codec | Description |
|---|---|
| `pglz` | PostgreSQL built-in LZ (always available) |
| `lz4` | Fast (~500 MB/s decompress). Requires liblz4-dev |
| `zstd` ★ | Best ratio + good speed; levels 1–19. Recommended. Requires libzstd-dev |
| `deflate` | zlib-compatible; good middle ground. Requires libdeflate-dev |
| `zxc` | Asymmetric SIMD (NEON/AVX2/AVX-512); extremely fast decompress. Ideal for ARM Graviton |
| `none` | No compression |
## Known Limitations
- No AFTER ROW triggers or foreign keys — use `engine.colcompress_repack()` instead of `pg_repack`
- Stripe pruning is disabled in parallel mode — each worker reads its assigned stripes independently
- A B-tree index on the `orderby` column disables pruning — the planner prefers the index scan over the sequential scan
- `INSERT … SELECT` with pre-existing indexes may corrupt index TIDs — workaround: `REINDEX TABLE CONCURRENTLY` after bulk load