---
layout: default
title: colcompress
---

# colcompress AM
{: .no_toc }

Column-oriented compressed storage with vectorized execution, parallel scan, and MergeTree-like ordering.
{: .fs-6 .fw-300 }

## Table of contents
{: .no_toc .text-delta }

1. TOC
{:toc}


## How It Works

Data is stored column by column on disk. Each column is split into stripes (default 150,000 rows each), and each stripe is divided into chunk groups (default 10,000 rows). Within a chunk group, every column chunk records its minimum and maximum value for zone-map pruning.

```text
Table file
├── Stripe 1  (rows 1 – 150,000)
│   ├── Chunk group 0  (rows 1 – 10,000)
│   │   ├── Column A  [min, max, compressed values…]
│   │   ├── Column B  [min, max, compressed values…]
│   │   └── …
│   └── Chunk group 1  (rows 10,001 – 20,000)  …
└── Stripe 2  (rows 150,001 – 300,000)  …
```

A scan reads only the columns referenced by the query and skips all others. For a wide table where a query touches 3 of 20 columns, I/O drops to roughly 15% of what a full-row heap scan would read.
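
As a sketch (the `events` columns here are illustrative), a query that touches three columns of a wide table reads only those three column streams:

```sql
-- Only ts, user_id, and amount are read and decompressed;
-- every other column of the table is skipped entirely.
SELECT user_id, SUM(amount)
FROM events
WHERE ts >= '2025-01-01'
GROUP BY user_id;
```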


## Vectorized GROUP BY Aggregation

storage_engine v2.0 introduces StorageEngineVectorGroupAgg — a custom aggregate executor node that transparently replaces HashAggregate and GroupAggregate for GROUP BY queries over colcompress tables. No SQL changes are required; the planner hook intercepts eligible plans and substitutes the vectorized path automatically.

```sql
-- These queries automatically use the vectorized path:
SELECT event_type, COUNT(*), SUM(amount), AVG(price)
FROM events
GROUP BY event_type;

SELECT country_code, MIN(score), MAX(score), COUNT(*)
FROM events
GROUP BY country_code
ORDER BY COUNT(*) DESC;
```

Supported aggregates: COUNT(*), COUNT(col), SUM, MIN, MAX, AVG

Supported types: int2, int4, int8, float4, float8, numeric, money, engine.uint8

Up to 4 GROUP BY keys are supported. Parallel partial mode runs inside parallel workers via AGGSPLIT_INITIAL_SERIAL, feeding the native finalize step.

```sql
-- Control vectorized GROUP BY
SET storage_engine.enable_vectorized_groupagg = on;   -- default: on

-- Debug: log when a plan falls back to native aggregation
SET storage_engine.debug_vectorized_groupagg_fallback = on;
```
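
As an illustration (table `t` and its columns are hypothetical), the debug GUC makes fallbacks visible when a query exceeds the limits above:

```sql
SET storage_engine.debug_vectorized_groupagg_fallback = on;

-- Four keys or fewer: eligible for the vectorized executor.
SELECT country_code, event_type, COUNT(*)
FROM events
GROUP BY country_code, event_type;

-- Five grouping keys exceed the limit, so this falls back to the
-- native aggregate node and the reason is logged.
SELECT a, b, c, d, e, COUNT(*) FROM t GROUP BY a, b, c, d, e;
```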

## Vectorized Filter Evaluation

WHERE clauses are evaluated in column-oriented batches of up to 10,000 values per chunk, eliminating per-row interpreter overhead.

| Category | Supported Types |
|----------|-----------------|
| Comparisons (`=`, `<>`, `<`, `<=`, `>`, `>=`) | int2, int4, int8, float4, float8, date, timestamp, timestamptz, char, text, varchar, bool, oid |
| Aggregates (count, sum, avg, min, max) | int2, int4, int8, float8, numeric, date, money, engine.uint8 |

```sql
SET storage_engine.enable_vectorization = on;  -- default: on
```
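
A hedged example of the boundary: plain comparisons on listed types take the batched path, while an operator outside the table above (such as a LIKE pattern match) is presumed to drop to row-at-a-time evaluation:

```sql
-- Batched: int8 and timestamptz comparisons are in the supported list.
SELECT COUNT(*) FROM events
WHERE amount > 100 AND ts >= '2025-01-01';

-- LIKE is not a plain comparison operator, so this predicate is
-- assumed to be evaluated by the standard row-at-a-time interpreter.
SELECT COUNT(*) FROM events WHERE event_type LIKE 'purch%';
```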

## Zone-Map Pruning (Two Layers)

colcompress implements two layers of min/max pruning using statistics stored per chunk in engine.chunk.

### Stripe-level (coarse)

Before reading any data, the scan aggregates min/max across all chunks of each stripe and tests against the query’s WHERE predicates. Any stripe whose range is provably disjoint is skipped entirely — no I/O, no decompression.

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*) FROM events WHERE ts > '2025-01-01';
-- ...
--   Engine Stripes Removed by Pruning: 41
--   Engine Stripes Read: 12
```

### Chunk-level (fine)

Within each surviving stripe, individual chunk groups are tested. Those whose range cannot satisfy the predicate are skipped.

Pruning effectiveness scales with how well the data is sorted. Use `orderby` + `colcompress_merge()` to maximize it.


## MergeTree-Like Ordering

Inspired by ClickHouse’s MergeTree engine, colcompress supports a global sort key per table.

```sql
-- Set sort key
SELECT engine.alter_colcompress_table_set(
    'events'::regclass,
    orderby => 'ts ASC, user_id ASC'
);

-- After bulk load, compact into globally sorted stripes
SELECT engine.colcompress_merge('events');

-- Now date-range queries skip almost all stripes
SELECT COUNT(*), SUM(amount)
FROM events
WHERE ts BETWEEN '2024-01-01' AND '2024-01-31';
-- Stripes Removed by Pruning: 11
-- Stripes Read: 1
```

`colcompress_merge` internally:

1. Copies all live rows to a temporary heap table
2. Truncates the target table
3. Re-inserts the rows in `orderby` order, writing fresh, globally ordered stripes
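
The three steps are roughly equivalent to the following manual sequence — a sketch only; the built-in function also handles locking and the row mask, which this sketch does not:

```sql
BEGIN;
CREATE TEMP TABLE events_sorted AS        -- 1. copy live rows out
    SELECT * FROM events;
TRUNCATE events;                          -- 2. truncate the target
INSERT INTO events                        -- 3. re-insert in orderby order
    SELECT * FROM events_sorted ORDER BY ts ASC, user_id ASC;
COMMIT;
```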


## Parallel Scan

colcompress implements the full PostgreSQL parallel Table AM protocol using Dynamic Shared Memory (DSM). The coordinator pre-loads stripe IDs into DSM; each worker atomically claims the next stripe, decompresses it, and runs its own vectorized evaluation pipeline independently.

```sql
SET storage_engine.enable_parallel_execution = on;   -- default: on
SET storage_engine.min_parallel_processes = 8;       -- minimum workers (default: 8)

-- Standard PostgreSQL parallel knobs also apply
SET max_parallel_workers_per_gather = 8;
```

{: .warning }
Stripe-level pruning is a sequential-scan optimization. In parallel mode, each worker reads its assigned stripes independently without a global pruning pass. For date-range workloads that depend on pruning, run without parallelism or use a GIN / B-tree index on a non-colcompress table.
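
For a single pruning-sensitive statement, parallelism can be switched off with either the extension GUC or the standard PostgreSQL knob:

```sql
-- Either of these forces a non-parallel scan, re-enabling the
-- global stripe-pruning pass for the next query:
SET storage_engine.enable_parallel_execution = off;
-- or
SET max_parallel_workers_per_gather = 0;

SELECT COUNT(*) FROM events WHERE ts > '2025-01-01';

RESET storage_engine.enable_parallel_execution;
RESET max_parallel_workers_per_gather;
```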


## Index-Backed Scan

An optional index scan path allows B-tree and other indexes to drive lookups into a colcompress table, decompressing only the matched rows.

```sql
-- Enable per-session (for point lookups on analytical tables)
SET storage_engine.enable_engine_index_scan = on;

-- Enable permanently for a specific table (document stores)
SELECT engine.alter_colcompress_table_set('documents'::regclass, index_scan => true);
```
| Workload | Recommendation |
|----------|----------------|
| Analytics (GROUP BY, range scans, full scans) | Keep **off** — sequential scan + pruning wins |
| Document repository (XML, PDF, JSON blobs fetched by PK) | Set **on** — columnar compression + point-lookup speed |

{: .important }
Do not create a B-tree index on the `orderby` column of an analytical table. The planner will prefer the index, bypassing stripe pruning. For occasional point lookups, set `storage_engine.enable_engine_index_scan = on` at the session level instead.


## DELETE and UPDATE

colcompress fully supports DELETE and UPDATE via a row mask stored in engine.row_mask. A deleted row is marked by setting a bit in a per-chunk-group bitmask; the scan engine skips masked rows without rewriting the stripe. UPDATE is implemented as delete-then-insert.

```sql
SET storage_engine.enable_dml = on;  -- default: on

DELETE FROM events WHERE ts < now() - interval '1 year';
UPDATE events SET amount = amount * 1.1 WHERE event_type = 'purchase';
```

Deleted rows are reclaimed during VACUUM, which rewrites affected stripes and clears the row mask.
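
A typical retention job therefore pairs the DELETE with an explicit VACUUM so the affected stripes are rewritten right away:

```sql
DELETE FROM events WHERE ts < now() - interval '1 year';
VACUUM events;  -- rewrites stripes containing masked rows, clears the row mask
```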


## ON CONFLICT / Upserts

Standard INSERT … ON CONFLICT is fully supported:

```sql
INSERT INTO events (ts, user_id, event_type, amount)
VALUES (now(), 42, 'purchase', 99.90)
ON CONFLICT (user_id, event_type) DO UPDATE
    SET amount = EXCLUDED.amount,
        ts     = EXCLUDED.ts;
```

Requires a unique index on the conflict target column(s).
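
For the upsert above, that means a unique index like the following (the index name is illustrative):

```sql
CREATE UNIQUE INDEX events_user_event_uq
    ON events (user_id, event_type);
```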


## Column Cache

The AM maintains an in-memory column cache that stores decompressed column chunks across executor iterations. Useful for nested loops, repeated plan nodes, or self-joins where the same stripe region is accessed multiple times.

```sql
SET storage_engine.enable_column_cache = on;    -- default: off
SET storage_engine.column_cache_size = 200;     -- MB (default: 200)
```
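
A sketch of a workload that benefits (the self-join condition is illustrative): both sides of the join scan the same table, so decompressed chunks from the outer pass can be reused by the inner one instead of being decompressed again:

```sql
SET storage_engine.enable_column_cache = on;

SELECT a.user_id, COUNT(*)
FROM events a
JOIN events b
  ON a.user_id = b.user_id AND a.ts < b.ts
GROUP BY a.user_id;
```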

## engine.uint8 — Unsigned 64-bit Integer

storage_engine ships a native unsigned 64-bit integer type designed for columns that carry values in the full [0, 2⁶⁴−1] range, such as ClickBench’s WatchID and UserID columns.

```sql
CREATE TABLE hits (
    WatchID   engine.uint8,
    UserID    engine.uint8,
    EventTime timestamptz
) USING colcompress;

SET search_path TO engine, public;
SELECT min(WatchID), max(WatchID), sum(WatchID) FROM hits;
```
- **Storage:** 8 bytes — identical layout to bigint, zero overhead
- **Operators:** full unsigned semantics (`18446744073709551615 > 1` is true)
- **B-tree + hash opclasses:** ORDER BY, GROUP BY, DISTINCT, index scans
- **Casts:** uint8 ↔ bigint, uint8 ↔ numeric, uint8 ↔ text
- **Aggregates** (min, max, sum) in the engine schema — sum returns numeric to accommodate values > INT64_MAX
- **Vectorized:** engine.vmin, engine.vmax, engine.vsum dispatched automatically by the planner
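
A short sanity check of the unsigned semantics (the values are illustrative):

```sql
SET search_path TO engine, public;

SELECT '18446744073709551615'::uint8 > '1'::uint8;   -- true under unsigned compare
SELECT '18446744073709551615'::uint8::numeric;       -- cast above INT64_MAX

-- sum returns numeric, so the total may exceed the uint8 range
SELECT sum(v) FROM (VALUES ('18446744073709551615'::uint8),
                           ('1'::uint8)) AS t(v);
```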

## Per-Table Options

```sql
SELECT engine.alter_colcompress_table_set(
    'events'::regclass,
    stripe_row_limit      => 150000,   -- rows per stripe
    chunk_group_row_limit => 10000,    -- rows per chunk group
    compression           => 'zstd',
    compression_level     => 9,
    orderby               => 'ts ASC, user_id ASC',
    index_scan            => false
);

-- Reset individual options to system defaults
SELECT engine.alter_colcompress_table_reset(
    'events'::regclass,
    compression       => true,
    compression_level => true
);

-- Inspect
SELECT * FROM engine.colcompress_options WHERE table_name = 'events';

-- Stripe-level view
SELECT * FROM engine.colcompress_stripes WHERE table_name = 'events' LIMIT 5;
```

### Compression Options

| Codec | Description |
|-------|-------------|
| pglz | PostgreSQL built-in LZ (always available) |
| lz4 | Fast (~500 MB/s decompress). Requires liblz4-dev |
| zstd | Best ratio + good speed; levels 1–19. Recommended. Requires libzstd-dev |
| deflate | zlib-compatible; good middle ground. Requires libdeflate-dev |
| zxc | Asymmetric SIMD (NEON/AVX2/AVX-512); extremely fast decompress. Ideal for ARM Graviton |
| none | No compression |
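
Codecs are chosen per table via `alter_colcompress_table_set`; for example, switching a hot table to lz4 for faster decompression (whether a subsequent `colcompress_merge` recompresses existing stripes with the new codec is an assumption here):

```sql
SELECT engine.alter_colcompress_table_set(
    'events'::regclass,
    compression => 'lz4'
);
-- New stripes are written with lz4; recompacting presumably
-- rewrites older stripes with the new codec as well.
SELECT engine.colcompress_merge('events');
```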

## Known Limitations

- No AFTER ROW triggers or foreign keys — use engine.colcompress_repack() instead of pg_repack
- Stripe pruning is disabled in parallel mode — each worker reads its assigned stripes independently
- A B-tree index on the orderby column disables pruning — the planner prefers the index scan over a sequential scan
- INSERT … SELECT with pre-existing indexes may corrupt index TIDs — workaround: REINDEX TABLE CONCURRENTLY after the bulk load
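
For the last limitation, the workaround spelled out (assuming the bulk load targeted `events`):

```sql
-- Run after a bulk INSERT … SELECT into a table with pre-existing indexes:
REINDEX TABLE CONCURRENTLY events;
```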