DISTINCT That Doesn’t Recount

DISTINCT That Doesn’t Recount

Reference counting for incremental deduplication

SELECT DISTINCT in a materialized view means a full table scan on every refresh. PostgreSQL has to see all the rows to determine which are unique. There’s no shortcut — without knowing the full data set, you can’t know if a row is duplicated.

pg_trickle has a shortcut: reference counting. Each distinct value gets a counter tracking how many source rows produce it. Insert a duplicate? Increment the counter. Delete a row? Decrement the counter. When the counter hits zero, the value is removed from the result.

No scan of existing data. O(delta) per refresh.

The Problem

SELECT DISTINCT region, product_category
FROM orders
JOIN customers ON customers.id = orders.customer_id;

This query deduplicates (region, product_category) pairs. With 50 million orders across 200 unique pairs, a full refresh scans 50 million rows to produce 200 rows.

With IVM, 10 new orders come in. Do they create any new (region, product_category) pairs? Or are all 10 in existing pairs? To answer this without scanning, you need to know how many rows currently produce each pair.

The __pgt_dup_count Column

pg_trickle maintains a hidden column __pgt_dup_count on stream tables that use DISTINCT. This column tracks the multiplicity of each row in the result — how many source rows produce it.

-- What the stream table actually stores (internal representation)
SELECT region, product_category, __pgt_dup_count
FROM pgtrickle.distinct_pairs;

  region    | product_category | __pgt_dup_count
------------+------------------+-----------------
 Northeast  | Electronics      | 12,847
 Northeast  | Clothing         | 8,432
 Southeast  | Electronics      | 15,291
 ...        | ...              | ...

The user-visible query (SELECT * FROM distinct_pairs) hides __pgt_dup_count — you just see the distinct values.

Delta Rules

INSERT a row that matches an existing distinct value: __pgt_dup_count += 1 No new row is added to the result. The value was already there.

INSERT a row with a new distinct value: INSERT row with __pgt_dup_count = 1 New distinct value appears in the result.

DELETE a row: __pgt_dup_count -= 1 If __pgt_dup_count = 0: DELETE the row from the result The distinct value disappears only when all source rows producing it are gone.

UPDATE a row (changes the distinct columns): Decrement old value's __pgt_dup_count (possibly remove it) Increment new value's __pgt_dup_count (possibly insert it)

Example

SELECT pgtrickle.create_stream_table(
  name  => 'active_regions',
  query => $$
    SELECT DISTINCT c.region
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    WHERE o.status = 'active'
  $$,
  schedule => '5s'
);

Initial state: 5 regions, each with thousands of active orders.

A customer in the “Pacific” region places their first order. Before this, “Pacific” had 0 active orders — it wasn’t in the result. Now it has 1.

pg_trickle: 1. Sees the INSERT in the change buffer. 2. Joins with customers to get region = “Pacific”. 3. Checks: does “Pacific” exist in the result? No → INSERT with __pgt_dup_count = 1.

Later, the only Pacific order is cancelled (status changes to ‘cancelled’, which is filtered out by WHERE status = 'active'). The effective delta is a DELETE of that row.

pg_trickle: 1. Decrements Pacific’s __pgt_dup_count from 1 to 0. 2. Removes “Pacific” from the result.

All other regions are untouched. The refresh processes 1 row, not millions.

DISTINCT ON

PostgreSQL’s DISTINCT ON is a different feature from DISTINCT. It returns one row per group, ordered by a specified column:

SELECT DISTINCT ON (customer_id)
  customer_id, order_id, total, created_at
FROM orders
ORDER BY customer_id, created_at DESC;

This returns the most recent order per customer. It’s a common pattern for “latest row per group.”

pg_trickle handles DISTINCT ON with the same reference-counting approach, but the tie-breaking logic is more complex. The stream table maintains the winning row (based on the ORDER BY) and updates it when:

A new row with a higher sort value is inserted (it becomes the new winner).
The current winner is deleted (the next-best row becomes the winner).

This requires knowing what the “next-best” row is, which in turn requires a lookup against the source data. The cost is O(changed groups) — for each group that was affected by the delta, one query to find the new winner.

DISTINCT with Expressions

DISTINCT can appear with computed columns:

SELECT DISTINCT date_trunc('month', created_at) AS month
FROM events;

The reference counting applies to the computed value, not the raw column. Two events on different days in the same month produce the same distinct value and share a __pgt_dup_count.

Performance

The reference-count approach makes DISTINCT maintenance O(|ΔT|) — proportional to the number of changed rows, not the table size.

Scenario	FULL refresh	DIFFERENTIAL refresh
10 inserts, all in existing groups	~500ms (full scan)	<1ms (10 counter increments)
10 inserts, 2 new groups	~500ms	<1ms (8 increments + 2 inserts)
10 deletes, none empties a group	~500ms	<1ms (10 counter decrements)
10 deletes, 1 group drops to zero	~500ms	<1ms (9 decrements + 1 delete)

The only scenario where DIFFERENTIAL doesn’t help is when the delta touches every group — but that’s rare for DISTINCT queries, which by definition have fewer groups than rows.

When Not to Use DISTINCT in Stream Tables

DISTINCT adds the __pgt_dup_count overhead to every row. If your query naturally produces unique rows (e.g., GROUP BY with a key that guarantees uniqueness), adding DISTINCT is redundant and wasteful.

Check with EXPLAIN:

EXPLAIN SELECT DISTINCT region, SUM(total)
FROM orders GROUP BY region;

If the GROUP BY already produces unique (region) rows, the DISTINCT is a no-op. Remove it — pg_trickle still maintains the stream table correctly, without the reference-counting overhead.

Summary

DISTINCT in stream tables uses reference counting (__pgt_dup_count) to avoid full-scan deduplication. Insert increments, delete decrements, and rows are removed only when the count reaches zero.

The cost is O(delta), not O(table). For the common case — many source rows, few distinct values, small changes per cycle — this is orders of magnitude faster than recomputing.

DISTINCT ON works similarly but with tie-breaking logic. Remove DISTINCT if GROUP BY already ensures uniqueness. And don’t worry about the hidden counter column — it’s invisible to your queries.

PGXN

PostgreSQL Extension Network

Contents