Contents
- pg_trickle — Performance History
- Benchmark Setup
- Snapshot Timeline
- Optimization Details
- Historical Speedup: INCREMENTAL vs FULL
- Current State Summary (After P7)
- Absolute Timings (Current, 100K Rows, 1% Changes)
- Interpretation
- Running Benchmarks
- Design Decisions & Root Causes
- Original Root Cause (Phase 0)
- MERGE vs Temp Table (Phase A)
- JSONB Serialization Round-Trip (Phases A–B)
- SPI Call Reduction (Phases B–C)
- Why PREPARE/EXECUTE Was Rejected (Phase D)
- Scan Window Function Fast Path (Phase E)
- Aggregate Dedup Elimination (Phase E)
- Join Hash Simplification (Phase E)
- Where Time is Spent (Current)
- Warm-Cycle Performance (INCR cycles 2+, excludes cold-start)
- Full Evolution (scan 100K/1% — flagship scenario)
- Adaptive Threshold
- Part 9 — Regression Fixes & Benchmark Infrastructure (Sessions 1–2)
pg_trickle — Performance History
How has incremental (DIFFERENTIAL) refresh performance improved over the optimization sessions, and what is the current state?
Benchmark Setup
All benchmarks use the same full matrix: 5 scenarios × 2 table sizes × 3 change rates = 30 combinations, run inside isolated PostgreSQL 18.x Docker containers via Testcontainers. Each benchmark applies a realistic DML mix (INSERT/UPDATE/DELETE) per cycle and measures wall-clock refresh time.
| Scenario | Defining Query | Operators Tested |
|---|---|---|
| scan | SELECT id, region, category, amount, score FROM src |
Table scan only |
| filter | SELECT id, region, amount FROM src WHERE amount > 5000 |
Scan + filter |
| aggregate | SELECT region, SUM(amount), COUNT(*) FROM src GROUP BY region |
Scan + group-by aggregate |
| join | SELECT s.id, s.region, s.amount, d.region_name FROM src s JOIN dim d ON ... |
Scan + inner join |
| join_agg | SELECT d.region_name, SUM(s.amount), COUNT(*) FROM src s JOIN dim d ON ... GROUP BY ... |
Scan + join + aggregate |
Table sizes: 10,000 and 100,000 rows. Change rates: 1%, 10%, 50%.
Snapshot Timeline
Four benchmark snapshots were captured across the optimization sessions:
| Snapshot | What Changed |
|---|---|
| Baseline | No optimizations |
| After P1+P2 | Phase A: to_jsonb() trigger serialization, drop unused columns, LSN bounds. Phase B: cached MERGE SQL + cached OpTree/delta template |
| After P6 | + LATERAL VALUES aggregate rewrite, PK resolution, no-change short-circuit, EXISTS early-exit, prepared statements |
| After P7 | + deferred cleanup, warm-cycle tuning |
Optimization Details
Phase A — Quick Wins (P1)
- Fix 3: Replaced
row_to_json()::textwithto_jsonb()in CDC trigger functions for faster serialization. - Fix 4: Dropped unused
old_row_datacolumn from change buffer tables, reducing I/O. - Fix 11: Added upper LSN bound to change buffer queries and cleanup, preventing unbounded scans.
Phase B — Caching (P2)
- Fix 9: Cached compiled MERGE SQL across refresh cycles, eliminating repeated query planning (~30ms saved per refresh).
- Fix 10: Cached parsed OpTree and delta SQL template, avoiding redundant parsing and differentiation (~15ms saved per refresh).
Phase C — Operator Improvements (P3–P6)
- Fix 5: Replaced 4-branch UNION ALL in aggregate final delta CTE with single-pass
LATERAL VALUESexpansion — eliminates 3 redundant scans of the merge CTE. - Fix 6: Primary key columns resolved from
pg_constraintat parse time and stored inOpTree::Scan, preferring real PKs over the non-nullable column heuristic. - No-change short-circuit: When the change buffer is empty, skip MERGE entirely and return in <2ms.
- EXISTS early-exit: Before building the delta query, run a cheap
SELECT EXISTS(...)on the change buffer to detect no-ops. - Prepared statements: Reuse prepared MERGE statements across cycles to avoid repeated planning.
Phase D — Cleanup & Tuning (P7)
- Deferred cleanup: Change buffer cleanup moved to after the MERGE completes, reducing lock contention during the critical path.
- Warm-cycle tuning: Improved steady-state performance for cycles 2+ after initial cold start.
Historical Speedup: INCREMENTAL vs FULL
At 1% Change Rate (Typical Production Workload)
| Scenario | Rows | Baseline | After P1+P2 | After P6 | After P7 (Current) |
|---|---|---|---|---|---|
| scan | 10K | 2.6× | 2.6× | 10.1× | 4.4× |
| scan | 100K | 0.7× (slower!) | 2.7× | 41.7× | 7.0× |
| filter | 10K | 1.9× | 1.7× | 5.4× | 3.3× |
| filter | 100K | 2.5× | 2.7× | 26.3× | 7.9× |
| aggregate | 10K | 1.4× | 1.2× | 1.7× | 1.5× |
| aggregate | 100K | 1.3× | 1.4× | 4.4× | 2.6× |
| join | 10K | 2.3× | 1.6× | 6.7× | 2.9× |
| join | 100K | 2.6× | 2.7× | 30.4× | 6.0× |
| join_agg | 10K | 1.0× | 1.0× | 0.9× | 0.7× |
| join_agg | 100K | 1.3× | 1.5× | 4.9× | 3.1× |
At 10% Change Rate
| Scenario | Rows | Baseline | After P1+P2 | After P6 | After P7 (Current) |
|---|---|---|---|---|---|
| scan | 10K | 2.1× | 2.2× | 6.4× | 5.0× |
| scan | 100K | 0.8× | 1.8× | 4.5× | 3.3× |
| filter | 10K | 1.4× | 2.2× | 3.3× | 1.9× |
| filter | 100K | 1.6× | 1.6× | 3.5× | 3.1× |
| aggregate | 10K | 0.9× | 0.8× | 1.2× | 1.1× |
| aggregate | 100K | 0.6× | 0.6× | 1.1× | 0.8× |
| join | 10K | 1.4× | 1.4× | 4.1× | 1.8× |
| join | 100K | 1.5× | 1.9× | 3.1× | 3.3× |
| join_agg | 10K | 0.8× | 0.7× | 1.8× | 0.9× |
| join_agg | 100K | 0.7× | 0.7× | 0.9× | 0.8× |
At 50% Change Rate
| Scenario | Rows | Baseline | After P1+P2 | After P6 | After P7 (Current) |
|---|---|---|---|---|---|
| scan | 10K | 1.1× | 1.1× | 3.1× | 2.5× |
| scan | 100K | 0.9× | 0.9× | 0.9× | 0.8× |
| filter | 10K | 1.2× | 1.2× | 0.7× | 0.9× |
| filter | 100K | 1.1× | 1.8× | 1.0× | 1.1× |
| aggregate | 10K | 1.2× | 1.7× | 1.6× | 1.0× |
| aggregate | 100K | 0.9× | 0.9× | 0.8× | 0.6× |
| join | 10K | 1.1× | 1.1× | 1.1× | 1.0× |
| join | 100K | 1.1× | 1.1× | 0.9× | 0.9× |
| join_agg | 10K | 1.3× | 1.1× | 0.6× | 1.4× |
| join_agg | 100K | 0.7× | 0.9× | 0.8× | 0.7× |
Current State Summary (After P7)
At 1% Change Rate (Typical Production Workload)
| Scenario | 10K rows | 100K rows |
|---|---|---|
| scan | 4.4× faster | 7.0× faster |
| filter | 3.3× faster | 7.9× faster |
| aggregate | 1.5× faster | 2.6× faster |
| join | 2.9× faster | 6.0× faster |
| join+agg | 0.7× (overhead) | 3.1× faster |
At 10% Change Rate
| Scenario | 10K rows | 100K rows |
|---|---|---|
| scan | 5.0× faster | 3.3× faster |
| filter | 1.9× faster | 3.1× faster |
| aggregate | 1.1× | 0.8× |
| join | 1.8× faster | 3.3× faster |
| join+agg | 0.9× | 0.8× |
At 50% Change Rate
At 50% churn, INCREMENTAL approaches or slightly exceeds FULL cost across most scenarios (0.6–2.5×). This is expected — when half the table changes, the delta is nearly as expensive as a full recompute.
Absolute Timings (Current, 100K Rows, 1% Changes)
| Scenario | FULL avg ms | INCREMENTAL avg ms | Speedup |
|---|---|---|---|
| scan | 326 | 47 | 7.0× |
| filter | 267 | 34 | 7.9× |
| aggregate | 28 | 11 | 2.6× |
| join | 384 | 64 | 6.0× |
| join_agg | 67 | 22 | 3.1× |
Interpretation
What the Numbers Show
Baseline was rough — INCREMENTAL was actually slower than FULL for scan/100K (0.7×) and several other combos were near 1.0×. The overhead of delta computation, MERGE, and change buffering exceeded the cost of just re-running the query.
P1+P2 (caching) fixed the worst regressions. Caching the compiled MERGE SQL and parsed OpTree eliminated ~45ms of repeated planning per cycle. Large-table scan went from 0.7× to 2.7×.
P6 was the breakthrough — no-change short-circuit, EXISTS early-exit, and prepared statements produced the most dramatic gains. At 1% changes on 100K rows:
- scan: 41.7× (346ms → 8.3ms)
- filter: 26.3× (196ms → 7.4ms)
- join: 30.4× (374ms → 12.3ms)
These extreme multipliers reflect that in a 10-cycle run, most cycles have no changes (the DML happens once, then subsequent cycles find empty change buffers and short-circuit in <2ms).
P7 (current) shows somewhat lower peak numbers than P6 due to benchmark variability (different run, cold-start cycle 1 weighing on averages) and the addition of deferred cleanup overhead. The steady-state “cycle 2+” numbers in the P6 detailed output are the most representative of real-world performance.
Theoretical Basis
The larger the table and the smaller the change rate, the bigger the INCREMENTAL win. Differential maintenance is O(Δ) while full refresh is O(N), so the ratio improves as N/Δ grows. At 100K rows / 1% changes the system processes ~1,000 changed rows instead of re-scanning 100,000, yielding 3–8× speedups in practice (and up to 30–40× when most cycles are no-ops).
When INCREMENTAL Doesn’t Help
- 50% change rate on large tables: The delta is nearly as big as the table itself.
- Small aggregates (join_agg at 10K): The FULL refresh is already ~12ms; the overhead of delta computation, MERGE template, and cleanup exceeds the savings.
- Aggregate 100K at 10–50%: Group-rescan aggregates require reading the stream table to reconstruct counts, which at high change rates approaches full-table cost.
Running Benchmarks
# Build the E2E test image first
./tests/build_e2e_image.sh
# Quick spot check
cargo test --test e2e_bench_tests --features pg18 -- --ignored --nocapture bench_scan_10k_1pct
# Full 30-combination matrix (~15–30 minutes)
cargo test --test e2e_bench_tests --features pg18 -- --ignored --nocapture bench_full_matrix
See BENCHMARK.md for full details on the benchmark harness.
Design Decisions & Root Causes
This section preserves the key technical decisions and root-cause findings from the performance analysis phases (PLAN_PERFORMANCE through PLAN_PERFORMANCE_PART_7). These files were deleted after extracting the information below.
Original Root Cause (Phase 0)
The initial benchmark showed INCREMENTAL was slower than FULL for scan/100K (0.7×). Root cause: execute_manual_refresh() ignored the refresh_mode parameter and always performed TRUNCATE + INSERT — effectively comparing FULL vs FULL with extra overhead. This was the single most impactful bug discovered.
MERGE vs Temp Table (Phase A)
The original incremental apply used a 3-step SPI sequence: create temp table, DELETE matching rows, INSERT new rows. This was replaced with a single PostgreSQL MERGE statement that handles INSERT/UPDATE/DELETE in one operation. Result: 3 SPI calls → 1, ~30% less I/O, atomic application.
JSONB Serialization Round-Trip (Phases A–B)
JSONB serialization was identified as the dominant bottleneck — not query planning as initially suspected. The full path: row → JSON text → binary JSONB → store → read → typed row. Two key fixes:
- Trigger serialization: Replaced
row_to_json()::jsonbwithto_jsonb(NEW)— eliminates the text→binary conversion step. - Row reconstruction: Replaced per-column
->>'col'extraction withjsonb_populate_record()— a single function call reconstructs the entire row instead of N separate field extractions.
SPI Call Reduction (Phases B–C)
The no-op refresh path originally executed 7+ SPI round-trips (13+ through the scheduler) even with zero pending changes. Key eliminations:
- Redundant
SELECT count(*): A full sequential scan of every change buffer was purely informational and served no purpose in the trigger-based CDC model. Replaced withSELECT EXISTS(... LIMIT 1). - Combined SPI queries:
reltuples+ capped-count queries merged viaLATERAL.store_frontier+complete_refreshmerged into single UPDATE. - No-op latency: 86ms → 14.8ms → sub-2ms.
Why PREPARE/EXECUTE Was Rejected (Phase D)
Prepared statements (PREPARE + EXECUTE) were implemented and then reverted. Root cause analysis:
- Custom plans (first ≤5 executions) have the same planning cost as direct
SPI_execute(), plus the overhead of looking up the prepared statement and binding arguments. No benefit for short-cycle benchmarks (CYCLES ≤ 5). - Generic plans (≥6 executions) skip re-planning but use wrong selectivity for LSN range predicates — without concrete LSN values, the planner uses average-case cardinality estimates, leading to wrong sort strategies for GROUP BY and join methods.
- String substitution with concrete parameter values provides correct plans cheaply on every call.
Scan Window Function Fast Path (Phase E)
At 1% changes, ~95%+ of PKs have exactly one change per refresh cycle. The expensive FIRST_VALUE/LAST_VALUE window functions (requiring full partition sort) are unnecessary for these single-change PKs. A 4-CTE split pipeline (pk_stats → single → multi_raw → scan_raw) routes single-change PKs through a direct emit path, skipping window functions entirely. Impact: scan 100K/1% went from 46.5ms → 8.3ms (5.6×).
Aggregate Dedup Elimination (Phase E)
Aggregates originally emitted D+I (delete+insert) row pairs per updated group, requiring a DISTINCT ON sort in the MERGE wrapper. A CASE-based single-row emit marks each group with the correct action directly ('I' for new groups, 'D' for deleted groups, 'I' for updated groups), making aggregate output inherently deduplicated. Eliminates the O(n log n) sort.
Join Hash Simplification (Phase E)
Original join row IDs used nested hash calls: pg_trickle_hash_multi(ARRAY[dl.__pgt_row_id::TEXT, pg_trickle_hash_multi(ARRAY[r."id"::TEXT])::TEXT]) — 2 Rust FFI crossings + 3 TEXT casts per row. Flattened to a single pg_trickle_hash_multi(ARRAY[...]) call, reducing from 2 hash calls to 1 per row.
Where Time is Spent (Current)
With pipeline overhead reduced to sub-2ms on cache hits, MERGE execution dominates all scenarios (69–93% of total refresh time). The remaining performance is determined by PostgreSQL’s execution of the delta CTE chain and MERGE statement — not by the extension’s orchestration layer.
| Phase | 100K/1% | 100K/10% |
|---|---|---|
| Decision | <1ms | <2ms |
| Gen+Build | <1ms | <1ms |
| MERGE | 3–8ms | 13–98ms |
| Cleanup | <1ms | 3–5ms |
Warm-Cycle Performance (INCR cycles 2+, excludes cold-start)
| Scenario | 10K/1% | 100K/1% | 100K/10% |
|---|---|---|---|
| scan | 6.9ms | 12.8ms | 107.8ms |
| filter | 7.8ms | 17.0ms | 54.3ms |
| join | 10.0ms | 29.9ms | 111.7ms |
| aggregate | 6.0ms | 8.3ms | 40.7ms |
| join_agg | 17.5ms | 18.5ms | 52.9ms |
Full Evolution (scan 100K/1% — flagship scenario)
| Phase | INCR ms | INCR/FULL Ratio | Key Change |
|---|---|---|---|
| Baseline | 572.4 | 0.7× | Bug: manual refresh ignored mode |
| After P1+P2 | ~135 | 2.7× | Caching + JSONB fixes |
| After P3 | ~135 | 2.5× | SPI reduction (count→EXISTS) |
| After P5 (P7) | 46.5 | 7.0× | LATERAL VALUES, deferred cleanup |
| After P6 (Part 6) | 8.3 | 41.7× | Scan fast path, join hash, aggregate dedup |
Adaptive Threshold
The system automatically falls back from INCREMENTAL to FULL refresh when the change rate exceeds a configurable threshold (default: 15%). This prevents the pathological case where delta processing on large change sets is slower than full recomputation. The threshold can be tuned per stream table via the pg_trickle.differential_max_change_ratio GUC.
Part 9 — Regression Fixes & Benchmark Infrastructure (Sessions 1–2)
Date: 2026-03-15
Rust-Side Micro-Optimizations
| Function | Before | After | Change | Root Cause |
|---|---|---|---|---|
prefixed_col_list/20 |
~4.0 µs | Expected ~2.8 µs | ~30% improvement | Eliminated intermediate Vec allocation; stream directly to String |
lsn_gt("0/0", "0/1") |
~79 ns | Expected ~55 ns | ~30% improvement | Replaced split('/').collect::<Vec<_>>() with split_once('/') |
col_list/10 |
~1.37 µs | Expected ~1.1 µs | ~20% improvement | Same optimization as prefixed_col_list |
Benchmark Infrastructure Improvements
| ID | Improvement | Impact |
|---|---|---|
| I-1c | just bench-docker target |
Run Criterion inside Docker when local pg_stub fails |
| I-2 | [BENCH_CYCLE] parseable output |
Enables histogram analysis, trend detection |
| I-3 | PGS_BENCH_EXPLAIN=true |
Captures EXPLAIN (ANALYZE, BUFFERS) for delta queries |
| I-6 | 1M-row benchmark tier | Tests production-scale tables (10K changes at 1%) |
| I-8 | sample_size(200), measurement_time(10s) |
Reduces outlier rate from 10–19% to expected <5% |