Contents
PLAN_PERFORMANCE_PART_8.md — Residual Bottlenecks & Next-Wave Optimizations
Current Benchmark Results (2026-02-22)
Full Matrix — Summary (avg ms per cycle)
| Scenario | Rows | Chg % | FULL ms | INCR ms | INCR c1 | INCR 2+ | INCR med | INCR P95 |
|---|---|---|---|---|---|---|---|---|
| scan | 10K | 1% | 32.9 | 1.4 (22.8x) | 1.1 | 1.5 | 1.1 | 2.7 |
| scan | 10K | 10% | 29.2 | 2.6 (11.2x) | 3.8 | 2.5 | 2.4 | 4.4 |
| scan | 10K | 50% | 37.0 | 36.7 (1.0x) | 28.5 | 37.6 | 27.6 | 74.0 |
| scan | 100K | 1% | 300.1 | 4.6 (65.9x) | 1.8 | 4.9 | 1.7 | 9.2 |
| scan | 100K | 10% | 265.2 | 111.5 (2.4x) | 113.0 | 111.3 | 116.3 | 144.6 |
| scan | 100K | 50% | 548.8 | 821.7 (0.7x) | 881.2 | 815.1 | 841.4 | 956.0 |
| filter | 10K | 1% | 21.7 | 4.8 (4.6x) | 2.1 | 5.1 | 4.9 | 9.3 |
| filter | 10K | 10% | 19.4 | 3.4 (5.7x) | 1.6 | 3.6 | 1.8 | 7.6 |
| filter | 10K | 50% | 18.5 | 21.5 (0.9x) | 15.0 | 22.2 | 15.9 | 47.8 |
| filter | 100K | 1% | 134.4 | 3.4 (39.5x) | 1.9 | 3.6 | 1.7 | 9.8 |
| filter | 100K | 10% | 151.7 | 82.5 (1.8x) | 84.4 | 82.3 | 83.0 | 92.0 |
| filter | 100K | 50% | 212.1 | 175.4 (1.2x) | 191.4 | 173.7 | 180.6 | 206.8 |
| aggregate | 10K | 1% | 4.4 | 2.3 (1.9x) | 2.7 | 2.2 | 1.7 | 4.3 |
| aggregate | 10K | 10% | 7.5 | 1.9 (4.0x) | 1.8 | 1.9 | 1.7 | 2.8 |
| aggregate | 10K | 50% | 7.6 | 8.5 (0.9x) | 5.8 | 8.8 | 5.5 | 22.4 |
| aggregate | 100K | 1% | 17.2 | 2.5 (6.8x) | 2.4 | 2.5 | 1.7 | 4.6 |
| aggregate | 100K | 10% | 18.3 | 10.7 (1.7x) | 19.7 | 9.6 | 9.4 | 18.8 |
| aggregate | 100K | 50% | 25.3 | 40.4 (0.6x) | 30.6 | 41.4 | 31.4 | 70.9 |
| join | 10K | 1% | 38.4 | 11.1 (3.5x) | 44.3 | 7.4 | 7.1 | 28.8 |
| join | 10K | 10% | 29.6 | 33.2 (0.9x) | 23.5 | 34.2 | 15.6 | 111.7 |
| join | 10K | 50% | 32.9 | 28.8 (1.1x) | 25.8 | 29.1 | 26.4 | 40.8 |
| join | 100K | 1% | 294.0 | 18.0 (16.3x) | 20.1 | 17.8 | 17.5 | 21.0 |
| join | 100K | 10% | 453.8 | 188.2 (2.4x) | 151.5 | 192.2 | 165.9 | 328.4 |
| join | 100K | 50% | 348.1 | 320.2 (1.1x) | 297.7 | 322.6 | 308.8 | 362.4 |
| join_agg | 10K | 1% | 6.1 | 6.0 (1.0x) | 9.4 | 5.6 | 5.5 | 8.0 |
| join_agg | 10K | 10% | 7.5 | 14.1 (0.5x) | 13.9 | 14.1 | 12.6 | 18.7 |
| join_agg | 10K | 50% | 8.1 | 10.2 (0.8x) | 7.7 | 10.5 | 7.6 | 21.9 |
| join_agg | 100K | 1% | 27.8 | 14.6 (1.9x) | 15.7 | 14.5 | 10.9 | 30.5 |
| join_agg | 100K | 10% | 31.4 | 95.4 (0.3x) | 98.4 | 95.0 | 92.3 | 111.1 |
| join_agg | 100K | 50% | 33.1 | 35.1 (0.9x) | 34.5 | 35.2 | 34.6 | 37.3 |
Per-Phase Timing Breakdown (DIFFERENTIAL avg ms)
| Scenario | Rows | Chg % | Decision | Gen+Build | Merge | Cleanup | Path |
|---|---|---|---|---|---|---|---|
| scan | 10K | 1% | 0.23 | 0.14 | 1.07 | 0.11 | cache_hit |
| scan | 10K | 10% | 0.28 | 0.09 | 2.24 | 0.22 | cache_hit |
| scan | 100K | 1% | 0.52 | 0.05 | 6.46 | 0.47 | cache_hit |
| scan | 100K | 10% | 1.53 | 0.06 | 100.30 | 7.74 | cache_hit |
| filter | 10K | 1% | 0.39 | 0.19 | 1.52 | 0.13 | cache_miss |
| filter | 10K | 10% | 0.38 | 0.12 | 3.19 | 0.28 | cache_hit |
| filter | 100K | 1% | 0.49 | 0.12 | 6.24 | 0.48 | cache_hit |
| filter | 100K | 10% | 1.51 | 0.05 | 74.90 | 3.45 | cache_hit |
| aggregate | 10K | 1% | 0.30 | 0.15 | 0.82 | 0.12 | cache_hit |
| aggregate | 10K | 10% | 0.28 | 0.04 | 0.96 | 0.19 | cache_hit |
| aggregate | 100K | 1% | 0.46 | 0.08 | 1.89 | 0.39 | cache_hit |
| aggregate | 100K | 10% | 0.94 | 0.05 | 5.35 | 1.59 | cache_hit |
| join | 10K | 1% | 0.51 | 0.03 | 5.09 | 1.18 | cache_hit |
| join | 10K | 10% | 0.66 | 0.03 | 11.68 | 1.19 | cache_hit |
| join | 100K | 1% | 0.72 | 0.09 | 12.85 | 1.07 | cache_hit |
| join | 100K | 10% | 1.85 | 0.10 | 168.38 | 4.20 | cache_hit |
| join_agg | 10K | 1% | 0.48 | 0.09 | 2.64 | 0.65 | cache_hit |
| join_agg | 10K | 10% | 0.77 | 0.09 | 8.13 | 1.30 | cache_hit |
| join_agg | 100K | 1% | 0.71 | 0.09 | 6.67 | 0.88 | cache_hit |
| join_agg | 100K | 10% | 1.80 | 0.10 | 79.28 | 4.53 | cache_hit |
No-Data Refresh Latency
- Avg: 9.73 ms (target <10 ms: ✅ PASS, borderline)
- Max: 80.38 ms (cold start)
Criterion Micro-Benchmarks (delta SQL generation, pure Rust)
| Benchmark | Time (µs) |
|---|---|
| diff_scan/3cols | 9.9 |
| diff_scan/10cols | 24.1 |
| diff_scan/20cols | 47.7 |
| diff_filter | 11.3 |
| diff_project | 10.7 |
| diff_aggregate/count_star | 9.3 |
| diff_aggregate/sum_count_avg | 21.7 |
| diff_inner_join | 30.7 |
| diff_left_join | 26.8 |
| diff_distinct | 13.6 |
| diff_union_all/3_children | 18.8 |
| diff_union_all/5_children | 45.3 |
| diff_union_all/10_children | 105.9 |
| diff_window_row_number | 17.1 |
| diff_join_aggregate | 54.2 |
| diff_cte_simple | 16.1 |
| diff_lateral_srf | 13.5 |
Comparison with Previous Checkpoints
Key Scenario: 100K rows, 1% changes
| Checkpoint | scan INCR ms | filter INCR ms | join INCR ms | agg INCR ms | join_agg INCR ms |
|---|---|---|---|---|---|
| Baseline | 572.4 (0.7x) | 126.0 (1.0x) | 336.1 (1.0x) | 22.3 (1.3x) | 45.3 (1.2x) |
| After P1+P2 | 140.3 (2.7x) | 68.8 (2.7x) | 163.9 (2.7x) | 24.3 (1.4x) | 33.9 (1.5x) |
| After P7 | 46.5 (7.0x) | 34.0 (7.9x) | 63.5 (6.0x) | 10.9 (2.6x) | 21.6 (3.1x) |
| After Part 6 | 8.3 (41.7x) | 7.4 (26.3x) | 12.3 (30.4x) | 5.7 (4.4x) | 8.5 (4.9x) |
| Now (Part 8) | 4.6 (65.9x) | 3.4 (39.5x) | 18.0 (16.3x) | 2.5 (6.8x) | 14.6 (1.9x) |
Observations vs Part 6
| Scenario | Part 6 → Now | Notes |
|---|---|---|
| scan 100K/1% | 8.3 → 4.6 (1.8x faster) | Continued improvement, now under 5ms |
| filter 100K/1% | 7.4 → 3.4 (2.2x faster) | Excellent; near-zero overhead |
| agg 100K/1% | 5.7 → 2.5 (2.3x faster) | Big improvement |
| join 100K/1% | 12.3 → 18.0 (0.7x regression) | Regression — investigate |
| join_agg 100K/1% | 8.5 → 14.6 (0.6x regression) | Regression — investigate |
Pipeline Overhead (Decision + Gen+Build)
- Consistently < 1ms for cache hits across all scenarios at 1% change rate
- Gen+Build ≤ 0.19ms — delta SQL generation is negligible
- MERGE dominates: 70–97% of total time in every scenario
Analysis & Bottleneck Identification
B1: Join & Join_Agg Regressions (100K/1%)
Symptom: join 12.3→18.0ms (+46%), join_agg 8.5→14.6ms (+72%).
Probable causes:
- Rescan CTE overhead: The new rescan CTE (added for group-rescan aggregates) adds a LEFT JOIN to the merge CTE. Even for algebraic aggregates (SUM/COUNT), the CTE is structurally present — the planner must still evaluate build_rescan_cte() returning None, but the code should skip the JOIN entirely. Need to verify the generated SQL doesn’t have any residual artifact.
- Docker image changes: The E2E image was rebuilt with all recent code changes. Compilation flags or shared library differences could cause planner behavior changes.
- Benchmark variance: Join P95 = 21.0ms with median 17.5ms. Previous Part 6 numbers had median = 13.2ms but that run used different Docker image builds on potentially different host load conditions. The join 10K/1% cycle-1 spike of 44.3ms (vs 7.4 on 2+) suggests cache warming variance.
Action: Profile the generated delta SQL for join 100K/1% to isolate whether the CTE structure changed, or if PostgreSQL planner behavior shifted.
B2: join_agg Severe Regressions at 10%
Symptom: join_agg 100K/10% → 95.4ms (0.3x FULL). This is the worst scenario in the matrix.
Root cause: At 10%, ~10K change rows feed through the join+aggregate pipeline. The MERGE must: 1. Semi-join delta keys against the base table (10K keys) 2. Re-aggregate N groups 3. MERGE into the stream table
With only 5 groups, 10K changes touching all 5 groups means every group gets re-aggregated. The incremental path does strictly more work than FULL (which is a simple TRUNCATE + INSERT INTO ... SELECT).
Potential fix: Adaptive fallback should detect this and switch to FULL. Verify auto_threshold is triggering correctly for join_agg at 10%.
B3: scan 100K/50% Severe Regression (0.7x)
Symptom: 821.7ms vs 548.8ms FULL. P95 = 956ms.
Root cause: At 50% change rate, 50K rows flow through the delta pipeline. The MERGE INTO with 50K delta rows competing against 100K existing rows is strictly more expensive than TRUNCATE + INSERT INTO ... SELECT with 100K rows (which is a simple sequential scan + load). The adaptive threshold should be catching this.
Action: Lower the adaptive threshold or ensure it is operational for scan queries.
B4: No-Data Latency at Target Boundary (9.73ms)
Symptom: Avg 9.73ms, barely passing the <10ms target. Max = 80.38ms.
Analysis: Previous measurements were ~3ms. The increase likely comes from: - Additional SPI calls in the rescan CTE infrastructure (checking for group-rescan aggregates) - Changed Docker host conditions - Benchmark run variance
Action: Re-run with more iterations. If consistently >5ms, profile the no-data path.
B5: Aggregate P95 Instability (10K/50%)
Symptom: agg 10K/50% median = 5.5ms but P95 = 22.4ms (4x spike).
Root cause: PostgreSQL plan cache invalidation. Every Nth execution triggers re-planning, which costs ~10-15ms. With 50% changes affecting all 5 groups, the merge complexity varies across cycles.
B6: MERGE Dominance — The Fundamental Limit
The per-phase breakdown shows MERGE execution accounts for: - 1ms at 10K/1% (fast — minimal work) - 5-13ms at 100K/1% (moderate) - 74-168ms at 100K/10% (dominates total time)
The pipeline overhead (Decision + Gen+Build) is <1ms — essentially zero. Any further speedup must come from the MERGE SQL itself or from skipping it entirely.
Proposed Optimizations
Phase A: Regression Investigation & Fixes (Priority: HIGH)
A-1: Verify rescan CTE not leaking into algebraic aggregate SQL
Goal: Confirm that diff_aggregate for SUM/COUNT/AVG does NOT add a rescan CTE or LEFT JOIN.
Steps:
1. Add a unit test asserting no agg_rescan appears in diff_aggregate output for SUM/COUNT queries
2. If found, fix build_rescan_cte to return None more efficiently for algebraic-only aggregate lists
Effort: 30 min. Impact: May explain the join_agg 100K/1% regression.
A-2: Profile join 100K/1% delta SQL
Goal: Compare the generated MERGE SQL from Part 6 vs now.
Steps:
1. Add [PGS_DEBUG] SQL logging for join queries
2. Run benchmark, capture the full delta SQL
3. Compare CTE structure with /target/criterion/ historical baselines
4. Check if any new CTEs were introduced
Effort: 1 hour. Impact: Diagnose +46% join regression.
A-3: Audit adaptive threshold for regression scenarios
Goal: Verify auto_threshold triggers FULL fallback for join_agg 10K/10% and scan 100K/50%.
Steps:
1. Add [PGS_PROFILE] logging showing threshold decisions
2. Verify that after 2-3 cycles of INCR being slower, the threshold auto-adjusts
3. If not triggering, lower the default threshold or fix the calculation
Effort: 1 hour. Impact: Should eliminate 0.3x-0.7x scenarios by falling back to FULL.
Phase B: MERGE Optimization (Priority: HIGH)
B-1: Conditional MERGE bypass for no-change groups (aggregates)
Problem: For algebraic aggregates with 5 groups (SUM/COUNT), every delta cycle re-checks all 5 groups even if some have zero changes. The MERGE must read from delta, LEFT JOIN the stream table, compute new values, and write for each group — even when new_count = old_count AND new_sum = old_sum.
Fix: Add a pre-MERGE filter CTE that eliminates unchanged groups before the MERGE:
-- Before MERGE, filter the final CTE to only groups with actual changes
WITH __pgt_changed_groups AS (
SELECT * FROM __pgt_cte_agg_final_N
WHERE __pgt_action IN ('I', 'D') -- already done
)
MERGE INTO st USING __pgt_changed_groups ...
This is already present (the WHERE ... IS DISTINCT FROM guard). The issue may be that PostgreSQL MERGE evaluates all matched rows even when the WHEN clause filters them. Consider splitting to DELETE + INSERT with pre-filtered CTEs instead of MERGE.
Effort: 2 hours. Impact: 10-30% at low change rates where many groups are unchanged.
B-2: Partitioned MERGE for large deltas
Problem: At 100K/10%, the MERGE processes 10K delta rows in a single SQL statement. PostgreSQL’s planner may choose a nested-loop join strategy that scales poorly.
Fix: For delta sizes > 1000 rows, split into batched MERGEs by group key hash or row_id range:
-- Instead of one MERGE with 10K rows:
MERGE INTO st USING (SELECT * FROM delta WHERE __pgt_row_id % 4 = 0) ...
MERGE INTO st USING (SELECT * FROM delta WHERE __pgt_row_id % 4 = 1) ...
-- etc.
Effort: 3 hours. Impact: Speculative — depends on planner behavior. Risk of increased overhead from multiple statements.
Decision: Defer until B-1 and A-3 are evaluated.
B-3: Replace MERGE with DELETE + INSERT for large deltas
Problem: PostgreSQL MERGE has overhead from the WHEN MATCHED/NOT MATCHED branching and tuple visibility checks. For scenarios where most rows change, DELETE changed_rows; INSERT new_rows may be cheaper.
Fix: When delta covers >25% of stream table rows, use:
sql
DELETE FROM st WHERE (key_cols) IN (SELECT key_cols FROM delta);
INSERT INTO st SELECT ... FROM delta WHERE __pgt_action = 'I';
Effort: 2 hours. Impact: Potentially significant at 50% — eliminates MERGE planning overhead. Risk: two statements vs one.
Decision: Implement behind a GUC flag (pg_trickle.merge_strategy = 'auto'|'merge'|'delete_insert').
Phase C: Cleanup & Buffer Optimization (Priority: MEDIUM)
C-1: Async change buffer cleanup
Problem: Cleanup takes 0.5-7.7ms at 100K. Currently synchronous via TRUNCATE.
Fix: Move cleanup to a deferred callback or background worker tick. The change buffer is bounded by LSN range, so stale rows are harmless as long as the next refresh uses the correct frontier.
Effort: 2 hours. Impact: Saves 0.5-7ms per refresh, critical path reduction.
C-2: Trigger write amplification reduction
Problem: At high write rates, every source DML fires a trigger that INSERTs into the change buffer. This means every source INSERT/UPDATE/DELETE generates: - 1 WAL record for the source table - 1 trigger execution - 1 INSERT into the change buffer (+ WAL record + index maintenance)
Fix: Statement-level triggers with transition tables (PostgreSQL AFTER STATEMENT with referencing OLD/NEW TABLE). This batches all changes from a single statement into one buffer INSERT:
CREATE TRIGGER pg_trickle_cdc_tr AFTER INSERT OR UPDATE OR DELETE
ON source_table REFERENCING NEW TABLE AS new_rows OLD TABLE AS old_rows
FOR EACH STATEMENT EXECUTE FUNCTION pg_trickle_cdc_stmt_fn();
Effort: 8 hours (requires CDC rewrite). Impact: 50-80% reduction in trigger overhead at high write volumes. No change for single-row DML.
Decision: Defer to Part 9 — this is a major CDC architecture change.
Phase D: Planner Hints & Stability (Priority: MEDIUM)
D-1: Explicit join strategy hints for MERGE
Problem: P95 spikes in join scenarios (join 10K/10% P95 = 111.7ms vs median = 15.6ms) suggest planner instability.
Fix: Add SET LOCAL hints before MERGE execution:
sql
SET LOCAL enable_nestloop = off; -- for large deltas
SET LOCAL work_mem = '64MB'; -- for hash joins
Apply conditionally based on estimated delta size:
- delta < 100 rows: no hints (let planner optimize for small data)
- delta 100-10K rows: enable_nestloop = off
- delta > 10K rows: enable_nestloop = off + work_mem = 64MB
Effort: 2 hours. Impact: Should reduce P95 variability by 50%+.
D-2: SPI prepared statements
Problem: Every MERGE execution parses the SQL string. With template caching, the SQL is identical across cycles (only LSN placeholders differ), but PostgreSQL still parses it fresh each time.
Fix: Use SPI_prepare() + SPI_execute_plan() for the MERGE statement. Cache the plan handle alongside the SQL template.
Caveat: Previous attempt (H-W2 in Part 4) showed net-negative results because PostgreSQL’s custom plan mode re-plans every EXECUTE for the first 5 executions. Need ≥6 cycles per stream table before the generic plan locks in. With CYCLES=10 this should work.
Effort: 4 hours. Impact: Saves ~1-2ms parse time per refresh. Higher impact at low change rates where parse is a larger fraction of total time.
Decision: Implement and benchmark with CYCLES=20 to amortize plan cache warmup.
Phase E: No-Data Path Optimization (Priority: LOW)
E-1: Ultra-fast empty-buffer check
Problem: No-data latency = 9.73ms. Target is <10ms (barely passing).
Fix: Replace the current count(*) decision query with an EXISTS check that short-circuits on the first row:
sql
SELECT EXISTS(
SELECT 1 FROM pgtrickle_changes.changes_OID
WHERE lsn > $prev_lsn AND lsn <= $new_lsn
LIMIT 1
)
If this returns FALSE, skip all subsequent processing (no Gen, no Build, no MERGE, no Cleanup).
Effort: 1 hour. Impact: Should bring no-data from ~10ms to <3ms.
E-2: Shared memory fast-path for zero-change detection
Problem: Even the EXISTS check requires an SPI call into PostgreSQL.
Fix: Maintain a per-source-table atomic counter in shared memory (via PgAtomic). The CDC trigger increments it on every write. The refresh function reads it — if zero since last refresh, skip SPI entirely.
Effort: 3 hours. Impact: No-data latency → <1ms. Requires shared memory initialization.
Decision: Implement E-1 first; E-2 only if E-1 is insufficient.
Implementation Priority & Schedule
Session 1: Regression Triage (A-1, A-2, A-3) — ✅ COMPLETED
A-1: Rescan CTE not leaking — ✅ CONFIRMED. Added 6 unit tests verifying
no agg_rescan CTE is generated for SUM, COUNT(*), AVG, MIN, MAX, and
SUM+COUNT+AVG combined queries. The build_rescan_cte function correctly
returns None for algebraic-only aggregate lists. The rescan CTE infrastructure
does NOT contribute to the join/join_agg regressions.
A-2: Join regression root cause — ✅ IDENTIFIED. The join 100K/1% regression (12.3ms → 18.0ms) is a pre-existing issue from the PREPARE/EXECUTE revert (P1+P2 commits). The 11-CTE join delta query incurs significant planning overhead when planned fresh every cycle. This is NOT caused by recent changes (rescan CTE, TABLESAMPLE rejection, test fixes). Recovery requires restoring SPI prepared statements (Session 5, D-2).
A-3: Adaptive threshold bug — ✅ FIXED. last_full_ms was never set during
initial materialization, keeping it NULL forever for STs whose change rate
never exceeded the 15% fallback threshold. The auto-tuner code (if let
Some(last_full) = st.last_full_ms) was dead code for these STs. Fixed by
recording the initial materialization time as last_full_ms during
create_stream_table. This enables the auto-tuner from the first differential
refresh onward, which will correctly lower the threshold for scenarios like
join_agg at 10% where INCR is slower than FULL.
Also added debug logging when the auto-tuner adjusts the threshold.
Session 2: No-Data & Cleanup Fast Path (E-1, C-1) — ✅ COMPLETED
E-1: Ultra-fast EXISTS no-data short-circuit
- Replaced heavy LATERAL+capped-count decision query with two-phase approach
- Phase 1: Fast SELECT EXISTS(...) check — single SPI call for single-source;
UNION ALL wrapped in EXISTS() for multi-source (short-circuits on first row)
- Phase 2: Capped-count threshold check only runs when changes actually exist,
with early break on first source exceeding the FULL fallback threshold
- Expected outcome: No-data path avoids pg_class lookup and CASE expression;
should bring no-data latency well under 5ms
C-1: Deferred change buffer cleanup
- Added PendingCleanup struct + PENDING_CLEANUP thread-local queue
- Cleanup is now enqueued (near-zero cost) instead of executing inline
- drain_pending_cleanups() runs at the start of the NEXT refresh cycle
- Safety: LSN-range predicates in delta queries ensure stale rows are never
re-consumed, so deferred cleanup is fully safe
- Profiling label updated from cleanup to cleanup_enqueue to reflect deferral
- Expected outcome: 0.5–7ms savings on every differential refresh
Session 3: Planner Stability (D-1) — ✅ COMPLETED
D-1: Conditional SET LOCAL planner hints based on delta size
- Added apply_planner_hints() function with three tiers:
- delta < 100 rows: no hints (let planner optimise for small data)
- delta 100–9 999: SET LOCAL enable_nestloop = off (favour hash joins)
- delta >= 10 000: also SET LOCAL work_mem = '<N>MB' (avoid disk-spill)
- SET LOCAL is automatically reset at transaction end — no cleanup needed
- Accumulated total_change_count from the capped-count threshold loop to
feed the hint tier decision
- New GUCs:
- pg_trickle.merge_planner_hints (bool, default true) — master switch
- pg_trickle.merge_work_mem_mb (int, default 64) — work_mem for large deltas
- Profiling line now includes delta_est=<N> and hints=<tier> fields
- Expected outcome: P95 reduction for join/join_agg scenarios where
nested-loop plans cause latency spikes
Session 4: MERGE Strategy (B-1, B-3) — ✅ COMPLETED
B-1: IS DISTINCT FROM guard to skip no-op UPDATEs
- Added IS DISTINCT FROM check on the MERGE WHEN MATCHED ... THEN UPDATE
clause so unchanged rows are skipped entirely (no heap write)
- The guard is: AND (st.col1 IS DISTINCT FROM d.col1 OR st.col2 IS DISTINCT FROM d.col2 OR ...)
- Applied in both the prewarm_merge_cache path and the cache-miss build path
- Expected outcome: Eliminates unnecessary I/O for aggregate groups whose
recomputed values are identical to the current values
B-3: DELETE + INSERT alternative for large deltas (behind GUC)
- New GUC: pg_trickle.merge_strategy (string: auto/merge/delete_insert)
- In auto mode (default), switches to DELETE+INSERT when the estimated delta
exceeds 25% of the source table row count (MERGE_STRATEGY_AUTO_THRESHOLD)
- DELETE+INSERT is two statements:
1. DELETE FROM st WHERE __pgt_row_id IN (SELECT __pgt_row_id FROM delta)
2. INSERT INTO st SELECT ... FROM delta WHERE __pgt_action = 'I'
- Both MERGE and DELETE+INSERT templates are cached alongside each other in
CachedMergeTemplate, with strategy selection at execution time
- Profiling line now includes strategy=merge|delete_insert field
- Also accumulated total_table_size from the capped-count threshold loop
to feed the auto-strategy decision
- Expected outcome: 10–30% improvement at 50% change rates where MERGE’s
tuple visibility overhead dominates
Session 5: Prepared Statements (D-2) ✅ COMPLETED
- SQL PREPARE / EXECUTE for MERGE (not C-level SPI_prepare)
- New GUC:
pg_trickle.use_prepared_statements(default true) - Parameterized MERGE template with
$Npositional params for LSN values - PREPARE issued on first cache-hit, EXECUTE on subsequent cycles
- PostgreSQL switches from custom → generic plan after ~5 executions
- DEALLOCATE on cache invalidation; session-level lifetime
- 9 new unit tests (parameterize_lsn_template, build_prepare_type_list, build_execute_params)
- 841 unit tests passing, fmt+lint clean
- Expected outcome: 1-2ms savings per refresh on cache hits
Targets for Part 8
| Metric | Current | Target |
|---|---|---|
| scan 100K/1% | 4.6ms (65.9x) | < 5ms (maintain) |
| filter 100K/1% | 3.4ms (39.5x) | < 4ms (maintain) |
| aggregate 100K/1% | 2.5ms (6.8x) | < 3ms (maintain) |
| join 100K/1% | 18.0ms (16.3x) | < 12ms (recover) |
| join_agg 100K/1% | 14.6ms (1.9x) | < 10ms (recover) |
| join_agg 100K/10% | 95.4ms (0.3x) | > 1.0x (fix) |
| scan 100K/50% | 821.7ms (0.7x) | > 1.0x (fix) |
| No-data latency | 9.73ms | < 5ms |
| P95 / median ratio | up to 7.6x | < 3x |
Summary of Changes Since Part 7
Since the Part 6/7 benchmarks, the following code changes were made that may affect performance:
Rescan CTE (
src/dvm/operators/aggregate.rs): Addedbuild_rescan_cte(),child_to_from_sql(),agg_to_rescan_sql()for group-rescan aggregates. This adds a LEFT JOIN to the merge CTE for group-rescan aggregates (BIT_AND, STDDEV, etc). For algebraic aggregates (SUM/COUNT/AVG),build_rescan_ctereturnsNoneand no JOIN is added. The standard benchmark queries (SUM+COUNT) should not be affected, but needs verification (A-1).TABLESAMPLE rejection (
src/dvm/parser.rs): AddedT_RangeTableSamplecheck. No performance impact.Test changes: Updated E2E tests for PERCENTILE_CONT, STDDEV, EXISTS, window functions, JSON_OBJECT_AGG→JSONB_OBJECT_AGG. No runtime performance impact.
Docker image rebuild: The E2E image was rebuilt with all changes. Different
cargo buildcompilation could produce different binary characteristics.