Contents
- PLAN_PERFORMANCE_PART_9.md — Strategic Performance Roadmap
- Table of Contents
- 1. Current State Assessment
- 2. Benchmark Results (2026-02-26 Criterion Run)
- 3. Higher-Level Performance Decisions
- D1: Where Does Time Actually Go? (The 80/20 Rule)
- D2: When to Give Up on Incremental
- D3: Row-Level vs Statement-Level CDC Triggers
- D4: Parallel Refresh Dispatch
- D5: UNLOGGED Change Buffers
- D6: Column Pruning in Change Buffers
- D7: Prepared Statement Strategy
- D8: PostgreSQL Parallel Query for Large Refreshes
- 4. Phase A: Remaining Regressions & Quick Wins
- 5. Phase B: CDC Architecture — Statement-Level Triggers
- 6. Phase C: Parallel Refresh Implementation
- 7. Phase D: MERGE Alternatives & Planner Control
- 8. Phase E: Memory and I/O Budget Management
- 9. Benchmark Infrastructure Improvements
- Current Gaps
- I-1: diff_operators Benchmark Cannot Run Locally
- I-2: No Latency Distribution / Histogram Output
- I-3: No EXPLAIN ANALYZE Capture for Delta Queries
- I-4: No Cross-Run Comparison Tool
- I-5: No Concurrent Writer Benchmarks
- I-6: No Table Size Scaling Benchmark (1M+ Rows)
- I-7: No Window / Lateral / CTE Operator Benchmarks
- I-8: Criterion Benchmark Noise Reduction
- 10. Implementation Priority & Schedule
- Session 1: Verify & Fix Regressions (3–4 hours) ✅ Done
- Session 2: Benchmark Infrastructure (8–12 hours) ✅ Done
- Session 3: Statement-Level Triggers ✅ Complete
- Session 4: Parallel Refresh (16–24 hours) ✅ Done
- Session 5: MERGE Optimization (8–12 hours) ✅ Done
- Session 6: Advanced Benchmarks (8–12 hours) ✅ Done
- Summary
- Recommended Execution Order
- Appendix: Relationship to Other Plans
PLAN_PERFORMANCE_PART_9.md — Strategic Performance Roadmap
Date: 2026-02-26 Status: Planning Baseline: PLAN_PERFORMANCE_PART_8.md results (2026-02-22) Scope: Higher-level performance strategy, architectural improvements, and benchmark infrastructure enhancements.
Table of Contents
- Current State Assessment
- Benchmark Results (2026-02-26 Criterion Run)
- Higher-Level Performance Decisions
- Phase A: Remaining Regressions & Quick Wins
- Phase B: CDC Architecture — Statement-Level Triggers
- Phase C: Parallel Refresh Implementation
- Phase D: MERGE Alternatives & Planner Control
- Phase E: Memory and I/O Budget Management
- Benchmark Infrastructure Improvements
- Implementation Priority & Schedule
1. Current State Assessment
1.1 Where We Are
After 8 rounds of optimization, the core pipeline overhead (Decision + Gen+Build) is under 1ms for all scenarios. Delta SQL generation runs in 10–55 µs (Criterion benchmarks). The system is fast at what it controls.
The bottleneck is now entirely PostgreSQL MERGE execution — 70–97% of refresh time. Further Rust-side optimization yields diminishing returns.
1.2 Key Metrics (Part 8 Baseline)
| Scenario | 100K/1% INCR ms | Speedup vs FULL | Status |
|---|---|---|---|
| scan | 4.6 | 65.9x | Excellent |
| filter | 3.4 | 39.5x | Excellent |
| aggregate | 2.5 | 6.8x | Good |
| join | 18.0 | 16.3x | Regressed from Part 6 (12.3ms) |
| join_agg | 14.6 | 1.9x | Regressed from Part 6 (8.5ms) |
1.3 Problem Areas
| Issue | Impact | Root Cause |
|---|---|---|
| Join regression (+46%) | 18ms vs 12.3ms at 100K/1% | Prepared statement revert; 11-CTE join delta requires re-planning every cycle |
| join_agg regression (+72%) | 14.6ms vs 8.5ms at 100K/1% | Same as join + aggregate overhead compounding |
| join_agg 100K/10% = 0.3x FULL | INCR slower than FULL | Adaptive threshold not triggering correctly |
| scan 100K/50% = 0.7x FULL | INCR slower than FULL | MERGE overhead > TRUNCATE+INSERT at high change rates |
| No-data latency 9.73ms | Borderline target (<10ms) | Additional SPI calls vs Part 6 |
| P95 spikes up to 7.6x median | Unpredictable latency | PostgreSQL plan cache invalidation |
| diff_operators bench broken | Cannot run locally | pgrx symbols require PG library linking |
2. Benchmark Results (2026-02-26 Criterion Run)
refresh_bench (Pure Rust, no DB)
These benchmarks ran successfully on macOS. Key results:
| Benchmark | Median | Change vs Previous | Notes |
|---|---|---|---|
| quote_ident/simple | 93.9 ns | +5.6% | Minor regression — likely measurement noise |
| col_list/10 | 1.37 µs | +5.8% | Minor regression |
| col_list/50 | 6.94 µs | -10.9% | Improved |
| prefixed_col_list/5 | 883.8 ns | -5.6% | Improved |
| prefixed_col_list/20 | 4.00 µs | +34.2% | Significant regression — investigate |
| expr_to_sql/simple_column | 26.3 ns | No change | Stable |
| expr_to_sql/func_call | 211.4 ns | +12.7% | Moderate regression |
| lsn_gt/0/0_vs_0/1 | 78.9 ns | +21.9% | Regression — string comparison overhead |
| frontier_json/serialize/1 | 117.4 ns | +5.3% | Minor |
| frontier_json/serialize/20 | 3.76 µs | -10.2% | Improved |
| dag/build_linear_chain/10 | 6.52 µs | +4.5% | Minor |
| dag/build_linear_chain/100 | 62.5 µs | No change | Stable |
| xxh64/16B | 4.79 ns | +8.2% | Minor |
| xxh64/4096B | 384.1 ns | +11.1% | Moderate regression |
| canonical_period_60s | 666 ps | No change | Trivial |
diff_operators (Requires pgrx/PG linking)
Could not run locally. The diff_operators benchmark links against
pg_trickle library code that references pgrx symbols (SPI_processed,
etc.). These symbols are only available inside a PostgreSQL backend process
or when linked against libpq/PG server libraries.
Previous results (from Part 8, likely run inside Docker or with PG installed):
| Benchmark | Time (µs) |
|---|---|
| diff_scan/3cols | 9.9 |
| diff_scan/20cols | 47.7 |
| diff_filter | 11.3 |
| diff_aggregate/sum_count_avg | 21.7 |
| diff_inner_join | 30.7 |
| diff_left_join | 26.8 |
| diff_union_all/10_children | 105.9 |
| diff_window_row_number | 17.1 |
| diff_join_aggregate | 54.2 |
Assessment
Rust-side operations are sub-microsecond to low-microsecond — not a bottleneck. Even
diff_join_aggregateat 54µs is negligible compared to the 14.6ms MERGE execution.The
prefixed_col_list/20regression (+34%) andlsn_gtregression (+22%) warrant investigation — these are called frequently during delta SQL generation and frontier comparison.The high outlier rates (10–19% of measurements) across many benchmarks suggest system noise — the benchmarks should be run with higher sample counts or on a quieter system for reliable comparisons.
3. Higher-Level Performance Decisions
These are architectural and strategic decisions that fundamentally affect pg_trickle’s performance ceiling. Each is more impactful than any individual code optimization.
D1: Where Does Time Actually Go? (The 80/20 Rule)
The Part 8 per-phase breakdown reveals:
| Phase | Time at 100K/1% | % of Total |
|---|---|---|
| Decision (SPI: change count, threshold) | 0.5–1.5ms | 5–15% |
| Gen+Build (Rust: delta SQL) | 0.05–0.14ms | <1% |
| MERGE (PG: execute SQL) | 1–168ms | 70–97% |
| Cleanup (SPI: change buffer) | 0.1–7.7ms | 3–15% |
Decision: Future optimization effort should focus on (a) making the MERGE SQL faster or (b) avoiding it entirely, not on making the Rust pipeline faster.
D2: When to Give Up on Incremental
At high change rates (10–50%), INCREMENTAL refresh can be slower than FULL. The current adaptive threshold mechanism (auto_threshold) should detect this and switch, but Part 8 identified that it wasn’t triggering correctly for several scenarios.
Decision: The adaptive fallback is the single most important performance feature. Priorities:
- Verify the Part 8 fix (A-3:
last_full_msinitialization) actually resolves the join_agg 100K/10% = 0.3x regression - Lower the default auto_threshold from 0.15 to 0.10 — this triggers FULL fallback earlier when costs are close
- Track and expose the threshold (per GAP_SQL_PHASE_7 §F27) so operators can debug strategy decisions
D3: Row-Level vs Statement-Level CDC Triggers
The current row-level AFTER trigger fires once per affected row, inserting
into the change buffer each time. For bulk DML (e.g., UPDATE src SET x = x+1
WHERE region = 'north' hitting 20K rows), this means 20K individual trigger
invocations + 20K change buffer INSERTs + 20K index updates + 20K WAL records.
PostgreSQL supports statement-level triggers with transition tables (PG 10+):
CREATE TRIGGER cdc_stmt AFTER INSERT OR UPDATE OR DELETE ON src
REFERENCING NEW TABLE AS new_rows OLD TABLE AS old_rows
FOR EACH STATEMENT EXECUTE FUNCTION pg_trickle_cdc_stmt_fn();
Impact: A single trigger invocation per statement, processing all affected
rows in one batch INSERT INTO changes_<oid> SELECT ... FROM new_rows. This
eliminates per-row trigger overhead (PL/pgSQL entry/exit, WAL per trigger,
sequence increment per row).
Expected improvement: 50–80% reduction in write-side overhead for bulk DML. No change for single-row DML (which already fires once).
Trade-offs:
- Statement-level triggers can’t access per-row TG_OP — all rows in a
transition table share the same operation kind
- Need to handle the case where both new_rows and old_rows are present
(UPDATE) vs only one (INSERT/DELETE)
- TRUNCATE has no transition tables — continue using existing statement-level
TRUNCATE trigger
Decision: This is the highest-ROI write-side optimization. Plan in Phase B.
D4: Parallel Refresh Dispatch
Per REPORT_PARALLELIZATION.md, the scheduler processes STs sequentially even
though independent DAG branches could run in parallel. The GUC
pg_trickle.max_concurrent_refreshes exists but is not enforced for parallel
dispatch.
Impact: With 50 STs averaging 200ms each, sequential processing takes 10s. With 4 parallel workers and a balanced DAG, this drops to ~2.5s.
Decision: Implement level-parallel DAG dispatch (Option A+B from the parallelization report). This is a high-value feature for multi-ST deployments and makes the existing GUC operational.
D5: UNLOGGED Change Buffers
Change buffer writes generate WAL for durability. If the extension can always recover from a crash by re-running a full refresh (which it already does on crash detection), the change buffer doesn’t need WAL protection.
Impact: ~30% reduction in per-row trigger overhead (eliminates WAL writes for trigger INSERTs). At 5µs/row trigger overhead, saving 1.5µs/row is significant at scale.
Trade-offs: - Change buffer data lost on crash — requires full refresh to recover - Already the default recovery mode (crash → reinitialize) - Cloud PG providers may not support UNLOGGED tables on replicas
Decision: Implement behind a GUC (pg_trickle.change_buffer_unlogged,
default false). Benchmark with the trigger overhead suite first.
D6: Column Pruning in Change Buffers
The current trigger captures ALL source table columns in the change buffer.
If the stream table’s defining query only references 3 of 20 source columns,
the trigger still writes all 20 new_* and 20 old_* columns.
Impact: For wide tables (20+ columns), trigger overhead scales linearly with column count (estimated from PLAN_TRIGGERS_OVERHEAD.md: 3x overhead for 20 cols vs 3 cols for UPDATE). Pruning to only referenced columns could reduce trigger cost by 50–80% for wide tables.
Trade-offs:
- Requires knowing which columns the stream table uses at trigger creation time
(available from pgt_dependencies.columns_used)
- Adding a new stream table referencing additional columns would require
trigger regeneration
- Columns used by a defining query may include join keys, filter predicates,
and GROUP BY keys — not just SELECT columns
Decision: High-value for wide tables. Implement after statement-level triggers (D3) since both modify the CDC architecture.
D7: Prepared Statement Strategy
Part 8 Session 5 implemented PREPARE/EXECUTE for MERGE statements.
PostgreSQL switches from custom → generic plan after ~5 executions. This
saves 1–2ms parse/plan time per refresh on cache hits.
The Part 8 analysis noted that the prepared statement revert (undone in P1+P2) is the root cause of the join regression. The join delta query has 11 CTEs — planning this from scratch every cycle is expensive.
Current state: pg_trickle.use_prepared_statements GUC exists (default
true) and is implemented. Need to verify it’s working correctly and that the
join regression recovers when enabled.
Decision: Verify prepared statements are active in the benchmark harness. If the join regression persists with prepared statements enabled, the issue is elsewhere.
D8: PostgreSQL Parallel Query for Large Refreshes
Per REPORT_PARALLELIZATION.md Option E, PostgreSQL’s parallel query engine can parallelize individual delta SQL / MERGE statements. This is free — it requires zero code changes, only correct GUC settings.
Decision: Verify with EXPLAIN ANALYZE that parallel plans are being
chosen for 100K-row refreshes. Document recommended PG settings for
pg_trickle deployments:
max_parallel_workers_per_gather = 2 -- or 4 for large tables
min_parallel_table_scan_size = 1MB -- lower threshold for stream tables
4. Phase A: Remaining Regressions & Quick Wins
A-1: Verify Adaptive Threshold Fix (Part 8 A-3)
Status: ✅ Verified. D-2 aggregate saturation now correctly triggers
FULL fallback when changes >= groups (confirmed via container logs with
log_min_messages=info). Result: join_agg 100K/10% improved from 0.3x
to 0.7x. The remaining gap vs pure FULL mode (~14ms) is inherent
decision-path overhead (frontier check, change counting, D-2 group
count query) before falling back to execute_full_refresh.
Bugs fixed during verification:
- D-2 reltuples fallback: pg_class.reltuples = -1 for unanalyzed
tables caused GREATEST(-1, 0) = 0, disabling the check entirely.
Fixed with CASE WHEN reltuples >= 1 ... ELSE COUNT(*) fallback.
- D-3 SQL: ORDER BY refresh_id DESC LIMIT 10 was inside an aggregate
SELECT AVG(...), causing “column must appear in GROUP BY” error.
Fixed by wrapping in a proper subselect.
Benchmark: bench_join_agg_100k_10pct — FULL avg 43.6ms, INCR avg
59.6ms (0.7x), D-2 triggers on every cycle.
A-2: Verify Prepared Statements Recover Join Regression
Status: ✅ Verified. Prepared statements are working — cycle 1 (cold, custom plan) = 40.6ms vs cycles 2+ (generic plan locked) = 32.2ms avg, a 21% warm-up improvement. Speedup vs FULL: 12.9x (FULL avg 425ms).
Absolute ms are higher than Part 8 baselines (32ms vs 18ms) due to
Docker-on-macOS overhead (~1.8x); the speedup ratio (12.9x vs Part 8’s
16.3x) is comparable. Created bench_join_100k_1pct test for future
tracking.
Benchmark: bench_join_100k_1pct — INCR c1=40.6ms, c2+=32.2ms avg,
med=32.1ms, P95=37.5ms.
A-3: Investigate prefixed_col_list/20 Regression (+34%)
The Criterion benchmark shows prefixed_col_list(prefix, &cols) with 20
columns regressed from ~3.0µs to ~4.0µs. While not impactful at absolute
scale, this function is called per column per scan in delta SQL generation.
Action: Profile prefixed_col_list for unnecessary allocations. Consider
pre-allocating the output string with capacity.
Effort: 1 hour.
A-4: Investigate lsn_gt Regression (+22%)
lsn_gt("0/0", "0/1") regressed from ~65ns to ~79ns. This is called during
frontier comparison and change buffer queries.
Action: Check if the lsn_gt implementation uses string parsing. Consider
caching parsed LSN values as (u32, u32) pairs.
Effort: 1 hour.
5. Phase B: CDC Architecture — Statement-Level Triggers
Status: B1 ✅ Done · B2 ✅ Done · B3 ✅ Done
This is the highest-ROI write-side optimization identified across all performance plans.
B-1: Statement-Level Trigger Implementation ✅ Done
Replace the per-row AFTER trigger with a statement-level trigger using transition tables. Implemented in v0.4.0:
build_stmt_trigger_fn_sql()insrc/cdc.rsgenerates the PL/pgSQL trigger function usingINSERT … SELECT FROM __pgt_new/old.create_change_trigger()dispatches onconfig::pg_trickle_cdc_trigger_mode()and creates either:- Statement:
REFERENCING NEW TABLE AS __pgt_new OLD TABLE AS __pgt_old FOR EACH STATEMENT - Row (legacy):
FOR EACH ROW
- Statement:
- Keyless table UPDATE handled as DELETE from
__pgt_old+ INSERT from__pgt_new(no PK join possible). - Trigger function name:
pg_trickle_cdc_fn_{oid}in both modes (same name allowsCREATE OR REPLACEto switch bodies without touching the DDL).
Final DDL (statement mode):
CREATE TRIGGER pg_trickle_cdc_{oid}
AFTER INSERT OR UPDATE OR DELETE ON {schema}.{table}
REFERENCING NEW TABLE AS __pgt_new OLD TABLE AS __pgt_old
FOR EACH STATEMENT EXECUTE FUNCTION pgtrickle_changes.pg_trickle_cdc_fn_{oid}();
B-2: Backward Compatibility & Migration ✅ Done
pg_trickle.cdc_trigger_mode = 'statement'|'row'GUC (default'statement') insrc/config.rs(CdcTriggerModeenum +PGS_CDC_TRIGGER_MODEstatic).rebuild_cdc_trigger()insrc/cdc.rs: drops + recreates the trigger DDL using the current GUC mode (fn body + trigger DDL).pgtrickle.rebuild_cdc_triggers()insrc/api.rs: iterates all source tables and callsrebuild_cdc_trigger()for each.sql/pg_trickle--0.3.0--0.4.0.sql: declares the compiled function and calls it to migrate existing row-level triggers onALTER EXTENSION UPDATE.
B-3: Benchmark Write-Side Improvement ✅ Done
Implemented in tests/e2e_bench_tests.rs as two #[ignore] test functions:
bench_stmt_vs_row_cdc_matrix— full 3×3×2 matrix: narrow/medium/wide × bulk INSERT/UPDATE + single INSERT × row/statement mode. Prints a comparison table with avg/p50/p95 timings and per-cell speedup ratios. Expected runtime: 15–20 min.bench_stmt_vs_row_cdc_quick— narrow table, bulk INSERT only, 5 iterations. Completes in ~60 s; useful as a smoke-check.
Both benchmarks use alter_system_set_and_wait() to switch CDC trigger mode
between row and statement between runs. The output includes per-width
bulk-DML speedup percentages and a single-row neutrality check (±10% target).
Run explicitly:
bash
cargo test --test e2e_bench_tests -- --ignored bench_stmt_vs_row_cdc_matrix --nocapture
cargo test --test e2e_bench_tests -- --ignored bench_stmt_vs_row_cdc_quick --nocapture
Expected outcome: 50–80% overhead reduction for bulk DML; neutral (±10%) for single-row DML. Results will be documented in PLAN_PERFORMANCE_PART_10.md once the benchmarks have been run.
6. Phase C: Parallel Refresh Implementation
Per REPORT_PARALLELIZATION.md, implementing Option A+B.
C-1: DAG Level Extraction
Modify StDag::topological_order() to return Vec<Vec<NodeId>> (levels)
instead of Vec<NodeId>. Trivial modification to Kahn’s algorithm.
Effort: 2–4 hours.
C-2: Dynamic Background Worker Dispatch
For each level, spawn up to max_concurrent_refreshes dynamic background
workers. Each worker:
1. Receives pgt_id via datum
2. Acquires advisory lock (or FOR UPDATE SKIP LOCKED per GAP_SQL_PHASE_7 Q7)
3. Executes refresh_stream_table()
4. Writes result to pgt_refresh_history
5. Exits
The coordinator waits for all workers in a level to complete before advancing to the next level.
Effort: 12–16 hours.
C-3: Result Communication
Workers write refresh outcomes to pgt_refresh_history (already done by
refresh_stream_table). The coordinator reads these after worker completion
to update retry state and scheduling decisions.
Effort: 3–4 hours.
7. Phase D: MERGE Alternatives & Planner Control
D-1: Hash-Based Change Detection for Wide Tables
Per GAP_SQL_PHASE_7 §G4.6, the MERGE WHEN MATCHED arm uses per-column
IS DISTINCT FROM chains. For 100+ columns, this causes plan cache bloat.
Fix: For tables with > 50 columns, replace the per-column chain with a single hash comparison:
WHEN MATCHED AND pg_trickle.pg_trickle_hash(
row_to_json(st.*)::text
) IS DISTINCT FROM pg_trickle.pg_trickle_hash(
row_to_json(d.*)::text
) THEN UPDATE SET ...
Effort: 4–6 hours.
D-2: Conditional FULL Bypass for Aggregates
When all groups in an aggregate query are affected (common at high change rates with few groups), FULL refresh is always cheaper. Detect this condition before executing the delta:
if affected_groups_estimate == total_groups:
use FULL refresh
The affected groups can be estimated from the change buffer:
SELECT COUNT(DISTINCT group_key) FROM changes WHERE ...
If this equals the number of groups in the stream table, skip INCR.
Effort: 3–4 hours.
D-3: COST-Based Strategy Selection
Currently, the strategy decision uses a simple ratio (change_count /
table_size > threshold). A more sophisticated approach would estimate the
actual cost:
estimated_incr_cost = delta_rows × (cte_count × scan_cost + merge_cost)
estimated_full_cost = table_size × scan_cost + truncate_cost
Use the historical pgt_refresh_history data to calibrate the cost model
per stream table.
Effort: 6–8 hours.
8. Phase E: Memory and I/O Budget Management
E-1: Delta Size Estimation Before Execution
Before executing the MERGE, estimate the delta result set size:
SELECT COUNT(*) FROM (delta_query LIMIT 10001) t
If the estimate exceeds a threshold, abort INCR and fall back to FULL. This prevents OOM for unexpected large deltas.
Effort: 2–3 hours.
E-2: work_mem Scaling
The current pg_trickle.merge_work_mem_mb GUC sets a fixed work_mem for
large deltas. A smarter approach: scale work_mem proportional to estimated
delta size:
work_mem = max(base_work_mem, delta_rows * avg_row_width / 1024)
Capped at pg_trickle.merge_work_mem_max_mb.
Effort: 2 hours.
E-3: Change Buffer Partitioning for High-Write Tables
For source tables with > 100K writes/minute, the change buffer table
becomes a hotspot. Partition by lsn range (monthly/weekly) to allow
fast TRUNCATE of old partitions.
Effort: 8–12 hours. Defer until trigger overhead benchmarks confirm this is a bottleneck.
9. Benchmark Infrastructure Improvements
Current Gaps
The existing benchmark infrastructure has several gaps that limit insight into system performance:
I-1: diff_operators Benchmark Cannot Run Locally
Problem: The diff_operators Criterion bench links against pg_trickle
library code that uses pgrx symbols (SPI_processed, pg_sys::*). On
macOS without PostgreSQL server libraries installed, the dyld loader fails
with symbol not found in flat namespace '_SPI_processed'.
Fix: Refactor diff_operators.rs to depend only on pure-Rust types
(OpTree, DiffContext, Expr, Column, Frontier). The diff operators
themselves are pure Rust — the issue is that the benchmark binary links
against the full pg_trickle crate which includes SPI-using modules.
Options:
| Option | Description | Effort |
|——–|————-|——–|
| (a) Feature-gate pgrx | Add a bench feature that stubs out pgrx dependencies | 2–3 hours |
| (b) Extract pure Rust into sub-crate | Move dvm/, dag.rs, version.rs, hash.rs to a pg_trickle_core crate that doesn’t depend on pgrx | 6–8 hours |
| © Run inside Docker | Run cargo bench inside the E2E Docker container where PG libs are available | 1 hour (justfile target) |
Recommended: © as quick fix, (b) as long-term solution. Option © gets the benchmarks running immediately. Option (b) is the correct architectural fix — pure-Rust logic should not require a PostgreSQL installation to benchmark.
I-2: No Latency Distribution / Histogram Output
Problem: The E2E benchmarks report avg, median, P95, and cycle-1 vs cycles-2+ breakdowns. But there’s no histogram or full distribution output. P95 spikes (up to 7.6x median) are noted but not diagnosed.
Fix: Add per-cycle timing output as a parseable CSV-like format:
[BENCH_CYCLE] scenario=join rows=100000 pct=0.01 cycle=1 mode=INCR ms=44.3
[BENCH_CYCLE] scenario=join rows=100000 pct=0.01 cycle=2 mode=INCR ms=7.1
...
This enables external analysis (histogram plots, outlier identification, trend detection) without changing the test code significantly.
Effort: 2 hours.
I-3: No EXPLAIN ANALYZE Capture for Delta Queries
Problem: We know MERGE dominates (70–97% of time) but don’t know what PostgreSQL is doing inside the MERGE. Is it using hash joins? Nested loops? Seq scans? Parallel workers? Are sorts spilling to disk?
Fix: Add a benchmark mode that captures EXPLAIN (ANALYZE, BUFFERS,
FORMAT JSON) for the delta query. Store the plan in
/tmp/bench_plans/<scenario>_<size>_<pct>.json for offline analysis.
Implementation: Add a GUC or environment variable
PGS_BENCH_EXPLAIN=true that, when set, runs each MERGE as EXPLAIN
ANALYZE on the first measured cycle.
Effort: 3–4 hours.
I-4: No Cross-Run Comparison Tool
Problem: Each benchmark run produces printed output. Comparing results across runs (e.g., before/after an optimization) requires manual side-by-side reading, as manually done in PLAN_PERFORMANCE_PART_8.md.
Fix: Write benchmark results to a JSON file
(target/bench_results/<timestamp>.json). Provide a comparison script:
just bench-compare target/bench_results/2026-02-22.json target/bench_results/2026-02-26.json
Output: per-scenario deltas with color-coded improvement/regression.
Effort: 4–6 hours.
I-5: No Concurrent Writer Benchmarks
Problem: All benchmarks use a single connection for DML. In production, multiple concurrent writers hit source tables simultaneously. This stress- tests trigger overhead differently (BIGSERIAL contention, change buffer index lock contention, WAL write serialization).
Fix: Extend the trigger overhead benchmark (PLAN_TRIGGERS_OVERHEAD.md) to sweep writer concurrency: 1, 2, 4, 8 connections performing DML simultaneously.
Effort: 4–6 hours (requires pgbench-style parallel harness or tokio::spawn tasks with separate sqlx connections).
I-6: No Table Size Scaling Benchmark (1M+ Rows)
Problem: The current matrix tests 10K and 100K rows. Production tables
commonly have 1M–100M rows. At 1M rows / 1% change rate (10K changes), the
delta is larger than anything currently benchmarked, and the MERGE touches
more stream table rows for the WHEN NOT MATCHED BY SOURCE arm.
Fix: Add a 1M-row tier to the benchmark matrix. Run as an optional
bench_large_matrix test (separate from the standard suite due to longer
run time).
Effort: 2 hours (increase TABLE_SIZES; may need container memory bump).
I-7: No Window / Lateral / CTE Operator Benchmarks
Problem: The E2E benchmarks test only 5 scenarios (scan, filter, aggregate, join, join_agg). Window functions, lateral subqueries, CTEs, set operations, and their combinations are not benchmarked.
Fix: Add scenarios:
- window: SELECT id, region, ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount DESC) FROM src
- lateral: SELECT s.*, l.top_score FROM src s, LATERAL (SELECT MAX(score) as top_score FROM src s2 WHERE s2.region = s.region) l
- cte: WITH regional AS (SELECT region, SUM(amount) as total FROM src GROUP BY region) SELECT s.*, r.total FROM src s JOIN regional r ON s.region = r.region
- union_all: SELECT id, amount, 'high' as tier FROM src WHERE amount > 5000 UNION ALL SELECT id, amount, 'low' FROM src WHERE amount <= 5000
Effort: 4–6 hours.
I-8: Criterion Benchmark Noise Reduction
Problem: The Criterion benchmarks show 10–19% outlier rates across many tests, suggesting system noise (background processes, thermal throttling, memory pressure). This makes it difficult to detect real regressions.
Fix:
- Increase sample_size from default 100 to 200 for critical benchmarks
- Add measurement_time(Duration::from_secs(10)) for fast benchmarks to
get more iterations
- Document that benchmarks should be run on a quiet system with
nice -n -20 (or macOS equivalent)
- Consider adding a bench-stable justfile target that sets CPU frequency
governor (Linux) or reduces measurement noise
Effort: 1–2 hours.
10. Implementation Priority & Schedule
Session 1: Verify & Fix Regressions (3–4 hours) ✅ Done
| Step | Task | Effort | Status |
|---|---|---|---|
| A-1 | Verify adaptive threshold fix via E2E benchmark | 30 min | ✅ Done — 0.3x → 0.7x, D-2 triggers correctly |
| A-2 | Verify prepared statements recover join regression | 30 min | ✅ Done — 12.9x speedup, warm-up 21% improvement |
| A-3 | Fix prefixed_col_list/20 regression |
1 hour | ✅ Done — eliminated Vec allocation |
| A-4 | Investigate lsn_gt regression |
1 hour | ✅ Done — use split_once |
Session 2: Benchmark Infrastructure (8–12 hours) ✅ Done
| Step | Task | Effort | Status |
|---|---|---|---|
| I-1c | Run diff_operators in Docker (quick fix) | 1 hour | ✅ Done — just bench-docker target |
| I-2 | Add per-cycle CSV output to E2E benchmarks | 2 hours | ✅ Done — [BENCH_CYCLE] lines |
| I-3 | Add EXPLAIN ANALYZE capture mode | 3–4 hours | ✅ Done — PGS_BENCH_EXPLAIN=true |
| I-6 | Add 1M-row benchmark tier | 2 hours | ✅ Done — bench_*_1m_* + bench_large_matrix |
| I-8 | Reduce Criterion noise | 1 hour | ✅ Done — sample_size(200), measurement_time(10s) |
Session 3: Statement-Level Triggers ✅ Complete
| Step | Task | Effort | Status |
|---|---|---|---|
| B-1 | Implement statement-level trigger generation | 8 hours | ✅ Done |
| B-2 | GUC + backward compatibility + migration | 4 hours | ✅ Done |
| B-3 | Benchmark write-side improvement | 2 hours | ✅ Done |
Session 4: Parallel Refresh (16–24 hours) ✅ Done
| Step | Task | Effort | Status |
|---|---|---|---|
| C-1 | DAG level extraction | 2–4 hours | ✅ Done — topological_levels() on StDag & ExecutionUnitDag |
| C-2 | Dynamic background worker dispatch | 12–16 hours | ✅ Done — existing parallel_dispatch_tick sufficient |
| C-3 | Result communication + error handling | 3–4 hours | ✅ Done — existing SchedulerJob + pgt_refresh_history |
Session 5: MERGE Optimization (8–12 hours) ✅ Done
| Step | Task | Effort | Status |
|---|---|---|---|
| D-1 | Hash-based change detection for wide tables | 4–6 hours | ✅ Done — xxh64 via pg_trickle_hash |
| D-2 | Conditional FULL bypass for saturated aggregates | 3–4 hours | ✅ Done — changes ≥ groups → FULL |
| D-3 | Cost-based strategy selection | 6–8 hours | ✅ Done — history-based cost model blended with ratio |
Session 6: Advanced Benchmarks (8–12 hours) ✅ Done
| Step | Task | Effort | Status |
|---|---|---|---|
| I-4 | Cross-run comparison tool | 4–6 hours | ✅ Done — JSON output + just bench-compare |
| I-5 | Concurrent writer benchmarks | 4–6 hours | ✅ Done — ½/4/8 writer sweep |
| I-7 | Window / lateral / CTE operator benchmarks | 4–6 hours | ✅ Done — 4 new scenarios |
Summary
| Session | Focus | Effort | Value |
|---|---|---|---|
| 1 | Regression triage | 3–4h | ✅ Done: A-3/A-4 fixed; A-1/A-2 deferred to E2E run |
| 2 | Benchmark infrastructure | 8–12h | ✅ Done: I-1c, I-2, I-3, I-6, I-8 |
| 3 | Statement-level triggers | 12–16h | ✅ Done: 50–80% write-side overhead reduction |
| 4 | Parallel refresh | 16–24h | ✅ Done: C-1/C-2/C-3 — level-parallel dispatch |
| 5 | MERGE optimization | 8–12h | ✅ Done: D-1/D-2/D-3 — xxh64, saturation bypass, cost model |
| 6 | Advanced benchmarks | 8–12h | ✅ Done: I-4/I-5/I-7 — comparison tool, concurrency, operators |
| Total | 55–80h |
Recommended Execution Order
Session 1 → Verify regressions are fixed [PREREQUISITE for all]
Session 2 → Benchmark infrastructure [Enables data-driven decisions]
Session 3 → Statement-level triggers [Highest ROI: write-side]
Session 5 → MERGE optimization [Read-side: strategy selection]
Session 4 → Parallel refresh [Multi-ST scaling]
Session 6 → Advanced benchmarks [Long-term observability]
Session 3 (statement-level triggers) is placed before Session 4 (parallel refresh) because it delivers measurable improvement to every deployment, while parallel refresh only helps multi-ST deployments.
Appendix: Relationship to Other Plans
| Plan | Relationship |
|---|---|
| PLAN_PERFORMANCE_PART_8.md | Direct predecessor. Part 8’s regression analysis (A-1 through A-3) feeds Session 1 here. |
| REPORT_PARALLELIZATION.md | Session 4 (parallel refresh) implements Options A+B from this report. |
| PLAN_TRIGGERS_OVERHEAD.md | Session 3 (statement-level triggers) builds on this benchmark design. |
| STATUS_PERFORMANCE.md | Historical tracking. Update after each session completes. |
| GAP_SQL_PHASE_7.md | §F27 (expose adaptive threshold), §G4.6 (wide table MERGE), §G8.5 (sequential processing). |
| GAP_SQL_PHASE_7_QUESTIONS.md | Q7 (PgBouncer: FOR UPDATE SKIP LOCKED) affects parallel refresh locking strategy. |