PLAN_PERFORMANCE_PART_9.md — Strategic Performance Roadmap

PLAN_PERFORMANCE_PART_9.md — Strategic Performance Roadmap

Date: 2026-02-26 Status: Planning Baseline: PLAN_PERFORMANCE_PART_8.md results (2026-02-22) Scope: Higher-level performance strategy, architectural improvements, and benchmark infrastructure enhancements.

Current State Assessment
Benchmark Results (2026-02-26 Criterion Run)
Higher-Level Performance Decisions
Phase A: Remaining Regressions & Quick Wins
Phase B: CDC Architecture — Statement-Level Triggers
Phase C: Parallel Refresh Implementation
Phase D: MERGE Alternatives & Planner Control
Phase E: Memory and I/O Budget Management
Benchmark Infrastructure Improvements
Implementation Priority & Schedule

1. Current State Assessment

1.1 Where We Are

After 8 rounds of optimization, the core pipeline overhead (Decision + Gen+Build) is under 1ms for all scenarios. Delta SQL generation runs in 10–55 µs (Criterion benchmarks). The system is fast at what it controls.

The bottleneck is now entirely PostgreSQL MERGE execution — 70–97% of refresh time. Further Rust-side optimization yields diminishing returns.

1.2 Key Metrics (Part 8 Baseline)

Scenario	100K/1% INCR ms	Speedup vs FULL	Status
scan	4.6	65.9x	Excellent
filter	3.4	39.5x	Excellent
aggregate	2.5	6.8x	Good
join	18.0	16.3x	Regressed from Part 6 (12.3ms)
join_agg	14.6	1.9x	Regressed from Part 6 (8.5ms)

1.3 Problem Areas

Issue	Impact	Root Cause
Join regression (+46%)	18ms vs 12.3ms at 100K/1%	Prepared statement revert; 11-CTE join delta requires re-planning every cycle
join_agg regression (+72%)	14.6ms vs 8.5ms at 100K/1%	Same as join + aggregate overhead compounding
join_agg 100K/10% = 0.3x FULL	INCR slower than FULL	Adaptive threshold not triggering correctly
scan 100K/50% = 0.7x FULL	INCR slower than FULL	MERGE overhead > TRUNCATE+INSERT at high change rates
No-data latency 9.73ms	Borderline target (<10ms)	Additional SPI calls vs Part 6
P95 spikes up to 7.6x median	Unpredictable latency	PostgreSQL plan cache invalidation
diff_operators bench broken	Cannot run locally	pgrx symbols require PG library linking

2. Benchmark Results (2026-02-26 Criterion Run)

refresh_bench (Pure Rust, no DB)

These benchmarks ran successfully on macOS. Key results:

Benchmark	Median	Change vs Previous	Notes
quote_ident/simple	93.9 ns	+5.6%	Minor regression — likely measurement noise
col_list/10	1.37 µs	+5.8%	Minor regression
col_list/50	6.94 µs	-10.9%	Improved
prefixed_col_list/5	883.8 ns	-5.6%	Improved
prefixed_col_list/20	4.00 µs	+34.2%	Significant regression — investigate
expr_to_sql/simple_column	26.3 ns	No change	Stable
expr_to_sql/func_call	211.4 ns	+12.7%	Moderate regression
lsn_gt/0/0_vs_0/1	78.9 ns	+21.9%	Regression — string comparison overhead
frontier_json/serialize/1	117.4 ns	+5.3%	Minor
frontier_json/serialize/20	3.76 µs	-10.2%	Improved
dag/build_linear_chain/10	6.52 µs	+4.5%	Minor
dag/build_linear_chain/100	62.5 µs	No change	Stable
xxh64/16B	4.79 ns	+8.2%	Minor
xxh64/4096B	384.1 ns	+11.1%	Moderate regression
canonical_period_60s	666 ps	No change	Trivial

diff_operators (Requires pgrx/PG linking)

Could not run locally. The diff_operators benchmark links against pg_trickle library code that references pgrx symbols (SPI_processed, etc.). These symbols are only available inside a PostgreSQL backend process or when linked against libpq/PG server libraries.

Previous results (from Part 8, likely run inside Docker or with PG installed):

Benchmark	Time (µs)
diff_scan/3cols	9.9
diff_scan/20cols	47.7
diff_filter	11.3
diff_aggregate/sum_count_avg	21.7
diff_inner_join	30.7
diff_left_join	26.8
diff_union_all/10_children	105.9
diff_window_row_number	17.1
diff_join_aggregate	54.2

Assessment

Rust-side operations are sub-microsecond to low-microsecond — not a bottleneck. Even diff_join_aggregate at 54µs is negligible compared to the 14.6ms MERGE execution.
The prefixed_col_list/20 regression (+34%) and lsn_gt regression (+22%) warrant investigation — these are called frequently during delta SQL generation and frontier comparison.
The high outlier rates (10–19% of measurements) across many benchmarks suggest system noise — the benchmarks should be run with higher sample counts or on a quieter system for reliable comparisons.

3. Higher-Level Performance Decisions

These are architectural and strategic decisions that fundamentally affect pg_trickle’s performance ceiling. Each is more impactful than any individual code optimization.

D1: Where Does Time Actually Go? (The 80/20 Rule)

The Part 8 per-phase breakdown reveals:

Phase	Time at 100K/1%	% of Total
Decision (SPI: change count, threshold)	0.5–1.5ms	5–15%
Gen+Build (Rust: delta SQL)	0.05–0.14ms	<1%
MERGE (PG: execute SQL)	1–168ms	70–97%
Cleanup (SPI: change buffer)	0.1–7.7ms	3–15%

Decision: Future optimization effort should focus on (a) making the MERGE SQL faster or (b) avoiding it entirely, not on making the Rust pipeline faster.

D2: When to Give Up on Incremental

At high change rates (10–50%), INCREMENTAL refresh can be slower than FULL. The current adaptive threshold mechanism (auto_threshold) should detect this and switch, but Part 8 identified that it wasn’t triggering correctly for several scenarios.

Decision: The adaptive fallback is the single most important performance feature. Priorities:

Verify the Part 8 fix (A-3: last_full_ms initialization) actually resolves the join_agg 100K/10% = 0.3x regression
Lower the default auto_threshold from 0.15 to 0.10 — this triggers FULL fallback earlier when costs are close
Track and expose the threshold (per GAP_SQL_PHASE_7 §F27) so operators can debug strategy decisions

D3: Row-Level vs Statement-Level CDC Triggers

The current row-level AFTER trigger fires once per affected row, inserting into the change buffer each time. For bulk DML (e.g., UPDATE src SET x = x+1 WHERE region = 'north' hitting 20K rows), this means 20K individual trigger invocations + 20K change buffer INSERTs + 20K index updates + 20K WAL records.

PostgreSQL supports statement-level triggers with transition tables (PG 10+):

CREATE TRIGGER cdc_stmt AFTER INSERT OR UPDATE OR DELETE ON src
REFERENCING NEW TABLE AS new_rows OLD TABLE AS old_rows
FOR EACH STATEMENT EXECUTE FUNCTION pg_trickle_cdc_stmt_fn();

Impact: A single trigger invocation per statement, processing all affected rows in one batch INSERT INTO changes_<oid> SELECT ... FROM new_rows. This eliminates per-row trigger overhead (PL/pgSQL entry/exit, WAL per trigger, sequence increment per row).

Expected improvement: 50–80% reduction in write-side overhead for bulk DML. No change for single-row DML (which already fires once).

Trade-offs: - Statement-level triggers can’t access per-row TG_OP — all rows in a transition table share the same operation kind - Need to handle the case where both new_rows and old_rows are present (UPDATE) vs only one (INSERT/DELETE) - TRUNCATE has no transition tables — continue using existing statement-level TRUNCATE trigger

Decision: This is the highest-ROI write-side optimization. Plan in Phase B.

D4: Parallel Refresh Dispatch

Per REPORT_PARALLELIZATION.md, the scheduler processes STs sequentially even though independent DAG branches could run in parallel. The GUC pg_trickle.max_concurrent_refreshes exists but is not enforced for parallel dispatch.

Impact: With 50 STs averaging 200ms each, sequential processing takes 10s. With 4 parallel workers and a balanced DAG, this drops to ~2.5s.

Decision: Implement level-parallel DAG dispatch (Option A+B from the parallelization report). This is a high-value feature for multi-ST deployments and makes the existing GUC operational.

D5: UNLOGGED Change Buffers

Change buffer writes generate WAL for durability. If the extension can always recover from a crash by re-running a full refresh (which it already does on crash detection), the change buffer doesn’t need WAL protection.

Impact: ~30% reduction in per-row trigger overhead (eliminates WAL writes for trigger INSERTs). At 5µs/row trigger overhead, saving 1.5µs/row is significant at scale.

Trade-offs: - Change buffer data lost on crash — requires full refresh to recover - Already the default recovery mode (crash → reinitialize) - Cloud PG providers may not support UNLOGGED tables on replicas

Decision: Implement behind a GUC (pg_trickle.change_buffer_unlogged, default false). Benchmark with the trigger overhead suite first.

D6: Column Pruning in Change Buffers

The current trigger captures ALL source table columns in the change buffer. If the stream table’s defining query only references 3 of 20 source columns, the trigger still writes all 20 new_* and 20 old_* columns.

Impact: For wide tables (20+ columns), trigger overhead scales linearly with column count (estimated from PLAN_TRIGGERS_OVERHEAD.md: 3x overhead for 20 cols vs 3 cols for UPDATE). Pruning to only referenced columns could reduce trigger cost by 50–80% for wide tables.

Trade-offs: - Requires knowing which columns the stream table uses at trigger creation time (available from pgt_dependencies.columns_used) - Adding a new stream table referencing additional columns would require trigger regeneration - Columns used by a defining query may include join keys, filter predicates, and GROUP BY keys — not just SELECT columns

Decision: High-value for wide tables. Implement after statement-level triggers (D3) since both modify the CDC architecture.

D7: Prepared Statement Strategy

Part 8 Session 5 implemented PREPARE/EXECUTE for MERGE statements. PostgreSQL switches from custom → generic plan after ~5 executions. This saves 1–2ms parse/plan time per refresh on cache hits.

The Part 8 analysis noted that the prepared statement revert (undone in P1+P2) is the root cause of the join regression. The join delta query has 11 CTEs — planning this from scratch every cycle is expensive.

Current state: pg_trickle.use_prepared_statements GUC exists (default true) and is implemented. Need to verify it’s working correctly and that the join regression recovers when enabled.

Decision: Verify prepared statements are active in the benchmark harness. If the join regression persists with prepared statements enabled, the issue is elsewhere.

D8: PostgreSQL Parallel Query for Large Refreshes

Per REPORT_PARALLELIZATION.md Option E, PostgreSQL’s parallel query engine can parallelize individual delta SQL / MERGE statements. This is free — it requires zero code changes, only correct GUC settings.

Decision: Verify with EXPLAIN ANALYZE that parallel plans are being chosen for 100K-row refreshes. Document recommended PG settings for pg_trickle deployments:

max_parallel_workers_per_gather = 2  -- or 4 for large tables
min_parallel_table_scan_size = 1MB   -- lower threshold for stream tables

4. Phase A: Remaining Regressions & Quick Wins

A-1: Verify Adaptive Threshold Fix (Part 8 A-3)

Status: ✅ Verified. D-2 aggregate saturation now correctly triggers FULL fallback when changes >= groups (confirmed via container logs with log_min_messages=info). Result: join_agg 100K/10% improved from 0.3x to 0.7x. The remaining gap vs pure FULL mode (~14ms) is inherent decision-path overhead (frontier check, change counting, D-2 group count query) before falling back to execute_full_refresh.

Bugs fixed during verification: - D-2 reltuples fallback: pg_class.reltuples = -1 for unanalyzed tables caused GREATEST(-1, 0) = 0, disabling the check entirely. Fixed with CASE WHEN reltuples >= 1 ... ELSE COUNT(*) fallback. - D-3 SQL: ORDER BY refresh_id DESC LIMIT 10 was inside an aggregate SELECT AVG(...), causing “column must appear in GROUP BY” error. Fixed by wrapping in a proper subselect.

Benchmark: bench_join_agg_100k_10pct — FULL avg 43.6ms, INCR avg 59.6ms (0.7x), D-2 triggers on every cycle.

A-2: Verify Prepared Statements Recover Join Regression

Status: ✅ Verified. Prepared statements are working — cycle 1 (cold, custom plan) = 40.6ms vs cycles 2+ (generic plan locked) = 32.2ms avg, a 21% warm-up improvement. Speedup vs FULL: 12.9x (FULL avg 425ms).

Absolute ms are higher than Part 8 baselines (32ms vs 18ms) due to Docker-on-macOS overhead (~1.8x); the speedup ratio (12.9x vs Part 8’s 16.3x) is comparable. Created bench_join_100k_1pct test for future tracking.

Benchmark: bench_join_100k_1pct — INCR c1=40.6ms, c2+=32.2ms avg, med=32.1ms, P95=37.5ms.

A-3: Investigate prefixed_col_list/20 Regression (+34%)

The Criterion benchmark shows prefixed_col_list(prefix, &cols) with 20 columns regressed from ~3.0µs to ~4.0µs. While not impactful at absolute scale, this function is called per column per scan in delta SQL generation.

Action: Profile prefixed_col_list for unnecessary allocations. Consider pre-allocating the output string with capacity.

Effort: 1 hour.

A-4: Investigate lsn_gt Regression (+22%)

lsn_gt("0/0", "0/1") regressed from ~65ns to ~79ns. This is called during frontier comparison and change buffer queries.

Action: Check if the lsn_gt implementation uses string parsing. Consider caching parsed LSN values as (u32, u32) pairs.

Effort: 1 hour.

5. Phase B: CDC Architecture — Statement-Level Triggers

Status: B1 ✅ Done · B2 ✅ Done · B3 ✅ Done

This is the highest-ROI write-side optimization identified across all performance plans.

B-1: Statement-Level Trigger Implementation ✅ Done

Replace the per-row AFTER trigger with a statement-level trigger using transition tables. Implemented in v0.4.0:

build_stmt_trigger_fn_sql() in src/cdc.rs generates the PL/pgSQL trigger function using INSERT … SELECT FROM __pgt_new/old.
create_change_trigger() dispatches on config::pg_trickle_cdc_trigger_mode() and creates either:
- Statement: REFERENCING NEW TABLE AS __pgt_new OLD TABLE AS __pgt_old FOR EACH STATEMENT
- Row (legacy): FOR EACH ROW
Keyless table UPDATE handled as DELETE from __pgt_old + INSERT from __pgt_new (no PK join possible).
Trigger function name: pg_trickle_cdc_fn_{oid} in both modes (same name allows CREATE OR REPLACE to switch bodies without touching the DDL).

Final DDL (statement mode):

CREATE TRIGGER pg_trickle_cdc_{oid}
AFTER INSERT OR UPDATE OR DELETE ON {schema}.{table}
REFERENCING NEW TABLE AS __pgt_new OLD TABLE AS __pgt_old
FOR EACH STATEMENT EXECUTE FUNCTION pgtrickle_changes.pg_trickle_cdc_fn_{oid}();

B-2: Backward Compatibility & Migration ✅ Done

pg_trickle.cdc_trigger_mode = 'statement'|'row' GUC (default 'statement') in src/config.rs (CdcTriggerMode enum + PGS_CDC_TRIGGER_MODE static).
rebuild_cdc_trigger() in src/cdc.rs: drops + recreates the trigger DDL using the current GUC mode (fn body + trigger DDL).
pgtrickle.rebuild_cdc_triggers() in src/api.rs: iterates all source tables and calls rebuild_cdc_trigger() for each.
sql/pg_trickle--0.3.0--0.4.0.sql: declares the compiled function and calls it to migrate existing row-level triggers on ALTER EXTENSION UPDATE.

B-3: Benchmark Write-Side Improvement ✅ Done

Implemented in tests/e2e_bench_tests.rs as two #[ignore] test functions:

bench_stmt_vs_row_cdc_matrix — full 3×3×2 matrix: narrow/medium/wide × bulk INSERT/UPDATE + single INSERT × row/statement mode. Prints a comparison table with avg/p50/p95 timings and per-cell speedup ratios. Expected runtime: 15–20 min.
bench_stmt_vs_row_cdc_quick — narrow table, bulk INSERT only, 5 iterations. Completes in ~60 s; useful as a smoke-check.

Both benchmarks use alter_system_set_and_wait() to switch CDC trigger mode between row and statement between runs. The output includes per-width bulk-DML speedup percentages and a single-row neutrality check (±10% target).

Run explicitly: bash cargo test --test e2e_bench_tests -- --ignored bench_stmt_vs_row_cdc_matrix --nocapture cargo test --test e2e_bench_tests -- --ignored bench_stmt_vs_row_cdc_quick --nocapture

Expected outcome: 50–80% overhead reduction for bulk DML; neutral (±10%) for single-row DML. Results will be documented in PLAN_PERFORMANCE_PART_10.md once the benchmarks have been run.

6. Phase C: Parallel Refresh Implementation

Per REPORT_PARALLELIZATION.md, implementing Option A+B.

C-1: DAG Level Extraction

Modify StDag::topological_order() to return Vec<Vec<NodeId>> (levels) instead of Vec<NodeId>. Trivial modification to Kahn’s algorithm.

Effort: 2–4 hours.

C-2: Dynamic Background Worker Dispatch

For each level, spawn up to max_concurrent_refreshes dynamic background workers. Each worker: 1. Receives pgt_id via datum 2. Acquires advisory lock (or FOR UPDATE SKIP LOCKED per GAP_SQL_PHASE_7 Q7) 3. Executes refresh_stream_table() 4. Writes result to pgt_refresh_history 5. Exits

The coordinator waits for all workers in a level to complete before advancing to the next level.

Effort: 12–16 hours.

C-3: Result Communication

Workers write refresh outcomes to pgt_refresh_history (already done by refresh_stream_table). The coordinator reads these after worker completion to update retry state and scheduling decisions.

Effort: 3–4 hours.

7. Phase D: MERGE Alternatives & Planner Control

D-1: Hash-Based Change Detection for Wide Tables

Per GAP_SQL_PHASE_7 §G4.6, the MERGE WHEN MATCHED arm uses per-column IS DISTINCT FROM chains. For 100+ columns, this causes plan cache bloat.

Fix: For tables with > 50 columns, replace the per-column chain with a single hash comparison:

WHEN MATCHED AND pg_trickle.pg_trickle_hash(
    row_to_json(st.*)::text
) IS DISTINCT FROM pg_trickle.pg_trickle_hash(
    row_to_json(d.*)::text
) THEN UPDATE SET ...

Effort: 4–6 hours.

D-2: Conditional FULL Bypass for Aggregates

When all groups in an aggregate query are affected (common at high change rates with few groups), FULL refresh is always cheaper. Detect this condition before executing the delta:

if affected_groups_estimate == total_groups:
    use FULL refresh

The affected groups can be estimated from the change buffer: SELECT COUNT(DISTINCT group_key) FROM changes WHERE ...

If this equals the number of groups in the stream table, skip INCR.

Effort: 3–4 hours.

D-3: COST-Based Strategy Selection

Currently, the strategy decision uses a simple ratio (change_count / table_size > threshold). A more sophisticated approach would estimate the actual cost:

estimated_incr_cost = delta_rows × (cte_count × scan_cost + merge_cost)
estimated_full_cost = table_size × scan_cost + truncate_cost

Use the historical pgt_refresh_history data to calibrate the cost model per stream table.

Effort: 6–8 hours.

8. Phase E: Memory and I/O Budget Management

E-1: Delta Size Estimation Before Execution

Before executing the MERGE, estimate the delta result set size:

SELECT COUNT(*) FROM (delta_query LIMIT 10001) t

If the estimate exceeds a threshold, abort INCR and fall back to FULL. This prevents OOM for unexpected large deltas.

Effort: 2–3 hours.

E-2: work_mem Scaling

The current pg_trickle.merge_work_mem_mb GUC sets a fixed work_mem for large deltas. A smarter approach: scale work_mem proportional to estimated delta size:

work_mem = max(base_work_mem, delta_rows * avg_row_width / 1024)

Capped at pg_trickle.merge_work_mem_max_mb.

Effort: 2 hours.

E-3: Change Buffer Partitioning for High-Write Tables

For source tables with > 100K writes/minute, the change buffer table becomes a hotspot. Partition by lsn range (monthly/weekly) to allow fast TRUNCATE of old partitions.

Effort: 8–12 hours. Defer until trigger overhead benchmarks confirm this is a bottleneck.

9. Benchmark Infrastructure Improvements

Current Gaps

The existing benchmark infrastructure has several gaps that limit insight into system performance:

I-1: diff_operators Benchmark Cannot Run Locally

Problem: The diff_operators Criterion bench links against pg_trickle library code that uses pgrx symbols (SPI_processed, pg_sys::*). On macOS without PostgreSQL server libraries installed, the dyld loader fails with symbol not found in flat namespace '_SPI_processed'.

Fix: Refactor diff_operators.rs to depend only on pure-Rust types (OpTree, DiffContext, Expr, Column, Frontier). The diff operators themselves are pure Rust — the issue is that the benchmark binary links against the full pg_trickle crate which includes SPI-using modules.

Recommended: © as quick fix, (b) as long-term solution. Option © gets the benchmarks running immediately. Option (b) is the correct architectural fix — pure-Rust logic should not require a PostgreSQL installation to benchmark.

I-2: No Latency Distribution / Histogram Output

Problem: The E2E benchmarks report avg, median, P95, and cycle-1 vs cycles-2+ breakdowns. But there’s no histogram or full distribution output. P95 spikes (up to 7.6x median) are noted but not diagnosed.

Fix: Add per-cycle timing output as a parseable CSV-like format:

[BENCH_CYCLE] scenario=join rows=100000 pct=0.01 cycle=1 mode=INCR ms=44.3
[BENCH_CYCLE] scenario=join rows=100000 pct=0.01 cycle=2 mode=INCR ms=7.1
...

This enables external analysis (histogram plots, outlier identification, trend detection) without changing the test code significantly.

Effort: 2 hours.

I-3: No EXPLAIN ANALYZE Capture for Delta Queries

Problem: We know MERGE dominates (70–97% of time) but don’t know what PostgreSQL is doing inside the MERGE. Is it using hash joins? Nested loops? Seq scans? Parallel workers? Are sorts spilling to disk?

Fix: Add a benchmark mode that captures EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON) for the delta query. Store the plan in /tmp/bench_plans/<scenario>_<size>_<pct>.json for offline analysis.

Implementation: Add a GUC or environment variable PGS_BENCH_EXPLAIN=true that, when set, runs each MERGE as EXPLAIN ANALYZE on the first measured cycle.

Effort: 3–4 hours.

I-4: No Cross-Run Comparison Tool

Problem: Each benchmark run produces printed output. Comparing results across runs (e.g., before/after an optimization) requires manual side-by-side reading, as manually done in PLAN_PERFORMANCE_PART_8.md.

Fix: Write benchmark results to a JSON file (target/bench_results/<timestamp>.json). Provide a comparison script:

just bench-compare target/bench_results/2026-02-22.json target/bench_results/2026-02-26.json

Output: per-scenario deltas with color-coded improvement/regression.

Effort: 4–6 hours.

I-5: No Concurrent Writer Benchmarks

Problem: All benchmarks use a single connection for DML. In production, multiple concurrent writers hit source tables simultaneously. This stress- tests trigger overhead differently (BIGSERIAL contention, change buffer index lock contention, WAL write serialization).

Fix: Extend the trigger overhead benchmark (PLAN_TRIGGERS_OVERHEAD.md) to sweep writer concurrency: 1, 2, 4, 8 connections performing DML simultaneously.

Effort: 4–6 hours (requires pgbench-style parallel harness or tokio::spawn tasks with separate sqlx connections).

I-6: No Table Size Scaling Benchmark (1M+ Rows)

Problem: The current matrix tests 10K and 100K rows. Production tables commonly have 1M–100M rows. At 1M rows / 1% change rate (10K changes), the delta is larger than anything currently benchmarked, and the MERGE touches more stream table rows for the WHEN NOT MATCHED BY SOURCE arm.

Fix: Add a 1M-row tier to the benchmark matrix. Run as an optional bench_large_matrix test (separate from the standard suite due to longer run time).

Effort: 2 hours (increase TABLE_SIZES; may need container memory bump).

I-7: No Window / Lateral / CTE Operator Benchmarks

Problem: The E2E benchmarks test only 5 scenarios (scan, filter, aggregate, join, join_agg). Window functions, lateral subqueries, CTEs, set operations, and their combinations are not benchmarked.

Fix: Add scenarios: - window: SELECT id, region, ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount DESC) FROM src - lateral: SELECT s.*, l.top_score FROM src s, LATERAL (SELECT MAX(score) as top_score FROM src s2 WHERE s2.region = s.region) l - cte: WITH regional AS (SELECT region, SUM(amount) as total FROM src GROUP BY region) SELECT s.*, r.total FROM src s JOIN regional r ON s.region = r.region - union_all: SELECT id, amount, 'high' as tier FROM src WHERE amount > 5000 UNION ALL SELECT id, amount, 'low' FROM src WHERE amount <= 5000

Effort: 4–6 hours.

I-8: Criterion Benchmark Noise Reduction

Problem: The Criterion benchmarks show 10–19% outlier rates across many tests, suggesting system noise (background processes, thermal throttling, memory pressure). This makes it difficult to detect real regressions.

Fix: - Increase sample_size from default 100 to 200 for critical benchmarks - Add measurement_time(Duration::from_secs(10)) for fast benchmarks to get more iterations - Document that benchmarks should be run on a quiet system with nice -n -20 (or macOS equivalent) - Consider adding a bench-stable justfile target that sets CPU frequency governor (Linux) or reduces measurement noise

Effort: 1–2 hours.

10. Implementation Priority & Schedule

Session 1: Verify & Fix Regressions (3–4 hours) ✅ Done

Step	Task	Effort	Status
A-1	Verify adaptive threshold fix via E2E benchmark	30 min	✅ Done — 0.3x → 0.7x, D-2 triggers correctly
A-2	Verify prepared statements recover join regression	30 min	✅ Done — 12.9x speedup, warm-up 21% improvement
A-3	Fix `prefixed_col_list/20` regression	1 hour	✅ Done — eliminated Vec allocation
A-4	Investigate `lsn_gt` regression	1 hour	✅ Done — use `split_once`

Session 2: Benchmark Infrastructure (8–12 hours) ✅ Done

Step	Task	Effort	Status
I-1c	Run diff_operators in Docker (quick fix)	1 hour	✅ Done — `just bench-docker` target
I-2	Add per-cycle CSV output to E2E benchmarks	2 hours	✅ Done — `[BENCH_CYCLE]` lines
I-3	Add EXPLAIN ANALYZE capture mode	3–4 hours	✅ Done — `PGS_BENCH_EXPLAIN=true`
I-6	Add 1M-row benchmark tier	2 hours	✅ Done — `bench__1m_` + `bench_large_matrix`
I-8	Reduce Criterion noise	1 hour	✅ Done — sample_size(200), measurement_time(10s)

Session 3: Statement-Level Triggers ✅ Complete

Step	Task	Effort	Status
B-1	Implement statement-level trigger generation	8 hours	✅ Done
B-2	GUC + backward compatibility + migration	4 hours	✅ Done
B-3	Benchmark write-side improvement	2 hours	✅ Done

Session 4: Parallel Refresh (16–24 hours) ✅ Done

Step	Task	Effort	Status
C-1	DAG level extraction	2–4 hours	✅ Done — `topological_levels()` on StDag & ExecutionUnitDag
C-2	Dynamic background worker dispatch	12–16 hours	✅ Done — existing `parallel_dispatch_tick` sufficient
C-3	Result communication + error handling	3–4 hours	✅ Done — existing `SchedulerJob` + `pgt_refresh_history`

Session 5: MERGE Optimization (8–12 hours) ✅ Done

Step	Task	Effort	Status
D-1	Hash-based change detection for wide tables	4–6 hours	✅ Done — xxh64 via `pg_trickle_hash`
D-2	Conditional FULL bypass for saturated aggregates	3–4 hours	✅ Done — changes ≥ groups → FULL
D-3	Cost-based strategy selection	6–8 hours	✅ Done — history-based cost model blended with ratio

Session 6: Advanced Benchmarks (8–12 hours) ✅ Done

Step	Task	Effort	Status
I-4	Cross-run comparison tool	4–6 hours	✅ Done — JSON output + `just bench-compare`
I-5	Concurrent writer benchmarks	4–6 hours	✅ Done — ½/4/8 writer sweep
I-7	Window / lateral / CTE operator benchmarks	4–6 hours	✅ Done — 4 new scenarios

Summary

Session	Focus	Effort	Value
1	Regression triage	3–4h	✅ Done: A-3/A-4 fixed; A-1/A-2 deferred to E2E run
2	Benchmark infrastructure	8–12h	✅ Done: I-1c, I-2, I-3, I-6, I-8
3	Statement-level triggers	12–16h	✅ Done: 50–80% write-side overhead reduction
4	Parallel refresh	16–24h	✅ Done: C-1/C-2/C-3 — level-parallel dispatch
5	MERGE optimization	8–12h	✅ Done: D-1/D-2/D-3 — xxh64, saturation bypass, cost model
6	Advanced benchmarks	8–12h	✅ Done: I-4/I-5/I-7 — comparison tool, concurrency, operators
Total		55–80h

Recommended Execution Order

Session 1  →  Verify regressions are fixed          [PREREQUISITE for all]
Session 2  →  Benchmark infrastructure              [Enables data-driven decisions]
Session 3  →  Statement-level triggers               [Highest ROI: write-side]
Session 5  →  MERGE optimization                     [Read-side: strategy selection]
Session 4  →  Parallel refresh                       [Multi-ST scaling]
Session 6  →  Advanced benchmarks                    [Long-term observability]

Session 3 (statement-level triggers) is placed before Session 4 (parallel refresh) because it delivers measurable improvement to every deployment, while parallel refresh only helps multi-ST deployments.

Appendix: Relationship to Other Plans

Plan	Relationship
PLAN_PERFORMANCE_PART_8.md	Direct predecessor. Part 8’s regression analysis (A-1 through A-3) feeds Session 1 here.
REPORT_PARALLELIZATION.md	Session 4 (parallel refresh) implements Options A+B from this report.
PLAN_TRIGGERS_OVERHEAD.md	Session 3 (statement-level triggers) builds on this benchmark design.
STATUS_PERFORMANCE.md	Historical tracking. Update after each session completes.
GAP_SQL_PHASE_7.md	§F27 (expose adaptive threshold), §G4.6 (wide table MERGE), §G8.5 (sequential processing).
GAP_SQL_PHASE_7_QUESTIONS.md	Q7 (PgBouncer: FOR UPDATE SKIP LOCKED) affects parallel refresh locking strategy.

PGXN

PostgreSQL Extension Network

Contents

PLAN_PERFORMANCE_PART_9.md — Strategic Performance Roadmap

Table of Contents

1. Current State Assessment

1.1 Where We Are

1.2 Key Metrics (Part 8 Baseline)

1.3 Problem Areas

2. Benchmark Results (2026-02-26 Criterion Run)

refresh_bench (Pure Rust, no DB)

diff_operators (Requires pgrx/PG linking)

Assessment

3. Higher-Level Performance Decisions

D1: Where Does Time Actually Go? (The 80/20 Rule)

D2: When to Give Up on Incremental

D3: Row-Level vs Statement-Level CDC Triggers

D4: Parallel Refresh Dispatch

D5: UNLOGGED Change Buffers

D6: Column Pruning in Change Buffers

D7: Prepared Statement Strategy

D8: PostgreSQL Parallel Query for Large Refreshes

4. Phase A: Remaining Regressions & Quick Wins

A-1: Verify Adaptive Threshold Fix (Part 8 A-3)

A-2: Verify Prepared Statements Recover Join Regression

A-3: Investigate prefixed_col_list/20 Regression (+34%)

A-4: Investigate lsn_gt Regression (+22%)

5. Phase B: CDC Architecture — Statement-Level Triggers

B-1: Statement-Level Trigger Implementation ✅ Done

B-2: Backward Compatibility & Migration ✅ Done

B-3: Benchmark Write-Side Improvement ✅ Done

6. Phase C: Parallel Refresh Implementation

C-1: DAG Level Extraction

C-2: Dynamic Background Worker Dispatch

C-3: Result Communication

7. Phase D: MERGE Alternatives & Planner Control

D-1: Hash-Based Change Detection for Wide Tables

D-2: Conditional FULL Bypass for Aggregates

D-3: COST-Based Strategy Selection

8. Phase E: Memory and I/O Budget Management

E-1: Delta Size Estimation Before Execution

E-2: work_mem Scaling

E-3: Change Buffer Partitioning for High-Write Tables

9. Benchmark Infrastructure Improvements

Current Gaps

I-1: diff_operators Benchmark Cannot Run Locally

I-2: No Latency Distribution / Histogram Output

I-3: No EXPLAIN ANALYZE Capture for Delta Queries

I-4: No Cross-Run Comparison Tool

I-5: No Concurrent Writer Benchmarks

I-6: No Table Size Scaling Benchmark (1M+ Rows)

I-7: No Window / Lateral / CTE Operator Benchmarks

I-8: Criterion Benchmark Noise Reduction

10. Implementation Priority & Schedule

Session 1: Verify & Fix Regressions (3–4 hours) ✅ Done

Session 2: Benchmark Infrastructure (8–12 hours) ✅ Done

Session 3: Statement-Level Triggers ✅ Complete

Session 4: Parallel Refresh (16–24 hours) ✅ Done

Session 5: MERGE Optimization (8–12 hours) ✅ Done

Session 6: Advanced Benchmarks (8–12 hours) ✅ Done

Summary

Recommended Execution Order

Appendix: Relationship to Other Plans