PLAN_PERFORMANCE_PART_8.md — Residual Bottlenecks & Next-Wave Optimizations

Current Benchmark Results (2026-02-22)

Full Matrix — Summary (avg ms per cycle)

Scenario Rows Chg % FULL ms INCR ms INCR c1 INCR 2+ INCR med INCR P95
scan 10K 1% 32.9 1.4 (22.8x) 1.1 1.5 1.1 2.7
scan 10K 10% 29.2 2.6 (11.2x) 3.8 2.5 2.4 4.4
scan 10K 50% 37.0 36.7 (1.0x) 28.5 37.6 27.6 74.0
scan 100K 1% 300.1 4.6 (65.9x) 1.8 4.9 1.7 9.2
scan 100K 10% 265.2 111.5 (2.4x) 113.0 111.3 116.3 144.6
scan 100K 50% 548.8 821.7 (0.7x) 881.2 815.1 841.4 956.0
filter 10K 1% 21.7 4.8 (4.6x) 2.1 5.1 4.9 9.3
filter 10K 10% 19.4 3.4 (5.7x) 1.6 3.6 1.8 7.6
filter 10K 50% 18.5 21.5 (0.9x) 15.0 22.2 15.9 47.8
filter 100K 1% 134.4 3.4 (39.5x) 1.9 3.6 1.7 9.8
filter 100K 10% 151.7 82.5 (1.8x) 84.4 82.3 83.0 92.0
filter 100K 50% 212.1 175.4 (1.2x) 191.4 173.7 180.6 206.8
aggregate 10K 1% 4.4 2.3 (1.9x) 2.7 2.2 1.7 4.3
aggregate 10K 10% 7.5 1.9 (4.0x) 1.8 1.9 1.7 2.8
aggregate 10K 50% 7.6 8.5 (0.9x) 5.8 8.8 5.5 22.4
aggregate 100K 1% 17.2 2.5 (6.8x) 2.4 2.5 1.7 4.6
aggregate 100K 10% 18.3 10.7 (1.7x) 19.7 9.6 9.4 18.8
aggregate 100K 50% 25.3 40.4 (0.6x) 30.6 41.4 31.4 70.9
join 10K 1% 38.4 11.1 (3.5x) 44.3 7.4 7.1 28.8
join 10K 10% 29.6 33.2 (0.9x) 23.5 34.2 15.6 111.7
join 10K 50% 32.9 28.8 (1.1x) 25.8 29.1 26.4 40.8
join 100K 1% 294.0 18.0 (16.3x) 20.1 17.8 17.5 21.0
join 100K 10% 453.8 188.2 (2.4x) 151.5 192.2 165.9 328.4
join 100K 50% 348.1 320.2 (1.1x) 297.7 322.6 308.8 362.4
join_agg 10K 1% 6.1 6.0 (1.0x) 9.4 5.6 5.5 8.0
join_agg 10K 10% 7.5 14.1 (0.5x) 13.9 14.1 12.6 18.7
join_agg 10K 50% 8.1 10.2 (0.8x) 7.7 10.5 7.6 21.9
join_agg 100K 1% 27.8 14.6 (1.9x) 15.7 14.5 10.9 30.5
join_agg 100K 10% 31.4 95.4 (0.3x) 98.4 95.0 92.3 111.1
join_agg 100K 50% 33.1 35.1 (0.9x) 34.5 35.2 34.6 37.3

Per-Phase Timing Breakdown (DIFFERENTIAL avg ms)

Scenario Rows Chg % Decision Gen+Build Merge Cleanup Path
scan 10K 1% 0.23 0.14 1.07 0.11 cache_hit
scan 10K 10% 0.28 0.09 2.24 0.22 cache_hit
scan 100K 1% 0.52 0.05 6.46 0.47 cache_hit
scan 100K 10% 1.53 0.06 100.30 7.74 cache_hit
filter 10K 1% 0.39 0.19 1.52 0.13 cache_miss
filter 10K 10% 0.38 0.12 3.19 0.28 cache_hit
filter 100K 1% 0.49 0.12 6.24 0.48 cache_hit
filter 100K 10% 1.51 0.05 74.90 3.45 cache_hit
aggregate 10K 1% 0.30 0.15 0.82 0.12 cache_hit
aggregate 10K 10% 0.28 0.04 0.96 0.19 cache_hit
aggregate 100K 1% 0.46 0.08 1.89 0.39 cache_hit
aggregate 100K 10% 0.94 0.05 5.35 1.59 cache_hit
join 10K 1% 0.51 0.03 5.09 1.18 cache_hit
join 10K 10% 0.66 0.03 11.68 1.19 cache_hit
join 100K 1% 0.72 0.09 12.85 1.07 cache_hit
join 100K 10% 1.85 0.10 168.38 4.20 cache_hit
join_agg 10K 1% 0.48 0.09 2.64 0.65 cache_hit
join_agg 10K 10% 0.77 0.09 8.13 1.30 cache_hit
join_agg 100K 1% 0.71 0.09 6.67 0.88 cache_hit
join_agg 100K 10% 1.80 0.10 79.28 4.53 cache_hit

No-Data Refresh Latency

  • Avg: 9.73 ms (target <10 ms: ✅ PASS, borderline)
  • Max: 80.38 ms (cold start)

Criterion Micro-Benchmarks (delta SQL generation, pure Rust)

Benchmark Time (µs)
diff_scan/3cols 9.9
diff_scan/10cols 24.1
diff_scan/20cols 47.7
diff_filter 11.3
diff_project 10.7
diff_aggregate/count_star 9.3
diff_aggregate/sum_count_avg 21.7
diff_inner_join 30.7
diff_left_join 26.8
diff_distinct 13.6
diff_union_all/3_children 18.8
diff_union_all/5_children 45.3
diff_union_all/10_children 105.9
diff_window_row_number 17.1
diff_join_aggregate 54.2
diff_cte_simple 16.1
diff_lateral_srf 13.5

Comparison with Previous Checkpoints

Key Scenario: 100K rows, 1% changes

Checkpoint scan INCR ms filter INCR ms join INCR ms agg INCR ms join_agg INCR ms
Baseline 572.4 (0.7x) 126.0 (1.0x) 336.1 (1.0x) 22.3 (1.3x) 45.3 (1.2x)
After P1+P2 140.3 (2.7x) 68.8 (2.7x) 163.9 (2.7x) 24.3 (1.4x) 33.9 (1.5x)
After P7 46.5 (7.0x) 34.0 (7.9x) 63.5 (6.0x) 10.9 (2.6x) 21.6 (3.1x)
After Part 6 8.3 (41.7x) 7.4 (26.3x) 12.3 (30.4x) 5.7 (4.4x) 8.5 (4.9x)
Now (Part 8) 4.6 (65.9x) 3.4 (39.5x) 18.0 (16.3x) 2.5 (6.8x) 14.6 (1.9x)

Observations vs Part 6

Scenario Part 6 → Now Notes
scan 100K/1% 8.3 → 4.6 (1.8x faster) Continued improvement, now under 5ms
filter 100K/1% 7.4 → 3.4 (2.2x faster) Excellent; near-zero overhead
agg 100K/1% 5.7 → 2.5 (2.3x faster) Big improvement
join 100K/1% 12.3 → 18.0 (0.7x regression) Regression — investigate
join_agg 100K/1% 8.5 → 14.6 (0.6x regression) Regression — investigate

Pipeline Overhead (Decision + Gen+Build)

  • Consistently < 1ms for cache hits across all scenarios at 1% change rate
  • Gen+Build ≤ 0.19ms — delta SQL generation is negligible
  • MERGE dominates: 70–97% of total time in every scenario

Analysis & Bottleneck Identification

B1: Join & Join_Agg Regressions (100K/1%)

Symptom: join 12.3→18.0ms (+46%), join_agg 8.5→14.6ms (+72%).

Probable causes: - Rescan CTE overhead: The new rescan CTE (added for group-rescan aggregates) adds a LEFT JOIN to the merge CTE. Even for algebraic aggregates (SUM/COUNT), the CTE is structurally present — the planner must still evaluate build_rescan_cte() returning None, but the code should skip the JOIN entirely. Need to verify the generated SQL doesn’t have any residual artifact. - Docker image changes: The E2E image was rebuilt with all recent code changes. Compilation flags or shared library differences could cause planner behavior changes. - Benchmark variance: Join P95 = 21.0ms with median 17.5ms. Previous Part 6 numbers had median = 13.2ms but that run used different Docker image builds on potentially different host load conditions. The join 10K/1% cycle-1 spike of 44.3ms (vs 7.4 on 2+) suggests cache warming variance.

Action: Profile the generated delta SQL for join 100K/1% to isolate whether the CTE structure changed, or if PostgreSQL planner behavior shifted.

B2: join_agg Severe Regressions at 10%

Symptom: join_agg 100K/10% → 95.4ms (0.3x FULL). This is the worst scenario in the matrix.

Root cause: At 10%, ~10K change rows feed through the join+aggregate pipeline. The MERGE must: 1. Semi-join delta keys against the base table (10K keys) 2. Re-aggregate N groups 3. MERGE into the stream table

With only 5 groups, 10K changes touching all 5 groups means every group gets re-aggregated. The incremental path does strictly more work than FULL (which is a simple TRUNCATE + INSERT INTO ... SELECT).

Potential fix: Adaptive fallback should detect this and switch to FULL. Verify auto_threshold is triggering correctly for join_agg at 10%.

B3: scan 100K/50% Severe Regression (0.7x)

Symptom: 821.7ms vs 548.8ms FULL. P95 = 956ms.

Root cause: At 50% change rate, 50K rows flow through the delta pipeline. The MERGE INTO with 50K delta rows competing against 100K existing rows is strictly more expensive than TRUNCATE + INSERT INTO ... SELECT with 100K rows (which is a simple sequential scan + load). The adaptive threshold should be catching this.

Action: Lower the adaptive threshold or ensure it is operational for scan queries.

B4: No-Data Latency at Target Boundary (9.73ms)

Symptom: Avg 9.73ms, barely passing the <10ms target. Max = 80.38ms.

Analysis: Previous measurements were ~3ms. The increase likely comes from: - Additional SPI calls in the rescan CTE infrastructure (checking for group-rescan aggregates) - Changed Docker host conditions - Benchmark run variance

Action: Re-run with more iterations. If consistently >5ms, profile the no-data path.

B5: Aggregate P95 Instability (10K/50%)

Symptom: agg 10K/50% median = 5.5ms but P95 = 22.4ms (4x spike).

Root cause: PostgreSQL plan cache invalidation. Every Nth execution triggers re-planning, which costs ~10-15ms. With 50% changes affecting all 5 groups, the merge complexity varies across cycles.

B6: MERGE Dominance — The Fundamental Limit

The per-phase breakdown shows MERGE execution accounts for: - 1ms at 10K/1% (fast — minimal work) - 5-13ms at 100K/1% (moderate) - 74-168ms at 100K/10% (dominates total time)

The pipeline overhead (Decision + Gen+Build) is <1ms — essentially zero. Any further speedup must come from the MERGE SQL itself or from skipping it entirely.


Proposed Optimizations

Phase A: Regression Investigation & Fixes (Priority: HIGH)

A-1: Verify rescan CTE not leaking into algebraic aggregate SQL

Goal: Confirm that diff_aggregate for SUM/COUNT/AVG does NOT add a rescan CTE or LEFT JOIN.

Steps: 1. Add a unit test asserting no agg_rescan appears in diff_aggregate output for SUM/COUNT queries 2. If found, fix build_rescan_cte to return None more efficiently for algebraic-only aggregate lists

Effort: 30 min. Impact: May explain the join_agg 100K/1% regression.

A-2: Profile join 100K/1% delta SQL

Goal: Compare the generated MERGE SQL from Part 6 vs now.

Steps: 1. Add [PGS_DEBUG] SQL logging for join queries 2. Run benchmark, capture the full delta SQL 3. Compare CTE structure with /target/criterion/ historical baselines 4. Check if any new CTEs were introduced

Effort: 1 hour. Impact: Diagnose +46% join regression.

A-3: Audit adaptive threshold for regression scenarios

Goal: Verify auto_threshold triggers FULL fallback for join_agg 10K/10% and scan 100K/50%.

Steps: 1. Add [PGS_PROFILE] logging showing threshold decisions 2. Verify that after 2-3 cycles of INCR being slower, the threshold auto-adjusts 3. If not triggering, lower the default threshold or fix the calculation

Effort: 1 hour. Impact: Should eliminate 0.3x-0.7x scenarios by falling back to FULL.

Phase B: MERGE Optimization (Priority: HIGH)

B-1: Conditional MERGE bypass for no-change groups (aggregates)

Problem: For algebraic aggregates with 5 groups (SUM/COUNT), every delta cycle re-checks all 5 groups even if some have zero changes. The MERGE must read from delta, LEFT JOIN the stream table, compute new values, and write for each group — even when new_count = old_count AND new_sum = old_sum.

Fix: Add a pre-MERGE filter CTE that eliminates unchanged groups before the MERGE:

-- Before MERGE, filter the final CTE to only groups with actual changes
WITH __pgt_changed_groups AS (
    SELECT * FROM __pgt_cte_agg_final_N
    WHERE __pgt_action IN ('I', 'D')  -- already done
)
MERGE INTO st USING __pgt_changed_groups ...

This is already present (the WHERE ... IS DISTINCT FROM guard). The issue may be that PostgreSQL MERGE evaluates all matched rows even when the WHEN clause filters them. Consider splitting to DELETE + INSERT with pre-filtered CTEs instead of MERGE.

Effort: 2 hours. Impact: 10-30% at low change rates where many groups are unchanged.

B-2: Partitioned MERGE for large deltas

Problem: At 100K/10%, the MERGE processes 10K delta rows in a single SQL statement. PostgreSQL’s planner may choose a nested-loop join strategy that scales poorly.

Fix: For delta sizes > 1000 rows, split into batched MERGEs by group key hash or row_id range:

-- Instead of one MERGE with 10K rows:
MERGE INTO st USING (SELECT * FROM delta WHERE __pgt_row_id % 4 = 0) ...
MERGE INTO st USING (SELECT * FROM delta WHERE __pgt_row_id % 4 = 1) ...
-- etc.

Effort: 3 hours. Impact: Speculative — depends on planner behavior. Risk of increased overhead from multiple statements.

Decision: Defer until B-1 and A-3 are evaluated.

B-3: Replace MERGE with DELETE + INSERT for large deltas

Problem: PostgreSQL MERGE has overhead from the WHEN MATCHED/NOT MATCHED branching and tuple visibility checks. For scenarios where most rows change, DELETE changed_rows; INSERT new_rows may be cheaper.

Fix: When delta covers >25% of stream table rows, use: sql DELETE FROM st WHERE (key_cols) IN (SELECT key_cols FROM delta); INSERT INTO st SELECT ... FROM delta WHERE __pgt_action = 'I';

Effort: 2 hours. Impact: Potentially significant at 50% — eliminates MERGE planning overhead. Risk: two statements vs one.

Decision: Implement behind a GUC flag (pg_trickle.merge_strategy = 'auto'|'merge'|'delete_insert').

Phase C: Cleanup & Buffer Optimization (Priority: MEDIUM)

C-1: Async change buffer cleanup

Problem: Cleanup takes 0.5-7.7ms at 100K. Currently synchronous via TRUNCATE.

Fix: Move cleanup to a deferred callback or background worker tick. The change buffer is bounded by LSN range, so stale rows are harmless as long as the next refresh uses the correct frontier.

Effort: 2 hours. Impact: Saves 0.5-7ms per refresh, critical path reduction.

C-2: Trigger write amplification reduction

Problem: At high write rates, every source DML fires a trigger that INSERTs into the change buffer. This means every source INSERT/UPDATE/DELETE generates: - 1 WAL record for the source table - 1 trigger execution - 1 INSERT into the change buffer (+ WAL record + index maintenance)

Fix: Statement-level triggers with transition tables (PostgreSQL AFTER STATEMENT with referencing OLD/NEW TABLE). This batches all changes from a single statement into one buffer INSERT:

CREATE TRIGGER pg_trickle_cdc_tr AFTER INSERT OR UPDATE OR DELETE
ON source_table REFERENCING NEW TABLE AS new_rows OLD TABLE AS old_rows
FOR EACH STATEMENT EXECUTE FUNCTION pg_trickle_cdc_stmt_fn();

Effort: 8 hours (requires CDC rewrite). Impact: 50-80% reduction in trigger overhead at high write volumes. No change for single-row DML.

Decision: Defer to Part 9 — this is a major CDC architecture change.

Phase D: Planner Hints & Stability (Priority: MEDIUM)

D-1: Explicit join strategy hints for MERGE

Problem: P95 spikes in join scenarios (join 10K/10% P95 = 111.7ms vs median = 15.6ms) suggest planner instability.

Fix: Add SET LOCAL hints before MERGE execution: sql SET LOCAL enable_nestloop = off; -- for large deltas SET LOCAL work_mem = '64MB'; -- for hash joins

Apply conditionally based on estimated delta size: - delta < 100 rows: no hints (let planner optimize for small data) - delta 100-10K rows: enable_nestloop = off - delta > 10K rows: enable_nestloop = off + work_mem = 64MB

Effort: 2 hours. Impact: Should reduce P95 variability by 50%+.

D-2: SPI prepared statements

Problem: Every MERGE execution parses the SQL string. With template caching, the SQL is identical across cycles (only LSN placeholders differ), but PostgreSQL still parses it fresh each time.

Fix: Use SPI_prepare() + SPI_execute_plan() for the MERGE statement. Cache the plan handle alongside the SQL template.

Caveat: Previous attempt (H-W2 in Part 4) showed net-negative results because PostgreSQL’s custom plan mode re-plans every EXECUTE for the first 5 executions. Need ≥6 cycles per stream table before the generic plan locks in. With CYCLES=10 this should work.

Effort: 4 hours. Impact: Saves ~1-2ms parse time per refresh. Higher impact at low change rates where parse is a larger fraction of total time.

Decision: Implement and benchmark with CYCLES=20 to amortize plan cache warmup.

Phase E: No-Data Path Optimization (Priority: LOW)

E-1: Ultra-fast empty-buffer check

Problem: No-data latency = 9.73ms. Target is <10ms (barely passing).

Fix: Replace the current count(*) decision query with an EXISTS check that short-circuits on the first row: sql SELECT EXISTS( SELECT 1 FROM pgtrickle_changes.changes_OID WHERE lsn > $prev_lsn AND lsn <= $new_lsn LIMIT 1 )

If this returns FALSE, skip all subsequent processing (no Gen, no Build, no MERGE, no Cleanup).

Effort: 1 hour. Impact: Should bring no-data from ~10ms to <3ms.

E-2: Shared memory fast-path for zero-change detection

Problem: Even the EXISTS check requires an SPI call into PostgreSQL.

Fix: Maintain a per-source-table atomic counter in shared memory (via PgAtomic). The CDC trigger increments it on every write. The refresh function reads it — if zero since last refresh, skip SPI entirely.

Effort: 3 hours. Impact: No-data latency → <1ms. Requires shared memory initialization.

Decision: Implement E-1 first; E-2 only if E-1 is insufficient.


Implementation Priority & Schedule

Session 1: Regression Triage (A-1, A-2, A-3) — ✅ COMPLETED

A-1: Rescan CTE not leaking — ✅ CONFIRMED. Added 6 unit tests verifying no agg_rescan CTE is generated for SUM, COUNT(*), AVG, MIN, MAX, and SUM+COUNT+AVG combined queries. The build_rescan_cte function correctly returns None for algebraic-only aggregate lists. The rescan CTE infrastructure does NOT contribute to the join/join_agg regressions.

A-2: Join regression root cause — ✅ IDENTIFIED. The join 100K/1% regression (12.3ms → 18.0ms) is a pre-existing issue from the PREPARE/EXECUTE revert (P1+P2 commits). The 11-CTE join delta query incurs significant planning overhead when planned fresh every cycle. This is NOT caused by recent changes (rescan CTE, TABLESAMPLE rejection, test fixes). Recovery requires restoring SPI prepared statements (Session 5, D-2).

A-3: Adaptive threshold bug — ✅ FIXED. last_full_ms was never set during initial materialization, keeping it NULL forever for STs whose change rate never exceeded the 15% fallback threshold. The auto-tuner code (if let Some(last_full) = st.last_full_ms) was dead code for these STs. Fixed by recording the initial materialization time as last_full_ms during create_stream_table. This enables the auto-tuner from the first differential refresh onward, which will correctly lower the threshold for scenarios like join_agg at 10% where INCR is slower than FULL. Also added debug logging when the auto-tuner adjusts the threshold.

Session 2: No-Data & Cleanup Fast Path (E-1, C-1) — ✅ COMPLETED

E-1: Ultra-fast EXISTS no-data short-circuit - Replaced heavy LATERAL+capped-count decision query with two-phase approach - Phase 1: Fast SELECT EXISTS(...) check — single SPI call for single-source; UNION ALL wrapped in EXISTS() for multi-source (short-circuits on first row) - Phase 2: Capped-count threshold check only runs when changes actually exist, with early break on first source exceeding the FULL fallback threshold - Expected outcome: No-data path avoids pg_class lookup and CASE expression; should bring no-data latency well under 5ms

C-1: Deferred change buffer cleanup - Added PendingCleanup struct + PENDING_CLEANUP thread-local queue - Cleanup is now enqueued (near-zero cost) instead of executing inline - drain_pending_cleanups() runs at the start of the NEXT refresh cycle - Safety: LSN-range predicates in delta queries ensure stale rows are never re-consumed, so deferred cleanup is fully safe - Profiling label updated from cleanup to cleanup_enqueue to reflect deferral - Expected outcome: 0.5–7ms savings on every differential refresh

Session 3: Planner Stability (D-1) — ✅ COMPLETED

D-1: Conditional SET LOCAL planner hints based on delta size - Added apply_planner_hints() function with three tiers: - delta < 100 rows: no hints (let planner optimise for small data) - delta 100–9 999: SET LOCAL enable_nestloop = off (favour hash joins) - delta >= 10 000: also SET LOCAL work_mem = '<N>MB' (avoid disk-spill) - SET LOCAL is automatically reset at transaction end — no cleanup needed - Accumulated total_change_count from the capped-count threshold loop to feed the hint tier decision - New GUCs: - pg_trickle.merge_planner_hints (bool, default true) — master switch - pg_trickle.merge_work_mem_mb (int, default 64) — work_mem for large deltas - Profiling line now includes delta_est=<N> and hints=<tier> fields - Expected outcome: P95 reduction for join/join_agg scenarios where nested-loop plans cause latency spikes

Session 4: MERGE Strategy (B-1, B-3) — ✅ COMPLETED

B-1: IS DISTINCT FROM guard to skip no-op UPDATEs - Added IS DISTINCT FROM check on the MERGE WHEN MATCHED ... THEN UPDATE clause so unchanged rows are skipped entirely (no heap write) - The guard is: AND (st.col1 IS DISTINCT FROM d.col1 OR st.col2 IS DISTINCT FROM d.col2 OR ...) - Applied in both the prewarm_merge_cache path and the cache-miss build path - Expected outcome: Eliminates unnecessary I/O for aggregate groups whose recomputed values are identical to the current values

B-3: DELETE + INSERT alternative for large deltas (behind GUC) - New GUC: pg_trickle.merge_strategy (string: auto/merge/delete_insert) - In auto mode (default), switches to DELETE+INSERT when the estimated delta exceeds 25% of the source table row count (MERGE_STRATEGY_AUTO_THRESHOLD) - DELETE+INSERT is two statements: 1. DELETE FROM st WHERE __pgt_row_id IN (SELECT __pgt_row_id FROM delta) 2. INSERT INTO st SELECT ... FROM delta WHERE __pgt_action = 'I' - Both MERGE and DELETE+INSERT templates are cached alongside each other in CachedMergeTemplate, with strategy selection at execution time - Profiling line now includes strategy=merge|delete_insert field - Also accumulated total_table_size from the capped-count threshold loop to feed the auto-strategy decision - Expected outcome: 10–30% improvement at 50% change rates where MERGE’s tuple visibility overhead dominates

Session 5: Prepared Statements (D-2) ✅ COMPLETED

  • SQL PREPARE / EXECUTE for MERGE (not C-level SPI_prepare)
  • New GUC: pg_trickle.use_prepared_statements (default true)
  • Parameterized MERGE template with $N positional params for LSN values
  • PREPARE issued on first cache-hit, EXECUTE on subsequent cycles
  • PostgreSQL switches from custom → generic plan after ~5 executions
  • DEALLOCATE on cache invalidation; session-level lifetime
  • 9 new unit tests (parameterize_lsn_template, build_prepare_type_list, build_execute_params)
  • 841 unit tests passing, fmt+lint clean
  • Expected outcome: 1-2ms savings per refresh on cache hits

Targets for Part 8

Metric Current Target
scan 100K/1% 4.6ms (65.9x) < 5ms (maintain)
filter 100K/1% 3.4ms (39.5x) < 4ms (maintain)
aggregate 100K/1% 2.5ms (6.8x) < 3ms (maintain)
join 100K/1% 18.0ms (16.3x) < 12ms (recover)
join_agg 100K/1% 14.6ms (1.9x) < 10ms (recover)
join_agg 100K/10% 95.4ms (0.3x) > 1.0x (fix)
scan 100K/50% 821.7ms (0.7x) > 1.0x (fix)
No-data latency 9.73ms < 5ms
P95 / median ratio up to 7.6x < 3x

Summary of Changes Since Part 7

Since the Part 6/7 benchmarks, the following code changes were made that may affect performance:

  1. Rescan CTE (src/dvm/operators/aggregate.rs): Added build_rescan_cte(), child_to_from_sql(), agg_to_rescan_sql() for group-rescan aggregates. This adds a LEFT JOIN to the merge CTE for group-rescan aggregates (BIT_AND, STDDEV, etc). For algebraic aggregates (SUM/COUNT/AVG), build_rescan_cte returns None and no JOIN is added. The standard benchmark queries (SUM+COUNT) should not be affected, but needs verification (A-1).

  2. TABLESAMPLE rejection (src/dvm/parser.rs): Added T_RangeTableSample check. No performance impact.

  3. Test changes: Updated E2E tests for PERCENTILE_CONT, STDDEV, EXISTS, window functions, JSON_OBJECT_AGG→JSONB_OBJECT_AGG. No runtime performance impact.

  4. Docker image rebuild: The E2E image was rebuilt with all changes. Different cargo build compilation could produce different binary characteristics.