Contents
- v0.13.0 — Scalability Foundations, Partitioning Enhancements, MERGE Profiling & Multi-Tenant Scheduling
- Scalability Foundations (Phase 5)
- Partitioning Enhancements (Phase 6)
- MERGE Profiling (Phase 7)
- dbt Macro Updates (Phase 8)
- Multi-Tenant Scheduler Isolation (Phase 9)
- TPC-H Benchmark Harness (Phase 10)
- SQL Coverage Cleanup (Phase 11)
- DVM Engine Improvements (Phase 10)
- Regression-Free Testing Initiative (Q2 2026)
- v0.13.0 — Scalability Foundations, Partitioning Enhancements, MERGE Profiling & Multi-Tenant Scheduling
Plain-language companion: v0.13.0.md
Status: Released (2026-03-31).
Goal: Deliver the scalability foundations deferred from v0.12.0 — columnar change tracking and shared change buffers — alongside the partitioning enhancements that build on v0.11.0’s RANGE partitioning spike, a MERGE deduplication profiling pass, the dbt macro updates, per-database worker quotas for multi-tenant deployments, the TPC-H-derived benchmarking harness for data-driven performance validation, and a small SQL coverage cleanup for PG 16+ expression types.
Phases: 5 (Scalability), 6 (Partitioning), 7 (MERGE Profiling), 8 (dbt Macro Updates), 9 (Multi-Tenant Scheduler Isolation), 10 (TPC-H Benchmark Harness), 11 (SQL Coverage Cleanup).
Scalability Foundations (Phase 5)
In plain terms: These items directly serve the project’s primary goal of world-class performance and scalability. Columnar change tracking eliminates wasted delta processing for wide tables, and shared change buffers reduce I/O multiplication in deployments with many stream tables reading from the same source.
| Item | Description | Effort | Ref | Status |
|---|---|---|---|---|
| A-2 | Columnar Change Tracking. Per-column bitmask in CDC triggers; skip rows where no referenced column changed; lightweight UPDATE-only path when only projected columns changed; 50–90% delta-volume reduction for wide-table UPDATE workloads. | 3–4 wk | PLAN_NEW_STUFF.md §A-2 | ✅ Done |
| D-4 | Shared Change Buffers. Single buffer per source shared across all dependent STs; multi-frontier cleanup coordination; static-superset column mode for initial implementation. | 3–4 wk | PLAN_NEW_STUFF.md §D-4 | ✅ Done |
| PERF-2 | buffer_partitioning for high-throughput sources. A buffer that exceeds compact_threshold in a single refresh cycle is converted to RANGE(lsn) partitioned mode at runtime. | — | REPORT_OVERALL_STATUS.md §R7 | ✅ Done |

⚠️ D-4 multi-frontier cleanup correctness verified: MIN(consumer_frontier) is used in all cleanup paths, and property-based tests with 5–10 consumers and 500 random frontier advancement cases pass.

Scalability foundations subtotal: ~6–8 weeks
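As a rough illustration of A-2's decision logic, here is a minimal sketch, assuming a u64 bitmask where bit i is set when column i changed. All names are hypothetical, not the actual pg_trickle trigger code: a row is dropped from the delta when no referenced column changed, and takes the lightweight UPDATE-only path when only projected (non-key) columns changed.

```rust
#[derive(Debug, PartialEq)]
enum DeltaAction {
    Skip,       // no referenced column changed: drop the row from the delta
    UpdateOnly, // only projected value columns changed: lightweight path
    FullDelta,  // a key column changed: full delete+insert delta pair
}

/// Classify an UPDATE given the bitmask of changed columns, the columns the
/// stream table's defining query references, and the key columns.
fn classify_update(changed: u64, referenced: u64, key_cols: u64) -> DeltaAction {
    if changed & referenced == 0 {
        DeltaAction::Skip
    } else if changed & key_cols == 0 {
        DeltaAction::UpdateOnly
    } else {
        DeltaAction::FullDelta
    }
}
```

With three referenced columns (bits 0..=2) where bit 0 is the key, an UPDATE touching only an unreferenced column is skipped entirely, which is where the 50–90% delta-volume reduction for wide tables comes from.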
Partitioning Enhancements (Phase 6)
In plain terms: The v0.11.0 spike delivered RANGE partitioning end-to-end. These follow-on items extend coverage to the use cases deliberately deferred from A1: multi-column keys, retrofitting existing stream tables, LIST-based partitions, HASH partitions, and operational quality-of-life improvements.
| Item | Description | Effort | Ref |
|---|---|---|---|
| A1-1b | Multi-column RANGE partition keys via partition_by; ROW() predicate for composite keys. | — | src/api.rs, src/refresh.rs |
| A1-1c | alter_stream_table(partition_by => …) support. Add/change/remove partition key with full storage rebuild. | — | src/api.rs, src/catalog.rs |
| A1-1d | PARTITION BY LIST for low-cardinality columns; IN (…) predicate style from the delta. | — | src/api.rs, src/refresh.rs |
| A1-3b | HASH:col[:N] with auto-created child partitions; per-partition MERGE through satisfies_hash_partition(). | — | src/api.rs, src/refresh.rs |
| PART-WARN | warn_default_partition_growth() after FULL and DIFFERENTIAL refresh. | — | src/refresh.rs |
Partitioning enhancements subtotal: ~5–8 weeks
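The composite ROW() predicate A1-1b describes can be sketched as a small SQL-string builder. This is illustrative only; the real generator lives in src/refresh.rs and its predicate shape may differ.

```rust
/// Build a half-open composite-key range predicate using row-value syntax,
/// which PostgreSQL can use to prune multi-column RANGE partitions.
/// `cols` are the key columns; `lo`/`hi` are pre-rendered SQL literals.
fn row_range_predicate(cols: &[&str], lo: &[&str], hi: &[&str]) -> String {
    let row = |vals: &[&str]| format!("ROW({})", vals.join(", "));
    format!(
        "{c} >= {l} AND {c} < {h}",
        c = row(cols),
        l = row(lo),
        h = row(hi)
    )
}
```

For a (region, bucket) key this yields a predicate like ROW(region, bucket) >= ROW('EU', 0) AND ROW(region, bucket) < ROW('EU', 100), comparing the key columns lexicographically as a single row value.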
MERGE Profiling (Phase 7)
| Item | Description | Effort | Ref |
|---|---|---|---|
| G14-MDED | MERGE deduplication profiling. Profile how often concurrent-write scenarios produce duplicate key entries requiring pre-MERGE compaction. If ≥10% of refresh cycles need dedup, write an RFC for a two-pass MERGE strategy. | 3–5d | plans/performance/REPORT_OVERALL_STATUS.md §14 |
| PROF-DLT | Delta SQL query plan profiling (explain_delta() function). Capture EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON) for auto-generated delta SQL queries to identify PostgreSQL execution bottlenecks (join algorithms, scan types, sort spills). Add pgtrickle.explain_delta(st_name, format DEFAULT 'text') SQL function; optional PGS_PROFILE_DELTA=1 environment variable for E2E test auto-capture to /tmp/delta_plans/<st>.json. Enables identification of operator-level performance issues (semi-join full scans, deep join chains). Prerequisite for data-driven MERGE optimization. | 1–2w | PLAN_TPC_H_BENCHMARKING.md §1-5 |
MERGE profiling subtotal: ~1–3 weeks
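The ≥10% RFC trigger in G14-MDED amounts to a simple ratio over the two counters; a hedged sketch (the counter names follow the exit criteria, the function names are illustrative):

```rust
/// Fraction of DIFFERENTIAL refresh cycles that needed pre-MERGE dedup,
/// from the two monotone shared-memory counters described above.
fn dedup_ratio(total_diff_refreshes: u64, dedup_needed: u64) -> f64 {
    if total_diff_refreshes == 0 {
        0.0
    } else {
        dedup_needed as f64 / total_diff_refreshes as f64
    }
}

/// True when the observed ratio crosses the >=10% threshold that triggers
/// the two-pass MERGE RFC.
fn rfc_required(total: u64, dedup: u64) -> bool {
    dedup_ratio(total, dedup) >= 0.10
}
```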
dbt Macro Updates (Phase 8)
In plain terms: Expose the v0.11.0 SQL API additions (partition_by, fuse, fuse_ceiling, fuse_sensitivity) in the dbt materialization macros so dbt users can configure them via config(...). No catalog changes; pure Jinja/SQL.
| Item | Description | Effort |
|---|---|---|
| DBT-1 | partition_by config option wired through stream_table.sql, create_stream_table.sql, and alter_stream_table.sql | ~1d |
| DBT-2 | fuse, fuse_ceiling, fuse_sensitivity config options wired through the materialization and alter macro with change-detection logic | ~1–2d |
| DBT-3 | dbt docs update: README and SQL_REFERENCE.md dbt section | ~0.5d |
dbt macro updates subtotal: ~2–3.5 days
Multi-Tenant Scheduler Isolation (Phase 9)
In plain terms: As deployments grow past 10 databases on a single cluster, all schedulers compete for the same global background-worker pool. One busy database can starve the others. Phase 9 gives operators per-database quotas and a priority queue so critical databases always get workers.
| Item | Description | Effort | Ref |
|---|---|---|---|
| C-3 | pg_trickle.per_database_worker_quota GUC; priority ordering: IMMEDIATE > Hot > Warm > Cold STs; burst capacity up to 150% when other databases are under quota. Implemented via compute_per_db_quota() with burst at 80% cluster load and tier-aware sort_ready_queue_by_priority; 5 unit tests + 6 E2E tests. | — | src/scheduler.rs |
⚠️ C-3 depends on C-1 (tiered scheduling) for Hot/Warm/Cold classification. If C-1 is not ready, fall back to IMMEDIATE > all-other ordering with equal priority within each tier; add full tier-aware ordering as a follow-on when C-1 lands in v0.14.0.
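The quota-plus-burst arithmetic and tier ordering can be sketched as follows. This is a simplification: it assumes the 150% burst cap from the item description and ignores how the real compute_per_db_quota() measures cluster load; all names are illustrative.

```rust
/// A database may borrow workers that other databases leave idle, up to
/// 150% of its configured quota.
fn effective_quota(base_quota: u32, idle_workers_elsewhere: u32) -> u32 {
    let burst_cap = base_quota + base_quota / 2;
    (base_quota + idle_workers_elsewhere).min(burst_cap)
}

/// Tier-aware priority: IMMEDIATE first, then Hot > Warm > Cold.
/// Variant order gives the sort order directly via the derived Ord.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Clone, Copy)]
enum Tier {
    Immediate,
    Hot,
    Warm,
    Cold,
}

fn sort_ready_queue(queue: &mut Vec<(Tier, &str)>) {
    queue.sort_by_key(|(tier, _)| *tier);
}
```

With a quota of 4 and 10 idle workers elsewhere, the effective quota caps at 6 (150%); with only 1 idle worker it is 5.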
Multi-tenant scheduler isolation subtotal: ~2–3 weeks
TPC-H Benchmark Harness (Phase 10)
In plain terms: The existing TPC-H correctness suite (22/22 queries passing) has no timing infrastructure. Phase 10 adds benchmark mode so we can measure FULL vs DIFFERENTIAL speedups across all 22 queries — the only way to validate that A-2, D-4, and other v0.13.0 changes actually help on realistic analytical workloads, and to catch per-query regressions at larger scale factors.
| Item | Description | Effort | Ref | Status |
|---|---|---|---|---|
| TPCH-1 | TPCH_BENCH=1 benchmark mode for Phase 3. Instrument test_tpch_full_vs_differential with warm-up cycles (WARMUP_CYCLES=2), reuse extract_last_profile() for [PGS_PROFILE] extraction, emit [TPCH_BENCH] structured output per cycle (query=q01 tier=2 cycle=1 mode=DIFF ms=12.7 decision=0.41 merge=11.3 …). Add print_tpch_summary() with per-query FULL/DIFF median, speedup, P95, and MERGE% table. | 4–5h | PLAN_TPC_H_BENCHMARKING.md §3 | ✅ Done |
| TPCH-2 | just bench-tpch / bench-tpch-large / bench-tpch-fast justfile targets. bench-tpch: SF-0.01 with TPCH_BENCH=1; bench-tpch-large: SF-0.1 with 5 cycles; bench-tpch-fast: skip Docker image rebuild. Enables before/after measurement for every v0.13.0 optimization. | 15 min | PLAN_TPC_H_BENCHMARKING.md §3 | ✅ Done |
| TPCH-3 | TPC-H OpTree Criterion micro-benchmarks. Add composite OpTree benchmarks to benches/diff_operators.rs representing TPC-H query shapes (diff_tpch_q01, diff_tpch_q05, diff_tpch_q08, diff_tpch_q18, diff_tpch_q21). Measures pure-Rust delta SQL generation time for complex multi-join/semi-join trees; catches DVM engine regressions without a running database. | 4h | PLAN_TPC_H_BENCHMARKING.md §4 | ✅ Done |
TPC-H benchmark harness subtotal: ~1 day
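The per-query FULL/DIFF summary reduces to median and ratio arithmetic over the per-cycle millisecond samples; a minimal sketch of what print_tpch_summary() computes per query (function names here are illustrative):

```rust
/// Median of a sample set (sorts in place; average of the middle pair
/// for an even count).
fn median(samples: &mut Vec<f64>) -> f64 {
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let n = samples.len();
    if n % 2 == 1 {
        samples[n / 2]
    } else {
        (samples[n / 2 - 1] + samples[n / 2]) / 2.0
    }
}

/// Speedup factor reported per query: median FULL time over median DIFF time.
fn speedup(full_ms: &mut Vec<f64>, diff_ms: &mut Vec<f64>) -> f64 {
    median(full_ms) / median(diff_ms)
}
```

Medians rather than means keep one cold-cache outlier cycle from skewing the reported speedup, which is why the warm-up cycles (WARMUP_CYCLES=2) are excluded upstream.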
SQL Coverage Cleanup (Phase 11)
In plain terms: Three small SQL expression gaps that are unscheduled anywhere. Two are PG 16+ standard SQL syntax currently rejected with errors; one is an audit-gated correctness check for recursive CTEs with non-monotone operators. All are low-effort items that round out DVM coverage without adding scope risk.
| Item | Description | Effort | Ref |
|---|---|---|---|
| SQL-RECUR | Recursive CTE non-monotone divergence audit. Write an E2E test for a recursive CTE with EXCEPT or aggregation in the recursive term (WITH RECURSIVE … SELECT … EXCEPT SELECT …). If the test passes → downgrade G1.3 to P4 (verified correct, no code change). If it fails → add a guard in diff_recursive_cte that detects non-monotone recursive terms and rejects them with ERROR: non-monotone recursive CTEs are not supported in DIFFERENTIAL mode — use FULL. | 6–8h | GAP_SQL_PHASE_7.md §G1.3 |
| SQL-PG16-1 | IS JSON predicate support (PG 16+). expr IS JSON, expr IS JSON OBJECT, expr IS JSON ARRAY, expr IS JSON SCALAR, expr IS JSON WITH UNIQUE KEYS — standard SQL/JSON predicates rejected today. Add a T_JsonIsPredicate arm in parser.rs; the predicate is treated opaquely (no delta decomposition); it passes through to the delta SQL unchanged, where the PG executor evaluates it natively. | 2–3h | GAP_SQL_PHASE_6.md §G1.4 |
| SQL-PG16-2 | SQL/JSON constructor support (PG 16+). JSON_OBJECT(…), JSON_ARRAY(…), JSON_OBJECTAGG(…), JSON_ARRAYAGG(…) — standard SQL/JSON constructors (T_JsonConstructorExpr) currently rejected. Add opaque pass-through in parser.rs; treat as scalar expressions (no incremental maintenance of the JSON value itself); handle the aggregate variants the same way as other custom aggregates (full group rescan). | 4–6h | GAP_SQL_PHASE_6.md §G1.5 |
SQL coverage cleanup subtotal: ~1–2 days
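The opaque pass-through treatment the two PG 16+ items share can be sketched as a parser dispatch. The node tags mirror T_JsonIsPredicate / T_JsonConstructorExpr, but the enum and function names are illustrative, not pg_trickle's actual parser.rs types:

```rust
/// Node tags of interest (stand-ins for the PostgreSQL parse-node tags).
enum NodeTag {
    JsonIsPredicate,
    JsonConstructorExpr,
    OpExpr,
}

enum Expr {
    /// Opaque: the deparsed SQL is carried into the delta query verbatim
    /// and evaluated natively by the PostgreSQL executor.
    Raw(String),
    /// Decomposable operator expression (placeholder for the real variants).
    Op(String),
}

fn parse_expr(tag: NodeTag, deparsed: &str) -> Expr {
    match tag {
        // No delta decomposition: pass the SQL through unchanged.
        NodeTag::JsonIsPredicate | NodeTag::JsonConstructorExpr => {
            Expr::Raw(deparsed.to_string())
        }
        NodeTag::OpExpr => Expr::Op(deparsed.to_string()),
    }
}
```

Treating the expression as opaque is safe here because neither predicate nor constructor needs incremental maintenance of its own: its value is recomputed per delta row.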
DVM Engine Improvements (Phase 10)
In plain terms: The delta SQL generated for deep multi-table joins (e.g., TPC-H Q05/Q09 with 6 joined tables) computes identical pre-change snapshots redundantly at every reference site, spilling multi-GB temporary files that exceed temp_file_limit. Nested semi-joins (Q20) exhibit an O(n²) blowup from fully materializing the right-side pre-change state. These improvements target the intermediate data volume directly in the delta SQL generator, with TPC-H 22/22 DIFFERENTIAL correctness as the measurable gate.
| Item | Description | Effort | Ref |
|---|---|---|---|
| DI-1 | Named CTE L₀ snapshots. Emit per-leaf pre-change snapshots as named CTEs (NOT MATERIALIZED default; MATERIALIZED when reference count ≥ 3); deduplicate 3–10× redundant EXCEPT ALL evaluations per leaf. Targets Q05/Q09 temp spill root cause. | 2–3d | PLAN_DVM_IMPROVEMENTS.md §DI-1 |
| DI-2 | Pre-image read from change buffer + aggregate UPDATE-split. Replace per-leaf EXCEPT ALL with a NOT EXISTS anti-join on pk_hash + direct old_* read. Per-leaf conditional fallback to EXCEPT ALL when delta exceeds max_delta_fraction for that leaf. Includes aggregate UPDATE-split: the 'D' side of SUM(CASE WHEN …) evaluates using old_* column values, superseding DI-8's band-aid. | 3.5–5.5d | PLAN_DVM_IMPROVEMENTS.md §DI-2 |
| DI-3 | Group-key filtered aggregate old rescan. Restrict non-algebraic aggregate EXCEPT ALL rescans to affected groups via EXISTS (… IS NOT DISTINCT FROM …) filter. NULL-safe. Independent quick win. | 0.5–1d | PLAN_DVM_IMPROVEMENTS.md §DI-3 |
| DI-6 | Lazy semi-join R_old materialization. Skip EXCEPT ALL for unchanged semi-join right children; push down equi-join key as a filter when R_old is needed. Eliminates Q20-type O(n²) blowup. | 1–2d | PLAN_DVM_IMPROVEMENTS.md §DI-6 |
| DI-4 | Shared R₀ CTE cache. Cache pre-change snapshot SQL by OpTree node identity to avoid regenerating duplicate inline subqueries for shared subtrees. Depends on DI-1. | 1–2d | PLAN_DVM_IMPROVEMENTS.md §DI-4 |
| DI-5 | Part 3 correction consolidation. Consolidate per-node Part 3 correction CTEs for linear inner-join chains into a single term. | 2–3d | PLAN_DVM_IMPROVEMENTS.md §DI-5 |
| DI-7 | Scan-count-aware strategy selector. max_differential_joins and max_delta_fraction per-stream-table options; auto-fallback to FULL refresh when join count or delta-rate threshold is exceeded. Complements DI-2's per-leaf fallback with a coarser per-ST guard at scheduler decision time. | 1–2d | PLAN_DVM_IMPROVEMENTS.md §DI-7 |
| DI-8 | SUM(CASE WHEN …) algebraic drift fix. Detect Expr::Raw("CASE …") in is_algebraically_invertible() and fall back to GROUP_RESCAN. Q14 is unaffected (parsed as ComplexExpression, already GROUP_RESCAN). Correctness band-aid superseded by DI-2's aggregate UPDATE-split. | ~0.5d | PLAN_DVM_IMPROVEMENTS.md §DI-8 |
| DI-9 | Scheduler skips IMMEDIATE-mode tables. Raise scheduler_interval_ms GUC cap to 600,000 ms; return early from refresh-due check for refresh_mode = IMMEDIATE (verified safe: IMMEDIATE drains TABLE-source buffers synchronously; downstream CALCULATED tables detected via has_stream_table_source_changes() independently). | 0.5d | PLAN_DVM_IMPROVEMENTS.md §DI-9 |
| DI-10 | SF=1 benchmark validation gate. Add bench-tpch-sf1 justfile target (TPCH_SF=1 TPCH_BENCH=1). Gate v0.13.0 release on 22/22 queries at SF=1. CI: manual dispatch only (60–180 min runtime, 4h timeout). | ~0.5d | PLAN_DVM_IMPROVEMENTS.md §DI-10 |
| DI-11 | Predicate pushdown + deep-join L₀ threshold + planner hints. (a) Enable push_filter_into_cross_joins() with scalar-subquery guard. (b) Deep-join L₀ threshold (4+ scans): skip L₀ reconstruction, use L₁ + Part 3 correction. (c) Deep-join planner hints (5+ scans): disable nestloop, raise work_mem, override temp_file_limit. Result: 22/22 TPC-H DIFFERENTIAL. | ~1d | — |
DI-2 promoted from v1.x: CDC old_* column capture was completed as part of the typed-column CDC rewrite (already in production). DI-2 scope includes both the join-level pre-image capture (NOT EXISTS anti-join) and an aggregate UPDATE-split that uses old_* values for the 'D' side of SUM(CASE WHEN …), superseding DI-8's GROUP_RESCAN band-aid.

Implementation order: DI-8 → DI-9 → DI-1 → DI-3 → DI-2 → DI-6 → DI-4 → DI-5 → DI-7 → DI-10 → DI-11
DVM improvements subtotal: ~2–3 weeks (DI-8/DI-9 are small independent fixes; DI-1–DI-7 are the core engine work; DI-10 is a validation run; DI-11 is predicate pushdown + deep-join optimization)
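DI-1's materialization rule is a small decision over reference counts; a hedged sketch with a hypothetical helper name (the real CTE naming and SQL assembly live in the delta SQL generator):

```rust
/// Render one leaf's pre-change (L0) snapshot as a named CTE clause.
/// NOT MATERIALIZED by default so the planner may inline it; MATERIALIZED
/// once it is referenced 3+ times, so the snapshot is evaluated exactly once
/// instead of redundantly at every reference site.
fn l0_cte(leaf: &str, snapshot_sql: &str, ref_count: usize) -> String {
    let keyword = if ref_count >= 3 {
        "MATERIALIZED"
    } else {
        "NOT MATERIALIZED"
    };
    format!("l0_{leaf} AS {keyword} ({snapshot_sql})")
}
```

For a 6-join query like Q05 this is what collapses 3–10 redundant evaluations of the same pre-change snapshot into one, at the cost of materializing it when shared widely.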
Regression-Free Testing Initiative (Q2 2026)
Addresses 9 structural weaknesses identified in the regression risk analysis. Target: reduce regression escape rate from ~15% to <5%.
| Phase | Item | Status |
|---|---|---|
| P1 | Test infrastructure hardening: #[must_use] on poll helpers; wait_for_condition with exponential backoff; assert_column_types_match | ✅ Done (2026-03-28) |
| P2 | Join multi-cycle correctness: 7 tests — LEFT/RIGHT/FULL join, join-key update, both-sides DML, 4-table chain, NULL key | ✅ Done (2026-03-28) |
| P3 | Differential ≡ Full equivalence: 11 tests covering every major DVM operator class; effective_refresh_mode guard | ✅ Done (2026-03-28) |
| P4 | DVM operator execution: LATERAL MAX subquery multi-cycle (5 cycles) + recursive CTE org hierarchy multi-cycle (5 cycles) | ✅ Done (2026-03-28) |
| P5 | Failure recovery & schema evolution: 6 failure recovery tests (FR-1..6 in e2e_failure_recovery_tests.rs) + 5 schema evolution tests (SE-1..5 in e2e_ddl_event_tests.rs) | ✅ Done (2026-03-28) |
| P6 | MERGE template unit tests: 8 pure-Rust tests — determine_refresh_action (×5) + build_is_distinct_clause boundary (×3) in src/refresh.rs | ✅ Done (2026-03-28) |
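P1's backoff helper can be sketched in plain Rust — a simplified stand-in for the harness helper, whose real signature may differ:

```rust
use std::time::Duration;

/// Poll `cond` up to `max_attempts` times with exponentially growing sleeps,
/// capped at 1s. #[must_use] forces callers to assert on the outcome rather
/// than silently ignore a timed-out wait.
#[must_use]
fn wait_for_condition<F: FnMut() -> bool>(mut cond: F, max_attempts: u32) -> bool {
    let mut delay = Duration::from_millis(10);
    for _ in 0..max_attempts {
        if cond() {
            return true;
        }
        std::thread::sleep(delay);
        delay = (delay * 2).min(Duration::from_secs(1)); // cap the backoff
    }
    false
}
```

Exponential backoff keeps fast conditions fast while avoiding the fixed-sleep flakiness that the regression risk analysis flagged in the E2E suite.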
v0.13.0 total: ~15–23 weeks (Scalability: 6–8w, Partitioning: 5–8w, MERGE Profiling: 1–3w, dbt: 2–3.5d, Multi-tenant: 2–3w, TPC-H harness: ~1d, SQL cleanup: ~1–2d, DVM improvements: ~2–3w)
Exit criteria:
- [x] A-2: Columnar change tracking bitmask skips irrelevant rows; key column classification ✅, __pgt_key_changed annotation ✅, P5 value-only fast path ✅, DiffResult.has_key_changed signal propagation ✅, MERGE value-only UPDATE optimization ✅, upgrade script ✅ Done
- [x] D-4: Shared buffer serves multiple STs via per-source changes_{oid} naming; pgt_change_tracking.tracked_by_pgt_ids reference counting; shared_buffer_stats() observability; property-based test with 5–10 consumers (3 properties, 500 cases); 5 E2E fan-out tests ✅ Done
- [x] PERF-2: buffer_partitioning = 'auto' activates RANGE(lsn) partitioned mode for high-throughput sources — throughput-based should_promote_inner() heuristic, convert_buffer_to_partitioned() runtime migration, 10 unit tests + 3 E2E tests, docs/CONFIGURATION.md updated ✅ Done
- [x] A1-1b: Multi-column RANGE partition keys work end-to-end; composite ROW() predicate triggers partition pruning; 3 E2E tests + 5 unit tests ✅ Done
- [x] A1-1c: alter_stream_table(partition_by => …) repartitions existing storage table without data loss; add/change/remove tested
- [x] A1-1d: LIST partitioning creates PARTITION BY LIST storage; IN-list predicate injected; single-column-only validated; 4 E2E tests pass
- [x] A1-3b: HASH partitioning uses per-partition MERGE loop; auto-creates N child partitions; satisfies_hash_partition() filter; 22 unit tests + 6 E2E tests ✅ Done
- [x] PART-WARN: WARNING emitted when default partition has rows after refresh; warn_default_partition_growth() on both FULL and DIFFERENTIAL paths ✅ Done
- [x] G14-MDED: Deduplication frequency profiling complete; TOTAL_DIFF_REFRESHES + DEDUP_NEEDED_REFRESHES shared-memory atomic counters; pgtrickle.dedup_stats() reports ratio; RFC threshold documented at ≥10% ✅ Done
- [x] PROF-DLT: pgtrickle.explain_delta(st_name, format) function captures delta query plans in text/json/xml/yaml; PGS_PROFILE_DELTA=1 auto-capture to /tmp/delta_plans/; documented in SQL_REFERENCE.md ✅ Done
- [x] C-3: Per-database worker quota enforced; tier-aware priority sort (IMMEDIATE > Hot > Warm > Cold) implemented; GUC + E2E quota tests added; compute_per_db_quota() with burst at 80% cluster load ✅ Done
- [x] TPCH-1/2: TPCH_BENCH=1 mode emits [TPCH_BENCH] lines + summary table; just bench-tpch and bench-tpch-large targets functional ✅ Done
- [x] TPCH-3: Five TPC-H OpTree Criterion benchmarks pass and run without a PostgreSQL backend ✅ Done
- [x] DBT-1/2/3: partition_by, fuse, fuse_ceiling, fuse_sensitivity exposed in dbt macros; change detection wired; integration tests added; README and SQL_REFERENCE.md updated ✅ Done
- [x] SQL-RECUR: Recursive CTE non-monotone audit complete; G1.3 downgraded to P4 — two Tier 3h E2E tests verify recomputation fallback is correct ✅ Done
- [x] SQL-PG16-1: IS JSON predicate accepted in DIFFERENTIAL defining queries; E2E tests in e2e_expression_tests.rs confirm correct delta behaviour ✅ Done
- [x] SQL-PG16-2: JSON_OBJECT, JSON_ARRAY, JSON_OBJECTAGG, JSON_ARRAYAGG accepted in DIFFERENTIAL defining queries; E2E tests in e2e_expression_tests.rs confirm correct delta behaviour ✅ Done
- [x] scripts/check_upgrade_completeness.sh passes (all catalog changes in sql/pg_trickle--0.12.0--0.13.0.sql) ✅ Done — 58 functions, 8 new columns, all covered
- [x] DI-8: is_algebraically_invertible() detects Expr::Raw("CASE …") and returns false for SUM(CASE WHEN …) (Q14 unaffected — ComplexExpression); Q12 removed from DIFFERENTIAL_SKIP_ALLOWLIST; 4 unit tests ✅ Done
- [x] DI-9: scheduler_interval_ms cap raised to 600,000 ms; scheduler skips IMMEDIATE-mode tables in check_schedule(); verified safe for CALCULATED dependants ✅ Done
- [x] DI-1: Named CTE L₀ snapshots implemented (NOT MATERIALIZED default, MATERIALIZED when ref ≥ 3); Q05/Q09 pass DIFFERENTIAL correctness ✅ Done
- [x] DI-2: NOT EXISTS anti-join replaces EXCEPT ALL in build_pre_change_snapshot_sql(); per-leaf conditional EXCEPT ALL fallback when delta > max_delta_fraction; aggregate UPDATE-split blocked on Q12 drift root cause (DI-8 band-aid retained) ✅ Done
- [x] DI-3: Already implemented — non-algebraic aggregate old rescan filtered via EXISTS (… IS NOT DISTINCT FROM …) to affected groups; NULL-safe ✅ Done
- [x] DI-6: Semi-join R_old lazy materialization with key push-down; Q20 DIFF passes at SF=0.01 ✅ Done
- [x] DI-4/5/7: R₀ cache (subset of DI-1), Part 3 threshold raised from 3→5, strategy selector + max_delta_fraction complete ✅ Done
- [x] DI-10: bench-tpch-sf1 target added; 22/22 queries pass at SF=0.01 (3 cycles, zero drift) ✅ Done
- [x] DI-11: Predicate pushdown enabled with scalar-subquery guard; deep-join L₀ threshold (4 scans); deep-join planner hints (5+ total scans); 22/22 TPC-H DIFFERENTIAL ✅ Done
- [x] Extension upgrade path tested (0.12.0 → 0.13.0) ✅ Done