Plain-language companion: v0.22.0.md

v0.22.0 — Production Scalability & Downstream Integration

Status: ✅ Released. Driven by PLAN_OVERALL_ASSESSMENT.md P1 items not addressed in v0.21.0 and the highest-value P2 items.

Release Theme

This release delivers the two highest-impact items from the overall assessment deferred from v0.21.0: a minimal-viable in-database parallel refresh worker pool (the single largest scalability unlock) and a downstream CDC publication so stream table changes can drive Kafka, Debezium, and event-sourcing pipelines without a second replication slot. Two P2 items ship alongside: a predictive cost model for adaptive refresh and SLA-driven tier auto-assignment. The transactional outbox helper moves to v0.24.0, where it ships alongside a companion inbox helper as a complete transactional messaging solution.

Downstream CDC Publication (P1 — §9.2)

In plain terms: pg_trickle consumes CDC from source tables but cannot emit changes downstream. This adds stream_table_to_publication() — a helper that exposes every row applied to a stream table as a PostgreSQL logical replication publication so Kafka Connect, Debezium, and event-sourcing pipelines can subscribe with zero code and no second replication slot.

| Item | Description | Effort | Ref |
| --- | --- | --- | --- |
| CDC-PUB-1 | stream_table_to_publication(name TEXT) SQL function. Creates a logical replication publication for the target stream table using pgt_inserted_rows/pgt_deleted_rows output from the MERGE step. Catalog column downstream_publication_name tracks the association. | 2–3d | PLAN_OVERALL_ASSESSMENT.md §9.2 |
| CDC-PUB-2 | Lifecycle management. drop_stream_table_publication(name), auto-drop on drop_stream_table(), recreation on schema-change rebuild. | 1d | PLAN_OVERALL_ASSESSMENT.md §9.2 |
| CDC-PUB-3 | pg_stat_stream_tables.downstream_publication column. Surface the publication name (or NULL) in the monitoring view. | 0.5d | PLAN_OVERALL_ASSESSMENT.md §9.2 |
| CDC-PUB-4 | E2E tests. Create publication; verify subscriber receives insert/update/delete events; drop and verify cleanup. | 1d | PLAN_OVERALL_ASSESSMENT.md §9.2 |
| CDC-PUB-5 | Documentation. docs/SQL_REFERENCE.md section on downstream publications; tutorial showing the Kafka Connect integration pattern. | 1d | — |

Downstream CDC publication subtotal: ~1–1.5 weeks
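The lifecycle pairing in CDC-PUB-1/CDC-PUB-2 can be sketched as the statements the helpers might emit. This is a minimal illustrative sketch, not the actual implementation: the pgt_pub_ naming prefix and the exact catalog UPDATE are assumptions; only pgt_stream_tables and downstream_publication_name come from the plan above.

```python
def publication_sql(stream_table: str) -> list[str]:
    """Statements a stream_table_to_publication() helper might run
    (hypothetical sketch; pgt_pub_ prefix is an assumption)."""
    pub = f"pgt_pub_{stream_table}"
    return [
        # Expose every row applied to the stream table via logical replication.
        f"CREATE PUBLICATION {pub} FOR TABLE {stream_table};",
        # Record the association so drop_stream_table() can clean it up later.
        f"UPDATE pgt_stream_tables SET downstream_publication_name = '{pub}' "
        f"WHERE name = '{stream_table}';",
    ]

def drop_publication_sql(stream_table: str) -> list[str]:
    """Mirror-image cleanup, as run by drop_stream_table_publication()."""
    pub = f"pgt_pub_{stream_table}"
    return [
        f"DROP PUBLICATION IF EXISTS {pub};",
        f"UPDATE pgt_stream_tables SET downstream_publication_name = NULL "
        f"WHERE name = '{stream_table}';",
    ]
```

The key design point is that the catalog column and the publication are created and dropped together, so a schema-change rebuild can recreate the publication from the catalog alone.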

In-Database Parallel Refresh Worker Pool — Minimal Viable Slice (P1 — §3.1)

In plain terms: The scheduler today runs one refresh at a time per tick. This installs a dynamic bgworker pool — a coordinator owns the DAG, workers execute refreshes — so independent stream tables at the same DAG level refresh simultaneously. Deployments with 200+ STs or long refresh queues get immediate throughput gains. Opt-in via max_parallel_workers; default 0 preserves existing serial behaviour.

| Item | Description | Effort | Ref |
| --- | --- | --- | --- |
| PAR-1 | Coordinator / worker process split. Coordinator BGW manages the tick; dispatches ready-to-run STs to a PgLwLock-protected shared work queue; worker BGWs pop entries and execute refresh transactions. | 1.5–2wk | plans/sql/PLAN_PARALLELISM.md §3 |
| PAR-2 | pg_trickle.max_parallel_workers GUC (default 0 = serial, range 0–32). Gate the entire parallel path so deployments can opt in incrementally. | 1d | plans/sql/PLAN_PARALLELISM.md §4 |
| PAR-3 | DAG level extraction. Re-use topological_levels() already in dag.rs to identify STs that can run concurrently (same level, no intra-level edges). | 0.5d | plans/sql/PLAN_PARALLELISM.md §3 |
| PAR-4 | Worker crash recovery. Coordinator marks the ST ERROR in pgt_refresh_history on worker crash (same behaviour as serial crash); respawns the worker slot. | 1d | plans/sql/PLAN_PARALLELISM.md §5 |
| PAR-5 | E2E tests: correctness + throughput. Diamond DAG with concurrent same-level refreshes; verify no partial-consistency window. Benchmark: wall-clock tick latency vs serial at 50-ST scale. | 1d | plans/sql/PLAN_PARALLELISM.md §6 |

Parallel refresh subtotal: ~3–4 weeks
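The coordinator/worker interaction in PAR-1/PAR-3 can be sketched in miniature: group the DAG into levels (mirroring what topological_levels() in dag.rs provides), then dispatch each level to a pool with a barrier between levels. Threads stand in for background workers here; the function names and shapes are illustrative, not the extension's API.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def topological_levels(nodes, edges):
    """Group DAG nodes into levels (Kahn's algorithm by layer).
    Nodes in one level share no edges, so they can refresh concurrently."""
    indeg = {n: 0 for n in nodes}
    adj = defaultdict(list)
    for src, dst in edges:
        adj[src].append(dst)
        indeg[dst] += 1
    level, levels = [n for n in nodes if indeg[n] == 0], []
    while level:
        levels.append(level)
        nxt = []
        for n in level:
            for m in adj[n]:
                indeg[m] -= 1
                if indeg[m] == 0:
                    nxt.append(m)
        level = nxt
    return levels

def run_tick(levels, refresh, max_parallel_workers=0):
    """Coordinator loop: serial when the GUC is 0 (existing behaviour),
    otherwise fan each level out to the pool with a barrier between levels."""
    if max_parallel_workers == 0:
        for lvl in levels:
            for st in lvl:
                refresh(st)
        return
    with ThreadPoolExecutor(max_workers=max_parallel_workers) as pool:
        for lvl in levels:
            list(pool.map(refresh, lvl))  # wait: next level depends on this one
```

The per-level barrier is what rules out a partial-consistency window: a dependent ST never starts until every ST in the level above has committed.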

Predictive Refresh Cost Model (P2 — §9.3)

In plain terms: The current adaptive threshold reacts after a slow differential refresh. This extends self-monitoring to predict duration_ms from rows_inserted + rows_deleted via linear regression over the last hour. When the forecast exceeds last_full_ms × 1.5, pg_trickle switches to FULL pre-emptively — eliminating the one-bad-cycle latency spike entirely.

| Item | Description | Effort | Ref |
| --- | --- | --- | --- |
| PRED-1 | Linear regression forecaster. Fit duration_ms ~ delta_rows over pg_trickle.prediction_window minutes of pgt_refresh_history per ST. Expose fitted slope and intercept as columns in df_threshold_advice. | 1–2d | PLAN_OVERALL_ASSESSMENT.md §9.3 |
| PRED-2 | Pre-emptive FULL switch. If predicted_diff_ms > last_full_ms × pg_trickle.prediction_ratio (default 1.5), override strategy to FULL; log refresh_reason = 'predicted_cost_exceeds_full'. | 1d | PLAN_OVERALL_ASSESSMENT.md §9.3 |
| PRED-3 | Cold-start fallback. When fewer than pg_trickle.prediction_min_samples (default 5) history rows exist, fall back to the existing fixed-threshold logic. | 0.5d | — |
| PRED-4 | E2E test + proptest. Verify pre-emptive switch fires under synthetic cost spike; proptest checks cold-start fallback boundary (0–4 samples). | 1d | — |

Predictive cost model subtotal: ~1 week
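The PRED-1/PRED-2/PRED-3 decision path reduces to ordinary least squares plus a threshold check, sketched below under stated assumptions: history rows are (delta_rows, duration_ms) pairs, and the defaults mirror pg_trickle.prediction_ratio = 1.5 and pg_trickle.prediction_min_samples = 5 from the table above. Function names are illustrative.

```python
def fit_linear(xs, ys):
    """Ordinary least squares for duration_ms ~ delta_rows."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:                      # degenerate: every delta identical
        return 0.0, my
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def choose_strategy(history, delta_rows, last_full_ms,
                    ratio=1.5, min_samples=5):
    """history: [(delta_rows, duration_ms), ...] from recent refreshes."""
    if len(history) < min_samples:
        return "DIFFERENTIAL"         # cold start: keep fixed-threshold logic
    slope, intercept = fit_linear([h[0] for h in history],
                                  [h[1] for h in history])
    predicted_diff_ms = slope * delta_rows + intercept
    if predicted_diff_ms > last_full_ms * ratio:
        return "FULL"                 # would log 'predicted_cost_exceeds_full'
    return "DIFFERENTIAL"
```

Because the switch happens before the refresh runs, the forecast replaces the one bad cycle the reactive threshold needs in order to learn the same lesson.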

SLA-Driven Tier Auto-Assignment (P2 — §9.7)

In plain terms: alter_stream_table(name, sla => interval '30 seconds') lets the scheduler pick the right tier automatically — no manual tier tuning required. Removes the expert-knowledge barrier to tiered scheduling.

| Item | Description | Effort | Ref |
| --- | --- | --- | --- |
| SLA-1 | sla parameter on create_stream_table / alter_stream_table. Accepts an INTERVAL; stored as freshness_deadline_ms in pgt_stream_tables. | 0.5d | PLAN_OVERALL_ASSESSMENT.md §9.7 |
| SLA-2 | Initial tier assignment. On creation or alter_stream_table with sla set, assign to the tier whose dispatch_gap ≤ sla, considering current queue depth. | 1d | PLAN_OVERALL_ASSESSMENT.md §9.7 |
| SLA-3 | Dynamic re-assignment. After each tick, check whether the ST's tier still meets the SLA given measured queue depth; bump one tier up or down if the gap is consistently exceeded or under-utilised by >2×. | 1d | PLAN_OVERALL_ASSESSMENT.md §9.7 |
| SLA-4 | E2E test. Create ST with 30 s SLA; inject artificial tick delay; verify tier promotion within 3 cycles. | 0.5d | — |

SLA-driven tier subtotal: ~3–4 days
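The SLA-2 assignment rule can be sketched as: among the tiers whose dispatch gap plus observed queue delay still meets the deadline, pick the slowest (cheapest) one; if none qualifies, fall back to the fastest tier. This is an illustrative reading of "tier whose dispatch_gap ≤ sla, considering current queue depth" above; the tier names and tie-breaking are assumptions.

```python
def assign_tier(tiers, sla_ms, queue_delay_ms=0):
    """tiers: [(name, dispatch_gap_ms), ...].  Choose the slowest tier
    whose gap plus measured queue delay still meets the SLA deadline."""
    feasible = [t for t in tiers if t[1] + queue_delay_ms <= sla_ms]
    if feasible:
        return max(feasible, key=lambda t: t[1])[0]   # cheapest tier meeting SLA
    return min(tiers, key=lambda t: t[1])[0]          # SLA unattainable: best effort
```

As queue delay grows, the same rule naturally promotes an ST to a faster tier, which is the mechanism SLA-3's per-tick re-check relies on.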

Implementation Phases

| Phase | Description | Duration |
| --- | --- | --- |
| CDC-PUB | Downstream CDC publication: SQL function, lifecycle, monitoring, tests, docs | Days 1–8 |
| PAR | Parallel refresh: coordinator/worker split, GUC, DAG levels, recovery, tests | Days 9–28 |
| PRED | Predictive cost model: regression, pre-emptive switch, cold-start fallback, tests | Days 29–33 |
| SLA | SLA-driven tier: sla param, initial assignment, dynamic re-assignment, tests | Days 34–37 |

v0.22.0 total: ~5 weeks (downstream CDC + parallel refresh + predictive cost + SLA tier)

Exit criteria:

- [x] CDC-PUB-1: stream_table_to_publication(name) creates a working logical publication
- [x] CDC-PUB-2: Publication is dropped automatically when the stream table is dropped
- [x] CDC-PUB-3: downstream_publication column visible in pg_stat_stream_tables
- [x] CDC-PUB-4: Subscriber receives correct insert/update/delete events; E2E test passes
- [x] PAR-2: max_parallel_workers = 0 (default) produces identical results to serial mode
- [x] PAR-1/PAR-3: max_parallel_workers ≥ 1 dispatches independent same-level STs concurrently
- [x] PAR-4: Worker crash marks ST ERROR; coordinator respawns worker slot
- [x] PAR-5: Diamond DAG concurrent correctness test passes; throughput improvement benchmarked
- [x] PRED-1: Fitted coefficients visible in df_threshold_advice
- [x] PRED-2: Pre-emptive FULL switch fires under synthetic spike; refresh_reason = 'predicted_cost_exceeds_full' logged
- [x] PRED-3: Cold-start fallback active when fewer than prediction_min_samples history rows exist
- [x] SLA-1: create_stream_table(..., sla => '30 seconds') stores freshness_deadline_ms
- [x] SLA-2: Initial tier assignment matches SLA requirement on creation
- [x] SLA-3: Tier auto-adjusts within 3 cycles when queue depth breaches SLA
- [x] Extension upgrade path tested (0.21.0 → 0.22.0)
- [x] just check-version-sync passes