Plain-language companion: v0.30.0.md

v0.30.0 — Pre-GA Correctness & Stability Sprint

Status: ✅ Released (v0.30.0). Derived from plans/PLAN_OVERALL_ASSESSMENT_3.md §3, §4, §7, §8. This release must land before v1.0.0 GA. Its purpose is to close every P0 and P1 gap identified in the v0.27.0 assessment so that the stable release inherits a clean correctness baseline.

Release Theme

v0.30.0 is the quality gate before the feature-rich v0.28–v0.29 arc and the v1.0 stable release. It fixes the remaining correctness defects that could produce silent wrong answers (EC-01 phantom drift, snapshot non-atomicity), eliminates the operational failure modes that could surprise production operators (unbounded IVM/template caches, text-based SPI error classification, snapshot partial-restore data loss), closes the documentation gaps that block self-service operation (upgrade notes, GUC reference, ERRORS guide), and hardens the test suite with the fuzz and E2E coverage that the new SNAP/PLAN/CLUS/METR functionality currently lacks. No new user-visible SQL API is added. The release is a prerequisite for v1.0.0; any item not landed here will block GA.


Correctness

| ID | Title | Effort | Priority |
|---|---|---|---|
| CORR-1 | Complete EC-01 phantom-row convergence | M | P0 |
| CORR-2 | Generalise SubLink detection to CASE/COALESCE/FuncCall | S | P1 |
| CORR-3 | Replace SELECT * EXCEPT with explicit column list in restore | S | P1 |

CORR-1 — Complete EC-01 phantom-row convergence. The v0.24.0 hash fix is necessary but not sufficient: is_deduplicated: false at src/dvm/operators/join.rs:657–668 still forces the MERGE to aggregate by row-id, and the conditional wiring that invokes src/refresh/phd1.rs cross-cycle cleanup is incomplete. Option A: wire every refresh cycle through unconditional PH-D1 cleanup with a small batch size. Option B: re-derive the Part-2 row-id from the retained base-table key snapshot so that is_deduplicated: true can be set for INNER joins. Verify with the deterministic reproducer from test_tpch_q07_ec01b_combined_delete and the IMMEDIATE-mode property tests. Schema change: No. Dependencies: v0.24.0 PH-D1 skeleton.

CORR-2 — SubLink detection for CASE/COALESCE/FuncCall. node_tree_contains_sublink only recurses into T_BoolExpr arguments; SubLinks inside T_CaseExpr, T_CoalesceExpr, or function-argument lists silently downgrade the query to FULL refresh without a user-visible message. Generalise the walker or emit a NOTICE on forced downgrade. Schema change: No. Dependencies: None.
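
The "generalise the walker" option can be pictured with a toy tree. This is only a sketch: the `Node` enum below is a hypothetical stand-in for the PostgreSQL parse nodes named above, not the extension's actual types, and the real `node_tree_contains_sublink` may be structured differently.

```rust
/// Hypothetical, simplified stand-in for the parse-tree node kinds
/// mentioned in CORR-2 (the real walker operates on PostgreSQL nodes).
#[allow(dead_code)]
#[derive(Debug)]
enum Node {
    Const(i64),
    SubLink,
    BoolExpr(Vec<Node>),
    CaseExpr { whens: Vec<(Node, Node)>, default: Option<Box<Node>> },
    CoalesceExpr(Vec<Node>),
    FuncCall { name: String, args: Vec<Node> },
}

/// Returns true if a SubLink appears anywhere in the expression tree,
/// recursing into CASE arms, COALESCE arguments and function arguments
/// as well as boolean expressions.
fn contains_sublink(node: &Node) -> bool {
    match node {
        Node::SubLink => true,
        Node::Const(_) => false,
        Node::BoolExpr(args) | Node::CoalesceExpr(args) => {
            args.iter().any(contains_sublink)
        }
        Node::FuncCall { args, .. } => args.iter().any(contains_sublink),
        Node::CaseExpr { whens, default } => {
            whens
                .iter()
                .any(|(cond, result)| contains_sublink(cond) || contains_sublink(result))
                || default.as_deref().map_or(false, contains_sublink)
        }
    }
}
```

An exhaustive `match` has the side benefit that adding a new node kind forces a decision about whether to recurse into it.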

CORR-3 — Explicit column list for restore. restore_from_snapshot_impl unconditionally emits SELECT * EXCEPT(…), which is only available on PG 18 with specific minor-version patches. Replace with a pg_attribute catalog walk that builds the explicit column list, eliminating PG-minor sensitivity entirely. Schema change: No. Dependencies: None.
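
As an illustration of the explicit-list approach: once the column names have been read from the catalog (for example from pg_attribute, keeping rows with attnum > 0 and NOT attisdropped, ordered by attnum), assembling the SELECT is plain string building. The function names and the excluded column below are hypothetical; this is a sketch, not the code in restore_from_snapshot_impl.

```rust
/// Quote an SQL identifier: wrap it in double quotes and double any
/// embedded double quote.
fn quote_ident(name: &str) -> String {
    format!("\"{}\"", name.replace('"', "\"\""))
}

/// Build `SELECT <cols> FROM <table>` from catalog-derived column names,
/// skipping the excluded ones. `qualified_table` is assumed to be already
/// schema-qualified and quoted by the caller.
fn build_restore_select(
    qualified_table: &str,
    columns: &[&str],
    excluded: &[&str],
) -> String {
    let list: Vec<String> = columns
        .iter()
        .filter(|c| !excluded.contains(*c))
        .map(|c| quote_ident(c))
        .collect();
    format!("SELECT {} FROM {}", list.join(", "), qualified_table)
}
```

Because the list is derived from the catalog at call time, the generated statement uses only syntax available on every supported server version.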


Stability

| ID | Title | Effort | Priority |
|---|---|---|---|
| STAB-1 | Wrap snapshot/restore in SubTransaction RAII + exclusive lock | M | P0 |
| STAB-2 | Bound IVM_DELTA_CACHE via clock-style eviction | S | P1 |
| STAB-3 | Age-based purge for L2 template-cache catalog table | S | P1 |
| STAB-4 | Surface snapshot catalog INSERT failure as WARNING | XS | P0 |
| STAB-5 | Replace wal_decoder.rs .expect() with c"test_decoding" | XS | P1 |
| STAB-6 | Clear IVM thread-local cache via XactCallback on subxact abort | S | P1 |

STAB-1 — Snapshot/restore atomicity. snapshot_stream_table_impl and restore_from_snapshot_impl (src/api/snapshot.rs:90–260) run independent SPI statements with no subxact bracket. A backend crash between the CREATE TABLE AS and the catalog INSERT leaves an orphan table; a mid-restore statement failure leaves the storage table truncated. Wrap the entire operation in the same SubTransaction RAII helper used by src/scheduler.rs:250–308. For restore, open with LOCK TABLE … IN ACCESS EXCLUSIVE MODE before the TRUNCATE. Propagate snapshot-version-check failure as typed SnapshotSchemaVersionMismatch instead of silently treating None as compatible. Schema change: No. Dependencies: scheduler SubTransaction helper (shipped v0.27.0).
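
The RAII pattern referred to above can be sketched as follows. The `Backend` type only records the statements it would run, and `SubTransaction` is a hypothetical guard, not the helper in src/scheduler.rs; the point is that an early return drops the guard, which rolls back everything done since `begin`.

```rust
use std::cell::RefCell;

/// Hypothetical stand-in for the backend: it only logs what would run.
#[derive(Default)]
struct Backend {
    log: RefCell<Vec<String>>,
}

impl Backend {
    fn exec(&self, sql: &str) {
        self.log.borrow_mut().push(sql.to_string());
    }
}

/// RAII guard: rolls back on drop unless `commit` was called.
struct SubTransaction<'a> {
    backend: &'a Backend,
    committed: bool,
}

impl<'a> SubTransaction<'a> {
    fn begin(backend: &'a Backend) -> Self {
        backend.exec("BEGIN SUBXACT");
        SubTransaction { backend, committed: false }
    }

    fn commit(mut self) {
        self.backend.exec("RELEASE SUBXACT");
        self.committed = true; // Drop runs next and sees the flag
    }
}

impl<'a> Drop for SubTransaction<'a> {
    fn drop(&mut self) {
        if !self.committed {
            self.backend.exec("ROLLBACK SUBXACT");
        }
    }
}

/// Restore flow: lock first, then truncate and reload inside the guard.
fn restore(backend: &Backend, fail_midway: bool) -> Result<(), String> {
    let sub = SubTransaction::begin(backend);
    backend.exec("LOCK TABLE ... IN ACCESS EXCLUSIVE MODE");
    backend.exec("TRUNCATE ...");
    if fail_midway {
        // Early return drops `sub`, which undoes the TRUNCATE.
        return Err("insert failed".to_string());
    }
    backend.exec("INSERT ... SELECT ...");
    sub.commit();
    Ok(())
}
```

With this shape a mid-restore failure cannot leave the storage table truncated, because the TRUNCATE belongs to the subtransaction that gets rolled back.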

STAB-2 — Bound IVM_DELTA_CACHE. The thread-local IVM_DELTA_CACHE at src/ivm.rs:107–143 has no eviction. Implement a clock-style eviction that respects pg_trickle.template_cache_max_entries — the same GUC already used by the L1 delta-template cache. Schema change: No. Dependencies: None.
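
For readers unfamiliar with clock (second-chance) eviction, a minimal sketch follows. It assumes a u32 key (standing in for an OID) and a fixed capacity that would come from pg_trickle.template_cache_max_entries; `ClockCache` is a hypothetical type, not the structure in src/ivm.rs.

```rust
use std::collections::HashMap;

/// Fixed-capacity cache with clock (second-chance) eviction.
struct ClockCache<V> {
    slots: Vec<Option<(u32, V, bool)>>, // (key, value, referenced bit)
    index: HashMap<u32, usize>,         // key -> slot position
    hand: usize,                        // next slot the clock hand inspects
}

impl<V> ClockCache<V> {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be positive");
        let mut slots = Vec::with_capacity(capacity);
        for _ in 0..capacity {
            slots.push(None);
        }
        ClockCache { slots, index: HashMap::new(), hand: 0 }
    }

    fn len(&self) -> usize {
        self.index.len()
    }

    /// Look up a key and mark the entry as recently used.
    fn get(&mut self, key: u32) -> Option<&V> {
        let pos = *self.index.get(&key)?;
        match self.slots[pos].as_mut() {
            Some(entry) => {
                entry.2 = true;
                Some(&entry.1)
            }
            None => None,
        }
    }

    /// Insert a value, evicting an unreferenced entry if the cache is full.
    fn insert(&mut self, key: u32, value: V) {
        if let Some(&pos) = self.index.get(&key) {
            self.slots[pos] = Some((key, value, true));
            return;
        }
        loop {
            let pos = self.hand;
            self.hand = (self.hand + 1) % self.slots.len();
            // Empty slots are reused; referenced entries get a second chance.
            let reuse = match self.slots[pos].as_mut() {
                None => true,
                Some(entry) => {
                    if entry.2 {
                        entry.2 = false;
                        false
                    } else {
                        true
                    }
                }
            };
            if reuse {
                if let Some((old_key, _, _)) = self.slots[pos].take() {
                    self.index.remove(&old_key);
                }
                self.slots[pos] = Some((key, value, false));
                self.index.insert(key, pos);
                return;
            }
        }
    }
}
```

The hand clears at most one referenced bit per slot before it finds a victim, so an insert terminates within two sweeps and the cache never grows past its capacity.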

STAB-3 — L2 template-cache age purge. src/template_cache.rs writes one row per stream table but never purges stale entries from ALTER QUERY without DROP or source-OID renumbering. Add a cached_at TIMESTAMPTZ column (migration) and a lightweight batched DELETE in the scheduler’s launcher tick, age-bounded by a new pg_trickle.template_cache_max_age_hours GUC (default 168 h = 7 days). Schema change: Yes — ALTER TABLE pgtrickle.pgt_template_cache ADD COLUMN cached_at TIMESTAMPTZ NOT NULL DEFAULT now().

STAB-4 — Snapshot catalog INSERT failure warning. snapshot_stream_table_impl discards the Result of the catalog INSERT INTO pgtrickle.pgt_snapshots with a // best-effort comment at src/api/snapshot.rs:152–167. A silent failure produces a function return value (the snapshot path) that list_snapshots() will never find. Promote the failure to a pgrx WARNING message. Schema change: No.
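
The change amounts to inspecting the Result instead of discarding it. In the sketch below the catalog INSERT and the warning sink are passed in as closures so the example is self-contained; both are hypothetical stand-ins (in the extension the sink would presumably be pgrx's warning! macro), and `record_snapshot` is not a function from the codebase.

```rust
/// Run the catalog INSERT and surface a failure as a warning instead of
/// dropping it. Returns the snapshot name either way, matching the
/// behaviour described above where the snapshot table already exists.
fn record_snapshot<F, W>(insert_catalog_row: F, mut warn: W, snapshot_name: &str) -> String
where
    F: FnOnce() -> Result<(), String>,
    W: FnMut(String),
{
    if let Err(e) = insert_catalog_row() {
        warn(format!(
            "snapshot {} was created but could not be recorded in the catalog: {}",
            snapshot_name, e
        ));
    }
    snapshot_name.to_string()
}
```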

STAB-5 — wal_decoder.rs .expect() removal. CString::new("test_decoding").expect(…) at src/wal_decoder.rs:307 is the last production .expect() outside the sound unreachable!() post-report() calls. Replace with c"test_decoding" (compile-time CStr literal; Rust 1.77+ stable) to eliminate the unreachable runtime path entirely. Schema change: No.

STAB-6 — IVM cache XactCallback. If an IVM trigger function fails mid-statement, the __pgt_newtable_<oid> / __pgt_oldtable_<oid> temp tables are cleaned up by PG subxact abort, but the thread-local IVM_DELTA_CACHE is not cleared. A stale entry can survive a failed apply and be reused in the next statement. Register a XactCallback that calls invalidate_ivm_delta_cache on XACT_EVENT_ABORT_SUB. Schema change: No.
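
A minimal model of the callback, using only the standard library. The `XactEvent` enum and function names are hypothetical simplifications; in PostgreSQL's C API, subtransaction aborts are delivered through a callback registered with RegisterSubXactCallback, as SUBXACT_EVENT_ABORT_SUB.

```rust
use std::cell::RefCell;
use std::collections::HashMap;

thread_local! {
    /// Stand-in for IVM_DELTA_CACHE: source OID -> cached delta SQL.
    static DELTA_CACHE: RefCell<HashMap<u32, String>> = RefCell::new(HashMap::new());
}

/// Hypothetical subset of transaction events relevant to the cache.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq)]
enum XactEvent {
    Commit,
    Abort,
    AbortSub,
}

fn invalidate_delta_cache() {
    DELTA_CACHE.with(|c| c.borrow_mut().clear());
}

fn cache_len() -> usize {
    DELTA_CACHE.with(|c| c.borrow().len())
}

/// Callback body: any abort, including a subtransaction abort, drops the
/// cached entries so a failed apply cannot leak state into the next statement.
fn xact_callback(event: XactEvent) {
    match event {
        XactEvent::Abort | XactEvent::AbortSub => invalidate_delta_cache(),
        XactEvent::Commit => {}
    }
}
```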


Performance

| ID | Title | Effort | Priority |
|---|---|---|---|
| PERF-1 | Implement L0 shared-shmem dshash template cache | L | P1 |
| PERF-2 | Bound parser memory via max_parse_nodes GUC + XactCallback | M | P1 |
| PERF-3 | Add missing benchmark suite (SNAP, PLAN, CLUS, IVM apply, WAL decoder) | M | P1 |
| PERF-4 | Reduce catalog hot-path in metrics_summary() / cluster_worker_summary() | M | P2 |

PERF-1 — L0 dshash shared-memory template cache. src/shmem.rs:680–710 wires L0_POPULATED_VERSION as a signal that the L2 catalog cache was populated, but the actual dshash data structure that would store delta SQL in shared memory is not implemented. Build the dshash store so that any backend can satisfy a cache lookup from shared memory rather than paying the ~1 ms L2 catalog SELECT on every cold start. Expected win: erase the remaining cold-backend latency tail in high-backend-count deployments. Schema change: No. Dependencies: pg_dsm_create and dshash_create pgrx bindings (available in pgrx 0.18).

PERF-2 — Parser memory bounds. pg_trickle.max_parse_depth (G13-SD) bounds recursion depth but not total node count. A large IN (1, …, 1_000_000) list still allocates unboundedly in PARSE_ADVISORY_WARNINGS and cte_ctx.registry. Add a pg_trickle.max_parse_nodes GUC (default 100 000) and reject queries that exceed it with QueryTooComplex. Clear thread-locals at end-of-statement via XactCallback. Schema change: No (new GUC only).
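
One way to picture the node budget: a counter that is charged for every node visited and fails once the configured limit is exceeded. `NodeBudget` and `walk_in_list` are hypothetical names for this sketch; only the QueryTooComplex error and the 100 000 default come from the text above.

```rust
#[derive(Debug, PartialEq)]
enum ParseError {
    QueryTooComplex { limit: usize },
}

/// Running count of parse nodes visited for the current statement.
struct NodeBudget {
    limit: usize,
    used: usize,
}

impl NodeBudget {
    fn new(limit: usize) -> Self {
        NodeBudget { limit, used: 0 }
    }

    /// Charge `n` nodes against the budget; fail once the limit is exceeded.
    fn charge(&mut self, n: usize) -> Result<(), ParseError> {
        self.used = self.used.saturating_add(n);
        if self.used > self.limit {
            Err(ParseError::QueryTooComplex { limit: self.limit })
        } else {
            Ok(())
        }
    }
}

/// Toy walker for an `IN (...)` list: one node per list element.
fn walk_in_list(items: usize, budget: &mut NodeBudget) -> Result<(), ParseError> {
    for _ in 0..items {
        budget.charge(1)?;
    }
    Ok(())
}
```

Because the walker stops at the first failed charge, a million-element list is rejected after 100 001 nodes rather than after the full allocation.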

PERF-3 — Missing benchmarks. Seven performance-critical paths have no Criterion coverage: snapshot/restore round-trip, predictive planner recommend_schedule at varying history lengths, cluster_worker_summary() at scale, L2 template-cache hit/miss latency, IVM apply path, WAL decoder poll loop, multi-database fairness. Add a benchmark file or extend benches/refresh_bench.rs to cover all seven. Schema change: No.

PERF-4 — Reduce catalog hot-path in metrics_summary() and cluster_worker_summary(). Both new v0.27.0 functions re-query pgtrickle.pgt_stream_tables and pg_stat_activity on every Prometheus scrape (default 5 s interval). At 200+ stream tables this adds a full catalog scan per database every 5 s, on top of the scheduler’s own reads. Introduce a short-lived shmem-backed snapshot for these read paths that both surfaces share, refreshed once per scheduler tick. Controlled by pg_trickle.metrics_catalog_snapshot_ttl_ms GUC (default 4000 ms). Schema change: No.
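
The caching idea can be sketched as a value with a time-to-live. This version refreshes lazily on read and lives in ordinary memory, whereas the plan above calls for a shmem-backed snapshot refreshed once per scheduler tick; `CatalogSnapshot` and the explicit `now_ms` parameter are assumptions made to keep the example self-contained.

```rust
/// Cached value plus the time it was loaded (hypothetical type).
struct CatalogSnapshot<T> {
    value: Option<T>,
    loaded_at_ms: u64,
    ttl_ms: u64,
}

impl<T: Clone> CatalogSnapshot<T> {
    fn new(ttl_ms: u64) -> Self {
        CatalogSnapshot { value: None, loaded_at_ms: 0, ttl_ms }
    }

    /// Return the cached value while it is younger than the TTL;
    /// otherwise call `load` (the expensive catalog scan) and cache it.
    fn get_or_refresh<F: FnOnce() -> T>(&mut self, now_ms: u64, load: F) -> T {
        if let Some(v) = &self.value {
            if now_ms.saturating_sub(self.loaded_at_ms) < self.ttl_ms {
                return v.clone();
            }
        }
        let v = load();
        self.value = Some(v.clone());
        self.loaded_at_ms = now_ms;
        v
    }
}
```

With a 4000 ms TTL and a 5 s scrape interval, each scrape still triggers one load, but metrics_summary() and cluster_worker_summary() called within the same scrape would share it.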


Scalability

| ID | Title | Effort | Priority |
|---|---|---|---|
| SCAL-1 | Replace text-based SPI error classification with SQLSTATE codes | M | P0 |
| SCAL-2 | Replace refresh::* blanket re-exports with explicit pub use | XS | P1 |
| SCAL-3 | Remove dead #[allow(unused_imports)] shims in refresh/orchestrator.rs | XS | P2 |

SCAL-1 — SQLSTATE-based error classification. classify_spi_error_retryable at src/error.rs:213–260 matches English text fragments. On a PostgreSQL build with non-English lc_messages, every pattern silently breaks and every SPI error becomes retryable. Push SQLSTATE through pgrx (pg_sys::ErrorData.sqlerrcode) and classify by 5-character code. Schema change: No. Dependencies: pgrx 0.18 ErrorData access.
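
A sketch of classification by code rather than by message text. PostgreSQL packs the five SQLSTATE characters into an integer at six bits each (the MAKE_SQLSTATE macro), so the code has to be unpacked before matching. Which codes count as retryable is an assumption made for illustration; the extension's actual list may differ.

```rust
/// Pack a 5-character SQLSTATE the way PostgreSQL's MAKE_SQLSTATE does:
/// six bits per character, first character in the lowest bits.
fn make_sqlstate(code: &str) -> i32 {
    code.bytes()
        .take(5)
        .enumerate()
        .fold(0i32, |acc, (i, b)| {
            acc | (((b.wrapping_sub(b'0') & 0x3F) as i32) << (6 * i))
        })
}

/// Inverse of `make_sqlstate`: recover the 5-character code.
fn unpack_sqlstate(mut packed: i32) -> String {
    let mut out = String::with_capacity(5);
    for _ in 0..5 {
        out.push((((packed & 0x3F) as u8) + b'0') as char);
        packed >>= 6;
    }
    out
}

#[derive(Debug, PartialEq)]
enum Retry {
    Retryable,
    Permanent,
}

/// Classify by SQLSTATE. The retryable set here is an assumption:
/// 40001 serialization_failure, 40P01 deadlock_detected,
/// 55P03 lock_not_available, and class 08 (connection exceptions).
fn classify(packed: i32) -> Retry {
    let state = unpack_sqlstate(packed);
    match state.as_str() {
        "40001" | "40P01" | "55P03" => Retry::Retryable,
        s if s.starts_with("08") => Retry::Retryable,
        _ => Retry::Permanent,
    }
}
```

Since the codes do not depend on lc_messages, the same classification holds on a server that reports errors in French or any other language.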

SCAL-2 — Explicit pub use in refresh/mod.rs. src/refresh/mod.rs exposes pub use codegen::*; pub use merge::*; pub use orchestrator::*;, promoting every new public symbol automatically. Convert to explicit re-export lists to enforce module boundary discipline without breaking callers. Schema change: No.
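
The difference between a glob and an explicit re-export, in miniature. The module layout mirrors the names above, but both function names are hypothetical.

```rust
mod refresh {
    mod codegen {
        /// Intended to be part of the module's public surface.
        pub fn generate_delta_sql() -> &'static str {
            "SELECT 1"
        }

        /// `pub` only so sibling modules could call it; with
        /// `pub use codegen::*;` it would leak out of `refresh` too.
        #[allow(dead_code)]
        pub fn scratch_helper() -> u32 {
            7
        }
    }

    // Explicit list: only the named symbol crosses the module boundary.
    pub use self::codegen::generate_delta_sql;
}
```

Callers that use `refresh::generate_delta_sql()` keep compiling, while `refresh::scratch_helper()` is no longer reachable from outside the module.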

SCAL-3 — Remove dead #[allow(unused_imports)] in refresh/orchestrator.rs. Every import at src/refresh/orchestrator.rs:6–26 is annotated #[allow(unused_imports)] — a maintenance hazard left over from the ARCH-1B split. Removing an actually-unused import won’t surface a warning, and the imports become silently misleading. Convert each use to a concrete symbol and remove the lint-suppression attribute. Schema change: No.


Ease of Use

| ID | Title | Effort | Priority |
|---|---|---|---|
| UX-1 | Document all v0.27.0 GUCs in CONFIGURATION.md | S | P0 |
| UX-2 | Document v0.15.0–v0.27.0 upgrade notes in UPGRADING.md | M | P0 |
| UX-3 | Document new error variants (HINT/DETAIL) in ERRORS.md | S | P1 |
| UX-4 | Complete SQL_REFERENCE.md for SNAP/PLAN/CLUS/METR functions | S | P1 |
| UX-5 | Add TUI parity for SNAP/PLAN/CLUS/METR functions | M | P1 |
| UX-6 | Ship first-party Grafana dashboard JSON in monitoring/grafana/ | M | P1 |
| UX-7 | Document pg_trickle_dump in BACKUP_AND_RESTORE.md | XS | P1 |
| UX-8 | Add snapshot/PITR walkthrough to GETTING_STARTED.md | S | P1 |
| UX-9 | Document change_buffer_durability + frontier_holdback_* in PRE_DEPLOYMENT.md | XS | P1 |
| UX-10 | Add atomicity-gap warning + concurrent-restore note to BACKUP_AND_RESTORE.md | XS | P1 |
| UX-11 | Add FAQ entries: snapshot how-to, schedule_recommendation_min_samples tuning | XS | P2 |
| UX-12 | Add RELEASE.md checklist item for SNAP/PLAN/CLUS/METR migration script | XS | P2 |
| UX-13 | Add snapshot demo to PLAYGROUND.md | XS | P2 |
| UX-14 | Track CNPG 1.29 compatibility: update cnpg/cluster-example.yaml and CI | S | P1 |
| UX-15 | Open tracking issue for dbt-core 1.11 upgrade path | XS | P1 |
| UX-16 | Add noisy-neighbour example + alert to docs/integrations/multi-tenant.md | XS | P3 |
| UX-17 | Add recommend_schedule tuning recipe to PERFORMANCE_COOKBOOK.md | XS | P3 |
| UX-18 | Annotate dbt-pgtrickle/README.md with compatible extension version range | XS | P3 |

UX-1 — schedule_recommendation_min_samples (default 20), schedule_alert_cooldown_seconds (default 300), metrics_request_timeout_ms (default 5000), change_buffer_durability (unlogged|logged|sync), and several others from Appendix C of the assessment are missing from the CONFIGURATION.md TOC. Add full Property/Default/Range/Context + Tuning Guidance sections for each.

UX-2 — UPGRADING.md stops at 0.13.0 → 0.14.0. Add upgrade notes (breaking changes, migration script highlights, GUC renames) for every release from v0.15.0 through v0.27.0.

UX-3 — New error variants shipped since v0.23.0 (SnapshotAlreadyExists, SnapshotSourceNotFound, SnapshotSchemaVersionMismatch, DiagnosticError, PublicationAlreadyExists, PublicationNotFound, PublicationRebuildFailed, SlaTooSmall, ChangedColsBitmaskFailed) have no HINT/DETAIL entries in ERRORS.md.

UX-4 — Spot-check confirms cluster_worker_summary() and metrics_summary() are referenced from SCALING.md but do not have full sections in SQL_REFERENCE.md. Add complete parameter/return-type/example entries for all eight SNAP/PLAN/CLUS/METR functions.

UX-5 — pgtrickle-tui has no panels for snapshot_stream_table, restore_from_snapshot, list_snapshots, recommend_schedule, schedule_recommendations, cluster_worker_summary, or metrics_summary. Add SNAP/PLAN/CLUS/METR views to the TUI alongside the existing stream-table management screens.

UX-6 — The monitoring/grafana/ directory has no first-party dashboard JSON. The dashboard snippets in docs/integrations/multi-tenant.md are the only artefact. Ship a baseline dashboard covering refresh latency, CDC buffer growth, IVM lock-mode distribution, and per-DB cluster metrics.

UX-7 — src/bin/pg_trickle_dump.rs (458 LOC) is the only out-of-database backup tool. Document usage, flags, and restore procedure in BACKUP_AND_RESTORE.md.

UX-8 — Add a “Snapshot and Point-in-Time Recovery” section to GETTING_STARTED.md with a worked example: take a snapshot, simulate data loss, restore from snapshot, verify.

UX-9 — PRE_DEPLOYMENT.md does not mention change_buffer_durability (unlogged|logged|sync, default unlogged) or the frontier_holdback_lsn_bytes gauge. These are operationally critical: operators setting sync for durability need to know the write-amplification cost before deployment. Schema change: No.

UX-10 — BACKUP_AND_RESTORE.md does not warn about the atomicity gap in snapshot_stream_table / restore_from_snapshot identified in §3.2 (no subxact bracket; crash between CREATE TABLE AS and catalog INSERT leaves an orphan table; mid-restore TRUNCATE + interrupted INSERT leaves the storage table truncated). Add an explicit “Known Limitations” callout until STAB-1 is fixed. Also document that restore_from_snapshot should not be called while a concurrent refresh_stream_table is in flight. Schema change: No.

UX-11 — FAQ.md has no entry for “How do I take a snapshot?” or “How do I tune schedule_recommendation_min_samples?”. Add both with links to BACKUP_AND_RESTORE.md and CONFIGURATION.md. Schema change: No.

UX-12 — The RELEASE.md checklist should include a step to verify that the SNAP/PLAN/CLUS/METR migration script (sql/pg_trickle--0.26.0--0.27.0.sql) passes scripts/check_upgrade_completeness.sh. Add to the release runbook. Schema change: No.

UX-13 — PLAYGROUND.md has no snapshot demo. Add a snapshot_stream_table → data loss simulation → restore_from_snapshot example that works in the playground/docker-compose.yml environment. Schema change: No.

UX-14 — cnpg/cluster-example.yaml targets CNPG 1.28+. CNPG 1.29 ships in 2026-Q2. Refresh the manifest, add a CI job using the updated image, and verify the ImageVolume mount still works under the new CNPG operator. Schema change: No.

UX-15 — dbt-pgtrickle/AGENTS.md pins dbt-core ~=1.10 and notes Python 3.13 only (mashumaro dep can’t build on 3.14). dbt-core 1.11 is in beta. Open a tracking GitHub issue: note the mashumaro blocker, watch the 1.11 release, and add a CI job testing against 1.11 once the dep resolves. Schema change: No.

UX-16 — docs/integrations/multi-tenant.md is excellent but has no worked example of a noisy-neighbour scenario. Add one: show a high-frequency stream table starving others, the Prometheus alert expression, and the remediation step (pg_trickle.per_db_refresh_quota_ms). Schema change: No.

UX-17 — PERFORMANCE_COOKBOOK.md should include a recipe: “Use recommend_schedule to right-size a 100-table dbt project.” Walk through a cold start, initial history collection, and the first auto-recommendation. Link to CONFIGURATION.md for the relevant GUCs. Schema change: No.

UX-18 — dbt-pgtrickle/README.md does not state which extension version it is tested against. Add a compatibility matrix row (e.g. dbt-pgtrickle 0.5.x → pg_trickle ≥0.25.0) so users know what to pin. Schema change: No.


Test Coverage

| ID | Title | Effort | Priority |
|---|---|---|---|
| TEST-1 | E2E: snapshot atomicity under crash (orphan table detection) | M | P0 |
| TEST-2 | E2E: snapshot version mismatch | S | P1 |
| TEST-3 | E2E: predictive planner with N < min_samples | S | P1 |
| TEST-4 | E2E: multi-DB worker fairness under contention | M | P1 |
| TEST-5 | Integration: IVM_DELTA_CACHE bounded over 1000 ALTER QUERY cycles | S | P1 |
| TEST-6 | Integration: classify_spi_error_retryable with lc_messages=fr_FR | S | P0 |
| TEST-7 | Fuzz: WAL decoder | M | P1 |
| TEST-8 | Fuzz: MERGE template generator | M | P1 |
| TEST-9 | Fuzz: snapshot SQL builder | S | P1 |
| TEST-10 | Fuzz: DAG SCC graph shapes | S | P2 |
| TEST-11 | EC-01 deterministic reproducer (replaces flaky test_tpch_q07_*) | M | P0 |
| TEST-12 | E2E: snapshot under concurrent refresh_stream_table | S | P1 |
| TEST-13 | E2E: snapshot → pg_dump → pg_restore round-trip | M | P1 |
| TEST-14 | Promote G17-MDB multi-database soak test from stability-tests.yml to ci.yml | S | P1 |
| TEST-15 | E2E: WAL decoder failure injection (slot missing / wrong plugin) | S | P2 |
| TEST-16 | E2E: statement-level CDC with mixed INSERT/UPDATE/DELETE ordering invariants | S | P2 |
| TEST-17 | E2E: restore_from_snapshot rollback on mid-INSERT constraint violation | S | P2 |
| TEST-18 | Property: cluster_worker_summary consistency under crash (proptest) | S | P2 |
| TEST-19 | Property: predictive planner monotone in history length (proptest) | S | P2 |

Conflicts & Risks

  • STAB-3 requires a schema migration (adding cached_at column). Include in sql/pg_trickle--0.29.0--0.30.0.sql; verify via scripts/check_upgrade_completeness.sh.
  • CORR-1 touches the DVM join delta pipeline — the highest-risk module. Must be gated behind full TPC-H property test suite before merge.
  • PERF-1 requires dshash pgrx bindings that may not yet be fully stabilised in pgrx 0.18; de-risk with a spike before committing to the milestone.
  • SCAL-1 changes observable retry semantics. Roll out behind a pg_trickle.use_sqlstate_classification GUC (default false) initially, flip to true in v0.31.0 after one release of parallel validation.
  • UX-14 (CNPG 1.29) depends on CNPG 1.29 being available; if the release slips past the v0.30.0 window, defer to v1.0.0 but keep the tracking issue open.
  • UX-15 (dbt 1.11) is purely a tracking concern — no code changes are required in this release.

Implementation Phases

| Phase | Description | Duration |
|---|---|---|
| Phase 1 | P0 correctness + stability: CORR-1, STAB-1, STAB-4, SCAL-1 | Days 1–6 |
| Phase 2 | P0 docs: UX-1, UX-2, TEST-6, TEST-11 | Days 6–9 |
| Phase 3 | P1 stability + safety: STAB-2, STAB-3, STAB-5, STAB-6, CORR-2, CORR-3 | Days 9–14 |
| Phase 4 | P1 architecture: SCAL-2, PERF-1, PERF-2 | Days 14–19 |
| Phase 5 | P1/P2 test coverage: TEST-1 through TEST-19, PERF-3 | Days 19–27 |
| Phase 6 | P1/P2/P3 docs & UX: UX-3 through UX-18, TUI parity | Days 27–34 |

v0.30.0 total: ~7–8 weeks (correctness-critical path ~10 days; documentation and test coverage ~14 days; architecture improvements ~8 days; new test targets ~8 days; low-priority doc polish ~4 days).

Exit criteria:

- [x] CORR-1: test_tpch_q07_ec01b_combined_delete passes reliably over 50 runs; IMMEDIATE-mode property tests show zero phantom drift
- [x] CORR-2: Parser emits NOTICE when SubLink inside CASE/COALESCE forces FULL downgrade
- [x] CORR-3: restore_from_snapshot uses explicit column list; works on PG 18.0 and 18.x
- [x] STAB-1: Snapshot under crash leaves no orphan tables (TEST-1); restore under concurrent refresh is safe
- [x] STAB-2: IVM_DELTA_CACHE size bounded by template_cache_max_entries (TEST-5)
- [x] STAB-3: L2 template-cache catalog table purged by scheduler tick; cached_at column present in migration
- [x] STAB-4: snapshot_stream_table emits WARNING when catalog INSERT fails
- [x] STAB-5: wal_decoder.rs:307 uses c"test_decoding"; just lint clean
- [x] STAB-6: IVM cache cleared on subxact abort via XactCallback
- [x] SCAL-1: Retry classification by SQLSTATE; lc_messages=fr_FR test passes (TEST-6)
- [ ] PERF-3: All seven missing benchmarks present and passing Criterion baseline gate (deferred to v0.31.0)
- [x] UX-1: All Appendix C GUCs documented in CONFIGURATION.md
- [x] UX-2: UPGRADING.md covers v0.15.0 → v0.27.0
- [ ] UX-3: All new error variants have HINT/DETAIL in ERRORS.md (deferred to v0.31.0)
- [ ] UX-4: All eight SNAP/PLAN/CLUS/METR functions have full SQL_REFERENCE entries (deferred to v0.31.0)
- [ ] UX-5: TUI shows SNAP/PLAN/CLUS/METR panels (deferred to v0.31.0)
- [ ] UX-6: monitoring/grafana/pg_trickle.json ships and renders in Grafana 11+ (deferred to v0.31.0)
- [ ] UX-9: PRE_DEPLOYMENT.md documents change_buffer_durability and frontier_holdback_lsn_bytes (deferred to v0.31.0)
- [ ] UX-10: BACKUP_AND_RESTORE.md Known Limitations section added; concurrent-restore warning present (deferred to v0.31.0)
- [ ] UX-14: cnpg/cluster-example.yaml updated to CNPG 1.29; CI job green (deferred to v0.31.0)
- [ ] UX-15: dbt-core 1.11 tracking issue opened with mashumaro blocker documented (deferred to v0.31.0)
- [ ] TEST-7/8/9: New fuzz targets run in CI; no crashes after 24 h (deferred to v0.31.0)
- [ ] TEST-12: Snapshot under concurrent refresh test passes (deferred to v0.31.0)
- [ ] TEST-13: pg_dump → pg_restore round-trip test passes for snapshot tables (deferred to v0.31.0)
- [ ] TEST-14: G17-MDB multi-database soak test runs in ci.yml on every push to main (deferred to v0.31.0)
- [x] SCAL-3: Dead #[allow(unused_imports)] removed from refresh/orchestrator.rs; just lint clean
- [x] Extension upgrade path tested (0.29.0 → 0.30.0)
- [x] just check-version-sync passes