Plain-language companion: v0.30.0.md
v0.30.0 — Pre-GA Correctness & Stability Sprint
Status: ✅ Released (v0.30.0). Derived from plans/PLAN_OVERALL_ASSESSMENT_3.md §3, §4, §7, §8. This release must land before v1.0.0 GA. Its purpose is to close every P0 and P1 gap identified in the v0.27.0 assessment so that the stable release inherits a clean correctness baseline.
Release Theme
v0.30.0 is the quality gate between the feature-rich v0.28–v0.29 arc and the v1.0 stable release. It fixes the remaining correctness defects that could produce silently wrong answers (EC-01 phantom drift, snapshot non-atomicity), eliminates the operational failure modes that could surprise production operators (unbounded IVM/template caches, text-based SPI error classification, snapshot partial-restore data loss), closes the documentation gaps that block self-service operation (upgrade notes, GUC reference, ERRORS guide), and hardens the test suite with the fuzz and E2E coverage that the new SNAP/PLAN/CLUS/METR functionality currently lacks. No new user-visible SQL API is added. The release is a prerequisite for v1.0.0; any item not landed here will block GA.
Correctness
| ID | Title | Effort | Priority |
|---|---|---|---|
| CORR-1 | Complete EC-01 phantom-row convergence | M | P0 |
| CORR-2 | Generalise SubLink detection to CASE/COALESCE/FuncCall | S | P1 |
| CORR-3 | Replace SELECT * EXCEPT with explicit column list in restore | S | P1 |
CORR-1 — Complete EC-01 phantom-row convergence.
The v0.24.0 hash fix is necessary but not sufficient: is_deduplicated:
false at src/dvm/operators/join.rs:657–668 still forces the MERGE to
aggregate by row-id, and the conditional wiring that invokes
src/refresh/phd1.rs cross-cycle cleanup is incomplete. Option A:
wire every refresh cycle through unconditional PH-D1 cleanup with a small
batch size. Option B: re-derive the Part-2 row-id from the retained
base-table key snapshot so that is_deduplicated: true can be set for
INNER joins. Verify with the deterministic reproducer from
test_tpch_q07_ec01b_combined_delete and the IMMEDIATE-mode property
tests. Schema change: No. Dependencies: v0.24.0 PH-D1 skeleton.
CORR-2 — SubLink detection for CASE/COALESCE/FuncCall.
node_tree_contains_sublink only recurses into T_BoolExpr arguments;
SubLinks inside T_CaseExpr, T_CoalesceExpr, or function-argument lists
silently downgrade the query to FULL refresh without a user-visible
message. Generalise the walker or emit a NOTICE on forced downgrade.
Schema change: No. Dependencies: None.
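The generalised walker can be sketched on a toy expression tree. The Expr enum and contains_sublink below are hypothetical stand-ins for the PostgreSQL node types and for node_tree_contains_sublink; the real walker operates on parse-tree nodes, so treat this as an illustration of the recursion shape only.

```rust
/// Hypothetical, simplified stand-in for the PostgreSQL expression node types.
#[derive(Debug)]
enum Expr {
    Const(i64),
    SubLink,             // stands in for T_SubLink
    BoolExpr(Vec<Expr>), // AND / OR / NOT arguments
    CaseExpr {
        whens: Vec<(Expr, Expr)>, // (condition, result) pairs
        default: Option<Box<Expr>>,
    },
    CoalesceExpr(Vec<Expr>),
    FuncCall { name: String, args: Vec<Expr> },
}

/// Returns true if a SubLink appears anywhere in the tree, recursing into
/// every container node rather than only into BoolExpr arguments.
fn contains_sublink(expr: &Expr) -> bool {
    match expr {
        Expr::Const(_) => false,
        Expr::SubLink => true,
        Expr::BoolExpr(args) | Expr::CoalesceExpr(args) => {
            args.iter().any(contains_sublink)
        }
        Expr::FuncCall { args, .. } => args.iter().any(contains_sublink),
        Expr::CaseExpr { whens, default } => {
            whens
                .iter()
                .any(|(cond, result)| contains_sublink(cond) || contains_sublink(result))
                || default.as_deref().map_or(false, contains_sublink)
        }
    }
}
```

With a walker of this shape, a SubLink nested in a CASE result or a function argument is detected explicitly, so the FULL downgrade can be reported instead of happening silently.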
CORR-3 — Explicit column list for restore.
restore_from_snapshot_impl unconditionally emits SELECT * EXCEPT(…),
which is only available on PG 18 with specific minor-version patches.
Replace with a pg_attribute catalog walk that builds the explicit column
list, eliminating PG-minor sensitivity entirely. Schema change: No.
Dependencies: None.
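As a rough sketch of the replacement, the column list can be assembled from the rows a pg_attribute walk returns. The Attribute struct, the function names, and the excluded column name used in the usage note are all assumptions made for illustration; the SPI query that produces the rows is not shown.

```rust
/// One row of a hypothetical pg_attribute walk: the column name plus the
/// attisdropped flag.
struct Attribute<'a> {
    name: &'a str,
    is_dropped: bool,
}

/// Quote an identifier by wrapping it in double quotes and doubling any
/// embedded quotes. Always quoting is safe, if more verbose than necessary.
fn quote_ident(ident: &str) -> String {
    format!("\"{}\"", ident.replace('"', "\"\""))
}

/// Build an explicit, comma-separated column list, skipping dropped columns
/// and any names in `excluded` (the columns SELECT * EXCEPT would have removed).
fn explicit_column_list(attrs: &[Attribute], excluded: &[&str]) -> String {
    attrs
        .iter()
        .filter(|a| !a.is_dropped && !excluded.contains(&a.name))
        .map(|a| quote_ident(a.name))
        .collect::<Vec<_>>()
        .join(", ")
}
```

For example, calling explicit_column_list with columns id and __pgt_row_id (a made-up internal column) and excluding the latter yields a list containing only "id", which can be spliced into a plain SELECT that any PostgreSQL version accepts.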
Stability
| ID | Title | Effort | Priority |
|---|---|---|---|
| STAB-1 | Wrap snapshot/restore in SubTransaction RAII + exclusive lock | M | P0 |
| STAB-2 | Bound IVM_DELTA_CACHE via clock-style eviction | S | P1 |
| STAB-3 | Age-based purge for L2 template-cache catalog table | S | P1 |
| STAB-4 | Surface snapshot catalog INSERT failure as WARNING | XS | P0 |
| STAB-5 | Replace wal_decoder.rs .expect() with c"test_decoding" | XS | P1 |
| STAB-6 | Clear IVM thread-local cache via XactCallback on subxact abort | S | P1 |
STAB-1 — Snapshot/restore atomicity.
snapshot_stream_table_impl and restore_from_snapshot_impl
(src/api/snapshot.rs:90–260) run independent SPI statements with no
subxact bracket. A backend crash between the CREATE TABLE AS and the
catalog INSERT leaves an orphan table; a mid-restore statement failure
leaves the storage table truncated. Wrap the entire operation in the same
SubTransaction RAII helper used by src/scheduler.rs:250–308. For
restore, open with LOCK TABLE … IN ACCESS EXCLUSIVE MODE before the
TRUNCATE. Propagate snapshot-version-check failure as typed
SnapshotSchemaVersionMismatch instead of silently treating None as
compatible. Schema change: No. Dependencies: scheduler SubTransaction
helper (shipped v0.27.0).
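The RAII bracket can be modelled without PostgreSQL. In this sketch the SubTransaction type is a hypothetical stand-in for the scheduler helper, and the log of strings stands in for the SPI statements; the point is only that an early return drops the guard and triggers the rollback.

```rust
use std::cell::RefCell;

/// Hypothetical stand-in for the scheduler's SubTransaction helper: the
/// guard rolls back on drop unless `commit` was called first.
struct SubTransaction<'a> {
    log: &'a RefCell<Vec<&'static str>>,
    committed: bool,
}

impl<'a> SubTransaction<'a> {
    fn begin(log: &'a RefCell<Vec<&'static str>>) -> Self {
        log.borrow_mut().push("BEGIN SUBXACT");
        SubTransaction { log, committed: false }
    }

    fn commit(mut self) {
        self.committed = true;
        self.log.borrow_mut().push("RELEASE SUBXACT");
    }
}

impl Drop for SubTransaction<'_> {
    fn drop(&mut self) {
        if !self.committed {
            self.log.borrow_mut().push("ROLLBACK SUBXACT");
        }
    }
}

/// Runs the restore steps inside one subtransaction. An early return drops
/// the guard and rolls everything back, including the TRUNCATE.
fn restore(log: &RefCell<Vec<&'static str>>, fail_insert: bool) -> Result<(), String> {
    let guard = SubTransaction::begin(log);
    log.borrow_mut().push("LOCK TABLE ... IN ACCESS EXCLUSIVE MODE");
    log.borrow_mut().push("TRUNCATE");
    if fail_insert {
        return Err("INSERT failed".to_string());
    }
    log.borrow_mut().push("INSERT");
    guard.commit();
    Ok(())
}
```

Because the rollback lives in Drop, every failure path is covered without per-statement cleanup code, which is the property the mid-restore failure case needs.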
STAB-2 — Bound IVM_DELTA_CACHE.
The thread-local IVM_DELTA_CACHE at src/ivm.rs:107–143 has no eviction.
Implement a clock-style eviction that respects pg_trickle.template_cache_max_entries
— the same GUC already used by the L1 delta-template cache.
Schema change: No. Dependencies: None.
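A clock (second-chance) policy of the kind described might look like the following. This is a minimal sketch, not the pg_trickle implementation: the ClockCache type, its u32 key (standing in for an OID), and the choice to insert new entries with the referenced bit cleared are all assumptions.

```rust
use std::collections::HashMap;

/// Minimal clock (second-chance) cache keyed by a u32 identifier.
struct ClockCache<V> {
    capacity: usize,
    slots: Vec<(u32, V, bool)>, // (key, value, referenced bit)
    index: HashMap<u32, usize>, // key -> slot position
    hand: usize,
}

impl<V> ClockCache<V> {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be positive");
        ClockCache { capacity, slots: Vec::new(), index: HashMap::new(), hand: 0 }
    }

    /// A hit sets the referenced bit so the entry survives the next sweep.
    fn get(&mut self, key: u32) -> Option<&V> {
        let pos = *self.index.get(&key)?;
        self.slots[pos].2 = true;
        Some(&self.slots[pos].1)
    }

    fn insert(&mut self, key: u32, value: V) {
        if let Some(&pos) = self.index.get(&key) {
            self.slots[pos] = (key, value, true);
            return;
        }
        if self.slots.len() < self.capacity {
            self.index.insert(key, self.slots.len());
            self.slots.push((key, value, false));
            return;
        }
        // Sweep: clear referenced bits until an unreferenced victim is found.
        while self.slots[self.hand].2 {
            self.slots[self.hand].2 = false;
            self.hand = (self.hand + 1) % self.capacity;
        }
        let victim = self.hand;
        self.index.remove(&self.slots[victim].0);
        self.slots[victim] = (key, value, false);
        self.index.insert(key, victim);
        self.hand = (self.hand + 1) % self.capacity;
    }

    fn len(&self) -> usize {
        self.slots.len()
    }
}
```

The capacity argument would be fed from pg_trickle.template_cache_max_entries. Clock eviction approximates LRU with one bit per entry and no list reordering on hits, which keeps the hot lookup path cheap.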
STAB-3 — L2 template-cache age purge.
src/template_cache.rs writes one row per stream table but never purges
entries left stale by ALTER QUERY without a DROP, or by source-OID renumbering. Add a
cached_at TIMESTAMPTZ column (migration) and a lightweight batched DELETE in
the scheduler’s launcher tick, age-bounded by a new
pg_trickle.template_cache_max_age_hours GUC (default 168 h = 7 days).
Schema change: Yes — ALTER TABLE pgtrickle.pgt_template_cache ADD COLUMN
cached_at TIMESTAMPTZ NOT NULL DEFAULT now().
STAB-4 — Snapshot catalog INSERT failure warning.
snapshot_stream_table_impl discards the Result of the catalog INSERT INTO
pgtrickle.pgt_snapshots with a // best-effort comment at
src/api/snapshot.rs:152–167. When the INSERT fails silently, the function still
returns the snapshot path, but list_snapshots() never lists that snapshot. Promote the
failure to a pgrx WARNING message. Schema change: No.
STAB-5 — wal_decoder.rs .expect() removal.
CString::new("test_decoding").expect(…) at src/wal_decoder.rs:307 is the
last production .expect() outside the sound unreachable!() post-report()
calls. Replace with c"test_decoding" (compile-time CStr literal; Rust 1.77+
stable) to eliminate the unreachable runtime path entirely. Schema change: No.
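The difference between the two constructions is small enough to show side by side. The function names here are illustrative only, and the c"…" form additionally assumes the crate is built with the 2021 edition or later.

```rust
use std::ffi::{CStr, CString};

/// Runtime construction: CString::new can fail if the input contains an
/// interior NUL byte, which is why the original code needed `.expect(...)`.
fn plugin_name_runtime() -> CString {
    CString::new("test_decoding").expect("no interior NUL")
}

/// Compile-time construction: the literal is validated by the compiler and
/// NUL-terminated, so there is no failure path left at runtime.
fn plugin_name_static() -> &'static CStr {
    c"test_decoding"
}
```

Both produce the same bytes; the literal form also avoids a heap allocation on every call.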
STAB-6 — IVM cache XactCallback.
If an IVM trigger function fails mid-statement, the
__pgt_newtable_<oid> / __pgt_oldtable_<oid> temp tables are cleaned up by
PG subxact abort, but the thread-local IVM_DELTA_CACHE is not cleared. A stale
entry can survive a failed apply and be reused in the next statement. Register a
SubXactCallback (the subtransaction counterpart of XactCallback) that calls
invalidate_ivm_delta_cache on SUBXACT_EVENT_ABORT_SUB. Schema change: No.
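The invalidation hook can be modelled outside PostgreSQL. In this sketch the event enum is a local stand-in for PostgreSQL's subtransaction events, and on_subxact_event, cache_put, and cache_len are hypothetical helper names; only IVM_DELTA_CACHE and invalidate_ivm_delta_cache are names taken from the description above, and the cache's value type is assumed.

```rust
use std::cell::RefCell;
use std::collections::HashMap;

thread_local! {
    /// Stand-in for IVM_DELTA_CACHE: maps a table OID to cached delta SQL.
    static IVM_DELTA_CACHE: RefCell<HashMap<u32, String>> = RefCell::new(HashMap::new());
}

/// Local mirror of the subtransaction events relevant here.
#[derive(Clone, Copy, PartialEq, Debug)]
enum SubXactEvent {
    StartSub,
    CommitSub,
    AbortSub,
}

fn invalidate_ivm_delta_cache() {
    IVM_DELTA_CACHE.with(|cache| cache.borrow_mut().clear());
}

/// Callback body: only a subtransaction abort invalidates the cache.
fn on_subxact_event(event: SubXactEvent) {
    if event == SubXactEvent::AbortSub {
        invalidate_ivm_delta_cache();
    }
}

fn cache_put(oid: u32, sql: &str) {
    IVM_DELTA_CACHE.with(|cache| cache.borrow_mut().insert(oid, sql.to_string()));
}

fn cache_len() -> usize {
    IVM_DELTA_CACHE.with(|cache| cache.borrow().len())
}
```

Clearing the whole cache on abort is deliberately coarse: aborts are rare, and a full clear avoids having to track which entries the failed statement touched.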
Performance
| ID | Title | Effort | Priority |
|---|---|---|---|
| PERF-1 | Implement L0 shared-shmem dshash template cache | L | P1 |
| PERF-2 | Bound parser memory via max_parse_nodes GUC + XactCallback | M | P1 |
| PERF-3 | Add missing benchmark suite (SNAP, PLAN, CLUS, IVM apply, WAL decoder) | M | P1 |
| PERF-4 | Reduce catalog hot-path in metrics_summary() / cluster_worker_summary() | M | P2 |
PERF-1 — L0 dshash shared-memory template cache.
src/shmem.rs:680–710 wires L0_POPULATED_VERSION as a signal that the L2
catalog cache was populated, but the actual dshash data structure that would
store delta SQL in shared memory is not implemented. Build the dshash store
so that any backend can satisfy a cache lookup from shared memory rather than
paying the ~1 ms L2 catalog SELECT on every cold start. Expected win: erase
the remaining cold-backend latency tail in high-backend-count deployments.
Schema change: No. Dependencies: pg_dsm_create and dshash_create
pgrx bindings (available in pgrx 0.18).
PERF-2 — Parser memory bounds.
pg_trickle.max_parse_depth (G13-SD) bounds recursion depth but not total
node count. A large IN (1, …, 1_000_000) list still allocates unboundedly in
PARSE_ADVISORY_WARNINGS and cte_ctx.registry. Add a
pg_trickle.max_parse_nodes GUC (default 100 000) and reject queries that
exceed it with QueryTooComplex. Clear thread-locals at end-of-statement via
XactCallback. Schema change: No (new GUC only).
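A node budget of this kind is essentially a counter that every visitor step charges against. The sketch below assumes a NodeBudget type and a limit field on the error; only the QueryTooComplex name and the 100 000 default come from the item above.

```rust
/// Error returned when a query exceeds the node budget. The variant name is
/// from the plan; the payload field is an assumption.
#[derive(Debug, PartialEq)]
enum ParseError {
    QueryTooComplex { limit: usize },
}

/// Counts every visited parse node against a fixed budget.
struct NodeBudget {
    limit: usize,
    used: usize,
}

impl NodeBudget {
    fn new(limit: usize) -> Self {
        NodeBudget { limit, used: 0 }
    }

    /// Call once per node visited; fails as soon as the budget is exceeded.
    fn charge(&mut self) -> Result<(), ParseError> {
        self.used += 1;
        if self.used > self.limit {
            return Err(ParseError::QueryTooComplex { limit: self.limit });
        }
        Ok(())
    }
}

/// Walks an IN-list of `n` constants, charging one node per element.
fn walk_in_list(n: usize, budget: &mut NodeBudget) -> Result<usize, ParseError> {
    for _ in 0..n {
        budget.charge()?;
    }
    Ok(budget.used)
}
```

Because the check happens per node, a million-element IN list is rejected after 100 001 charges rather than after the whole list has been allocated.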
PERF-3 — Missing benchmarks.
Seven performance-critical paths have no Criterion coverage: snapshot/restore
round-trip, predictive planner recommend_schedule at varying history lengths,
cluster_worker_summary() at scale, L2 template-cache hit/miss latency, IVM
apply path, WAL decoder poll loop, multi-database fairness. Add a benchmark
file or extend benches/refresh_bench.rs to cover all seven.
Schema change: No.
PERF-4 — Reduce catalog hot-path in metrics_summary() and cluster_worker_summary().
Both new v0.27.0 functions re-query pgtrickle.pgt_stream_tables and
pg_stat_activity on every Prometheus scrape (default 5 s interval). At
200+ stream tables this adds a full catalog scan per database every 5 s, on
top of the scheduler’s own reads. Introduce a short-lived shmem-backed
snapshot for these read paths that both surfaces share, refreshed once per
scheduler tick. Controlled by pg_trickle.metrics_catalog_snapshot_ttl_ms
GUC (default 4000 ms). Schema change: No.
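The TTL behaviour can be illustrated with a process-local cache. This is a simplification: the proposal is for a shmem-backed snapshot shared across backends, whereas the hypothetical TtlSnapshot type below lives in one process and takes the current time as a parameter so that it is easy to test.

```rust
use std::time::{Duration, Instant};

/// Caches one computed value and reuses it until the TTL expires.
struct TtlSnapshot<T> {
    ttl: Duration,
    cached: Option<(Instant, T)>,
}

impl<T: Clone> TtlSnapshot<T> {
    fn new(ttl: Duration) -> Self {
        TtlSnapshot { ttl, cached: None }
    }

    /// Returns the cached value if it is younger than the TTL at `now`;
    /// otherwise calls `load` (the catalog scan) and stores the result.
    fn get_or_refresh(&mut self, now: Instant, load: impl FnOnce() -> T) -> T {
        if let Some((taken_at, value)) = &self.cached {
            if now.duration_since(*taken_at) < self.ttl {
                return value.clone();
            }
        }
        let value = load();
        self.cached = Some((now, value.clone()));
        value
    }
}
```

With a 4000 ms TTL and a 5 s scrape interval, each scrape still sees data at most one TTL old, while back-to-back calls from metrics_summary() and cluster_worker_summary() within the same window share a single scan.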
Scalability
| ID | Title | Effort | Priority |
|---|---|---|---|
| SCAL-1 | Replace text-based SPI error classification with SQLSTATE codes | M | P0 |
| SCAL-2 | Replace refresh::* blanket re-exports with explicit pub use | XS | P1 |
| SCAL-3 | Remove dead #[allow(unused_imports)] shims in refresh/orchestrator.rs | XS | P2 |
SCAL-1 — SQLSTATE-based error classification.
classify_spi_error_retryable at src/error.rs:213–260 matches English text
fragments. On a PostgreSQL build with non-English lc_messages, every pattern
silently breaks and every SPI error becomes retryable. Push SQLSTATE through
pgrx (pg_sys::ErrorData.sqlerrcode) and classify by 5-character code.
Schema change: No. Dependencies: pgrx 0.18 ErrorData access.
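PostgreSQL stores sqlerrcode as a packed integer (the MAKE_SQLSTATE encoding: five 6-bit characters, least-significant first), so classification needs a decode step. The sketch below shows that decode in plain Rust. The set of codes treated as retryable is an assumption chosen for illustration, not the project's actual policy, and the function names are hypothetical.

```rust
/// Decode a packed sqlerrcode into its 5-character SQLSTATE string.
fn unpack_sqlstate(code: i32) -> String {
    (0..5u32)
        .map(|i| ((((code >> (6 * i)) & 0x3F) as u8) + b'0') as char)
        .collect()
}

/// Inverse of `unpack_sqlstate`, mirroring the MAKE_SQLSTATE macro.
fn pack_sqlstate(state: &str) -> i32 {
    state
        .bytes()
        .enumerate()
        .map(|(i, b)| ((b.wrapping_sub(b'0') & 0x3F) as i32) << (6 * i))
        .sum()
}

/// Assumed retry policy: serialization failure (40001), deadlock (40P01),
/// lock not available (55P03), and connection exceptions (class 08).
fn is_retryable(code: i32) -> bool {
    let state = unpack_sqlstate(code);
    matches!(state.as_str(), "40001" | "40P01" | "55P03") || state.starts_with("08")
}
```

Since SQLSTATE codes do not depend on lc_messages, a classifier of this shape behaves identically on localized builds.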
SCAL-2 — Explicit pub use in refresh/mod.rs.
src/refresh/mod.rs exposes pub use codegen::*; pub use merge::*; pub use
orchestrator::*;, promoting every new public symbol automatically. Convert to
explicit re-export lists to enforce module boundary discipline without breaking
callers. Schema change: No.
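The effect of the change is easiest to see in miniature. The module layout follows the description above, but the function names are made up for the example.

```rust
mod refresh {
    mod codegen {
        pub fn build_delta_sql() -> &'static str {
            "delta"
        }

        // Public inside the crate's module tree, but not meant for callers.
        pub fn internal_helper() -> &'static str {
            "helper"
        }
    }

    // Explicit list: only build_delta_sql crosses the module boundary.
    // A glob (`pub use self::codegen::*;`) would also export internal_helper.
    pub use self::codegen::build_delta_sql;
}
```

Callers keep writing refresh::build_delta_sql(), while refresh::internal_helper() no longer resolves, so a newly added pub item stays private to the module until it is deliberately listed.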
SCAL-3 — Remove dead #[allow(unused_imports)] in refresh/orchestrator.rs.
Every import at src/refresh/orchestrator.rs:6–26 is annotated
#[allow(unused_imports)], a maintenance hazard left over from the ARCH-1B
split. With the lint suppressed, a genuinely unused import never produces a
warning, so stale imports accumulate and mislead readers. Convert each use to a
concrete symbol and remove the lint-suppression attribute. Schema change: No.
Ease of Use
| ID | Title | Effort | Priority |
|---|---|---|---|
| UX-1 | Document all v0.27.0 GUCs in CONFIGURATION.md | S | P0 |
| UX-2 | Document v0.15.0–v0.27.0 upgrade notes in UPGRADING.md | M | P0 |
| UX-3 | Document new error variants (HINT/DETAIL) in ERRORS.md | S | P1 |
| UX-4 | Complete SQL_REFERENCE.md for SNAP/PLAN/CLUS/METR functions | S | P1 |
| UX-5 | Add TUI parity for SNAP/PLAN/CLUS/METR functions | M | P1 |
| UX-6 | Ship first-party Grafana dashboard JSON in monitoring/grafana/ | M | P1 |
| UX-7 | Document pg_trickle_dump in BACKUP_AND_RESTORE.md | XS | P1 |
| UX-8 | Add snapshot/PITR walkthrough to GETTING_STARTED.md | S | P1 |
| UX-9 | Document change_buffer_durability + frontier_holdback_* in PRE_DEPLOYMENT.md | XS | P1 |
| UX-10 | Add atomicity-gap warning + concurrent-restore note to BACKUP_AND_RESTORE.md | XS | P1 |
| UX-11 | Add FAQ entries: snapshot how-to, schedule_recommendation_min_samples tuning | XS | P2 |
| UX-12 | Add RELEASE.md checklist item for SNAP/PLAN/CLUS/METR migration script | XS | P2 |
| UX-13 | Add snapshot demo to PLAYGROUND.md | XS | P2 |
| UX-14 | Track CNPG 1.29 compatibility: update cnpg/cluster-example.yaml and CI | S | P1 |
| UX-15 | Open tracking issue for dbt-core 1.11 upgrade path | XS | P1 |
| UX-16 | Add noisy-neighbour example + alert to docs/integrations/multi-tenant.md | XS | P3 |
| UX-17 | Add recommend_schedule tuning recipe to PERFORMANCE_COOKBOOK.md | XS | P3 |
| UX-18 | Annotate dbt-pgtrickle/README.md with compatible extension version range | XS | P3 |
UX-1 — schedule_recommendation_min_samples (default 20),
schedule_alert_cooldown_seconds (default 300), metrics_request_timeout_ms
(default 5000), change_buffer_durability (unlogged|logged|sync), and
several others from Appendix C of the assessment are missing from the
CONFIGURATION.md TOC. Add full Property/Default/Range/Context + Tuning
Guidance sections for each.
UX-2 — UPGRADING.md stops at 0.13.0 → 0.14.0. Add upgrade notes
(breaking changes, migration script highlights, GUC renames) for every
release from v0.15.0 through v0.27.0.
UX-3 — New error variants shipped since v0.23.0 (SnapshotAlreadyExists,
SnapshotSourceNotFound, SnapshotSchemaVersionMismatch, DiagnosticError,
PublicationAlreadyExists, PublicationNotFound, PublicationRebuildFailed,
SlaTooSmall, ChangedColsBitmaskFailed) have no HINT/DETAIL entries in
ERRORS.md.
UX-4 — Spot-check confirms cluster_worker_summary() and
metrics_summary() are referenced from SCALING.md but do not have full
sections in SQL_REFERENCE.md. Add complete parameter/return-type/example
entries for all eight SNAP/PLAN/CLUS/METR functions.
UX-5 — pgtrickle-tui has no panels for snapshot_stream_table,
restore_from_snapshot, list_snapshots, recommend_schedule,
schedule_recommendations, cluster_worker_summary, metrics_summary.
Add SNAP/PLAN/CLUS/METR views to the TUI alongside the existing stream-table
management screens.
UX-6 — The monitoring/grafana/ directory has no first-party dashboard
JSON. The dashboard snippets in docs/integrations/multi-tenant.md are the
only artefact. Ship a baseline dashboard covering refresh latency, CDC buffer
growth, IVM lock-mode distribution, and per-DB cluster metrics.
UX-7 — src/bin/pg_trickle_dump.rs (458 LOC) is the only
out-of-database backup tool. Document usage, flags, and restore procedure
in BACKUP_AND_RESTORE.md.
UX-8 — Add a “Snapshot and Point-in-Time Recovery” section to
GETTING_STARTED.md with a worked example: take a snapshot, simulate data
loss, restore from snapshot, verify.
UX-9 — PRE_DEPLOYMENT.md does not mention change_buffer_durability
(unlogged|logged|sync, default unlogged) or the frontier_holdback_lsn_bytes
gauge. These are operationally critical: operators setting sync for durability
need to know the write-amplification cost before deployment.
Schema change: No.
UX-10 — BACKUP_AND_RESTORE.md does not warn about the atomicity gap
in snapshot_stream_table / restore_from_snapshot identified in §3.2 (no
subxact bracket; crash between CREATE TABLE AS and catalog INSERT leaves
an orphan table; mid-restore TRUNCATE + interrupted INSERT leaves
the storage table truncated). Add an explicit “Known Limitations” callout
until STAB-1 is fixed. Also document that restore_from_snapshot should not
be called while a concurrent refresh_stream_table is in flight.
Schema change: No.
UX-11 — FAQ.md has no entry for “How do I take a snapshot?” or
“How do I tune schedule_recommendation_min_samples?”. Add both with links
to BACKUP_AND_RESTORE.md and CONFIGURATION.md.
Schema change: No.
UX-12 — RELEASE.md checklist should include a step to verify the
SNAP/PLAN/CLUS/METR migration script
(sql/pg_trickle--0.26.0--0.27.0.sql) passes
scripts/check_upgrade_completeness.sh. Add to the release runbook.
Schema change: No.
UX-13 — PLAYGROUND.md has no snapshot demo. Add a snapshot_stream_table
→ data loss simulation → restore_from_snapshot example that works in the
playground/docker-compose.yml environment.
Schema change: No.
UX-14 — cnpg/cluster-example.yaml targets CNPG 1.28+. CNPG 1.29 ships
in 2026-Q2. Refresh the manifest, add a CI job using the updated image, and
verify the ImageVolume mount still works under the new CNPG operator.
Schema change: No.
UX-15 — dbt-pgtrickle/AGENTS.md pins dbt-core ~=1.10 and notes
Python 3.13 only (mashumaro dep can’t build on 3.14). dbt-core 1.11 is in
beta. Open a tracking GitHub issue: note the mashumaro blocker, watch the
1.11 release, and add a CI job testing against 1.11 once the dep resolves.
Schema change: No.
UX-16 — docs/integrations/multi-tenant.md is excellent but has no worked
example of a noisy-neighbour scenario. Add one: show a high-frequency stream
table starving others, the Prometheus alert expression, and the remediation
step (pg_trickle.per_db_refresh_quota_ms). Schema change: No.
UX-17 — PERFORMANCE_COOKBOOK.md should include a recipe:
“Use recommend_schedule to right-size a 100-table dbt project.” Walk
through a cold-start, initial history collection, and the first
auto-recommendation. Link to CONFIGURATION.md for the relevant GUCs.
Schema change: No.
UX-18 — dbt-pgtrickle/README.md does not state which extension version
it is tested against. Add a compatibility matrix row (e.g.
dbt-pgtrickle 0.5.x → pg_trickle ≥0.25.0) so users know what to pin.
Schema change: No.
Test Coverage
| ID | Title | Effort | Priority |
|---|---|---|---|
| TEST-1 | E2E: snapshot atomicity under crash (orphan table detection) | M | P0 |
| TEST-2 | E2E: snapshot version mismatch | S | P1 |
| TEST-3 | E2E: predictive planner with N < min_samples | S | P1 |
| TEST-4 | E2E: multi-DB worker fairness under contention | M | P1 |
| TEST-5 | Integration: IVM_DELTA_CACHE bounded over 1000 ALTER QUERY cycles | S | P1 |
| TEST-6 | Integration: classify_spi_error_retryable with lc_messages=fr_FR | S | P0 |
| TEST-7 | Fuzz: WAL decoder | M | P1 |
| TEST-8 | Fuzz: MERGE template generator | M | P1 |
| TEST-9 | Fuzz: snapshot SQL builder | S | P1 |
| TEST-10 | Fuzz: DAG SCC graph shapes | S | P2 |
| TEST-11 | EC-01 deterministic reproducer (replaces flaky test_tpch_q07_*) | M | P0 |
| TEST-12 | E2E: snapshot under concurrent refresh_stream_table | S | P1 |
| TEST-13 | E2E: snapshot → pg_dump → pg_restore round-trip | M | P1 |
| TEST-14 | Promote G17-MDB multi-database soak test from stability-tests.yml to ci.yml | S | P1 |
| TEST-15 | E2E: WAL decoder failure injection (slot missing / wrong plugin) | S | P2 |
| TEST-16 | E2E: statement-level CDC with mixed INSERT/UPDATE/DELETE ordering invariants | S | P2 |
| TEST-17 | E2E: restore_from_snapshot rollback on mid-INSERT constraint violation | S | P2 |
| TEST-18 | Property: cluster_worker_summary consistency under crash (proptest) | S | P2 |
| TEST-19 | Property: predictive planner monotone in history length (proptest) | S | P2 |
Conflicts & Risks
- STAB-3 requires a schema migration (adding the cached_at column). Include it in sql/pg_trickle--0.29.0--0.30.0.sql; verify via scripts/check_upgrade_completeness.sh.
- CORR-1 touches the DVM join delta pipeline, the highest-risk module. It must be gated behind the full TPC-H property test suite before merge.
- PERF-1 requires dshash pgrx bindings that may not yet be fully stabilised in pgrx 0.18; de-risk with a spike before committing to the milestone.
- SCAL-1 changes observable retry semantics. Roll out behind a pg_trickle.use_sqlstate_classification GUC (default false) initially, then flip to true in v0.31.0 after one release of parallel validation.
- UX-14 (CNPG 1.29) depends on CNPG 1.29 being available; if the release slips past the v0.30.0 window, defer to v1.0.0 but keep the tracking issue open.
- UX-15 (dbt 1.11) is purely a tracking concern — no code changes are required in this release.
Implementation Phases
| Phase | Description | Duration |
|---|---|---|
| Phase 1 | P0 correctness + stability: CORR-1, STAB-1, STAB-4, SCAL-1 | Days 1–6 |
| Phase 2 | P0 docs + tests: UX-1, UX-2, TEST-6, TEST-11 | Days 6–9 |
| Phase 3 | P1 stability + safety: STAB-2, STAB-3, STAB-5, STAB-6, CORR-2, CORR-3 | Days 9–14 |
| Phase 4 | P1 architecture: SCAL-2, PERF-1, PERF-2 | Days 14–19 |
| Phase 5 | P1/P2 test coverage: TEST-1 through TEST-19, PERF-3 | Days 19–27 |
| Phase 6 | P1/P2/P3 docs & UX: UX-3 through UX-18, TUI parity | Days 27–34 |
v0.30.0 total: ~7–8 weeks (correctness-critical path ~10 days; documentation and test coverage ~14 days; architecture improvements ~8 days; new test targets ~8 days; low-priority doc polish ~4 days)
Exit criteria:
- [x] CORR-1: test_tpch_q07_ec01b_combined_delete passes reliably over 50 runs; IMMEDIATE-mode property tests show zero phantom drift
- [x] CORR-2: Parser emits NOTICE when SubLink inside CASE/COALESCE forces FULL downgrade
- [x] CORR-3: restore_from_snapshot uses explicit column list; works on PG 18.0 and 18.x
- [x] STAB-1: Snapshot under crash leaves no orphan tables (TEST-1); restore under concurrent refresh is safe
- [x] STAB-2: IVM_DELTA_CACHE size bounded by template_cache_max_entries (TEST-5)
- [x] STAB-3: L2 template-cache catalog table purged by scheduler tick; cached_at column present in migration
- [x] STAB-4: snapshot_stream_table emits WARNING when catalog INSERT fails
- [x] STAB-5: wal_decoder.rs:307 uses c"test_decoding"; just lint clean
- [x] STAB-6: IVM cache cleared on subxact abort via XactCallback
- [x] SCAL-1: Retry classification by SQLSTATE; lc_messages=fr_FR test passes (TEST-6)
- [ ] PERF-3: All seven missing benchmarks present and passing Criterion baseline gate (deferred to v0.31.0)
- [x] UX-1: All Appendix C GUCs documented in CONFIGURATION.md
- [x] UX-2: UPGRADING.md covers v0.15.0 → v0.27.0
- [ ] UX-3: All new error variants have HINT/DETAIL in ERRORS.md (deferred to v0.31.0)
- [ ] UX-4: All eight SNAP/PLAN/CLUS/METR functions have full SQL_REFERENCE entries (deferred to v0.31.0)
- [ ] UX-5: TUI shows SNAP/PLAN/CLUS/METR panels (deferred to v0.31.0)
- [ ] UX-6: monitoring/grafana/pg_trickle.json ships and renders in Grafana 11+ (deferred to v0.31.0)
- [ ] UX-9: PRE_DEPLOYMENT.md documents change_buffer_durability and frontier_holdback_lsn_bytes (deferred to v0.31.0)
- [ ] UX-10: BACKUP_AND_RESTORE.md Known Limitations section added; concurrent-restore warning present (deferred to v0.31.0)
- [ ] UX-14: cnpg/cluster-example.yaml updated to CNPG 1.29; CI job green (deferred to v0.31.0)
- [ ] UX-15: dbt-core 1.11 tracking issue opened with mashumaro blocker documented (deferred to v0.31.0)
- [ ] TEST-7/8/9: New fuzz targets run in CI; no crashes after 24 h (deferred to v0.31.0)
- [ ] TEST-12: Snapshot under concurrent refresh test passes (deferred to v0.31.0)
- [ ] TEST-13: pg_dump → pg_restore round-trip test passes for snapshot tables (deferred to v0.31.0)
- [ ] TEST-14: G17-MDB multi-database soak test runs in ci.yml on every push to main (deferred to v0.31.0)
- [x] SCAL-3: Dead #[allow(unused_imports)] removed from refresh/orchestrator.rs; just lint clean
- [x] Extension upgrade path tested (0.29.0 → 0.30.0)
- [x] just check-version-sync passes