Plain-language companion: v0.24.0.md
v0.24.0 — Join Correctness & Durability Hardening
Status: Released (2026-04-20). Sourced from PLAN_OVERALL_ASSESSMENT_2.md §3, §4, §6.
Release Theme
This release closes the remaining critical correctness bugs and
data-durability gaps identified in the v0.23.0 deep assessment.
The EC-01 join phantom-row bug — deferred since v0.21.0 — is finally
resolved, restoring full DIFFERENTIAL correctness for multi-table
LEFT/RIGHT/FULL JOINs under mixed DML. Change-buffer durability
becomes configurable, and a two-phase frontier commit eliminates the
crash-replay window. Supporting work includes TOAST-aware CDC hashing,
partitioned-source publication health checks, history retention, and
a unit-test campaign for the v0.21–v0.23 surface area.
EC-01 Join Correctness Fix
| Item |
Description |
Effort |
Ref |
|---|
| EC01-1 |
Row-id hash convergence for Part 1b. Modify src/dvm/operators/join.rs so the Part 1b arm (Δ⋈R₀) hashes only the left-side PK, ensuring both Part 1a and 1b emit the same __pgt_row_id for a given logical row. |
3d |
PLAN_OVERALL_ASSESSMENT_2.md §3 #1 |
| EC01-2 |
PH-D1 cross-cycle phantom cleanup. Extend the PH-D1 delete path in src/refresh/phd1.rs to reconcile orphaned row ids from prior cycles, not just the current delta. |
2d |
PLAN_OVERALL_ASSESSMENT_2.md §3 #1 |
| EC01-3 |
Remove Q15 from IMMEDIATE_SKIP_ALLOWLIST. Re-enable TPC-H Q15 in IMMEDIATE mode correctness tests after EC01-½ land. |
0.5d |
PLAN_OVERALL_ASSESSMENT_2.md §3 #1 |
| EC01-4 |
Proptest harness for join cross-cycle convergence. 5,000-iteration property test asserting INSERT/UPDATE/DELETE sequences on multi-table JOINs converge to the same result as a full refresh. |
2d |
PLAN_OVERALL_ASSESSMENT_2.md §3 #1 |
Durability & Frontier Atomicity
| Item |
Description |
Effort |
Ref |
|---|
| DUR-1 |
Two-phase frontier commit. Write a tentative frontier to a side column before TRUNCATE; finalise after MERGE commits; reconcile on startup. Unifies the manual-refresh and scheduler code paths. |
5d |
PLAN_OVERALL_ASSESSMENT_2.md §3 #2 |
| DUR-2 |
pg_trickle.change_buffer_durability GUC. New GUC with values unlogged (default, current behaviour), logged (WAL-logged change buffers), sync (logged + synchronous commit). |
3d |
PLAN_OVERALL_ASSESSMENT_2.md §3 #2 |
| DUR-3 |
Crash-recovery E2E test for frontier consistency. Kill bgworker between TRUNCATE and frontier-store; assert no phantom replays or lost rows on restart. |
2d |
PLAN_OVERALL_ASSESSMENT_2.md §3 #2 |
CDC Hardening
| Item |
Description |
Effort |
Ref |
|---|
| CDC-1 |
Eliminate two unwrap() sites in src/cdc.rs. Convert build_changed_cols_bitmask_expr().unwrap() to ? with a new PgTrickleError::ChangedColsBitmaskFailed variant. |
0.5d |
PLAN_OVERALL_ASSESSMENT_2.md §3 #3 |
| CDC-2 |
Partitioned-source publication rebuild. On scheduler tick, compare pg_publication_tables.pubviaroot against source relkind = 'p'; rebuild publication with publish_via_partition_root = true if mismatched. Emit refresh_reason = 'publication_rebuild'. |
3d |
PLAN_OVERALL_ASSESSMENT_2.md §3 #4 |
| CDC-3 |
TOAST-aware CDC hashing. Include pg_column_size() for TOASTable columns (attstorage IN ('e', 'x')) in the row-id hash to detect in-place TOAST rewrites. |
3d |
PLAN_OVERALL_ASSESSMENT_2.md §3 #5 |
| CDC-4 |
TOAST workload E2E tests. Add jsonb-update and bytea-update scenarios to tests/e2e_cdc_edge_case_tests.rs. |
1d |
PLAN_OVERALL_ASSESSMENT_2.md §6 |
Operational Improvements
| Item |
Description |
Effort |
Ref |
|---|
| OPS-1 |
pg_trickle.refresh_history_retention_days GUC. Default 7 days. Bgworker prunes stale rows in 1k-row batches during idle ticks. |
2d |
PLAN_OVERALL_ASSESSMENT_2.md §4 |
| OPS-2 |
Frozen-stream-table detector. New self-monitoring view df_frozen_stream_tables that flags any ST whose last_refresh_at < now() - 5 × refresh_interval with recent CDC activity. Alert via pgtrickle_alert NOTIFY. |
2d |
PLAN_OVERALL_ASSESSMENT_2.md §4 |
| OPS-3 |
Missing internal catalog indexes. Add composite indexes on pgt_stream_tables(status, scc_id), pgt_refresh_history(pgt_id, action, data_timestamp), pgt_change_tracking(source_relid), and a partial index on changes_<oid>(__pgt_action). |
1d |
PLAN_OVERALL_ASSESSMENT_2.md §5 |
Test Coverage (TEST-6/7/8)
| Item |
Description |
Effort |
Ref |
|---|
| TEST-6 |
Unit tests for src/api/publication.rs. Cover fit_linear_regression, predict_diff_duration_ms, should_preempt_to_full, assign_tier_for_sla, maybe_adjust_tier_for_sla, boundary cases (0, negative, NaN). 25+ tests. |
3d |
PLAN_OVERALL_ASSESSMENT_2.md §6 |
| TEST-7 |
Unit tests for src/api/diagnostics.rs. Cover explain_query_rewrite, diagnose_errors, validate_query, 5 gather_* helpers. 20+ tests. |
2d |
PLAN_OVERALL_ASSESSMENT_2.md §6 |
| TEST-8 |
Unit tests for src/metrics_server.rs. Cover port-conflict handling, timeout behaviour, malformed HTTP request, OpenMetrics format conformance. 10+ tests. |
1d |
PLAN_OVERALL_ASSESSMENT_2.md §6 |
Implementation Phases
| Phase |
Description |
Duration |
|---|
| Phase 1 |
EC-01 fix: row-id hash convergence + PH-D1 cleanup + proptest |
Days 1–8 |
| Phase 2 |
Durability: two-phase frontier + change_buffer_durability GUC + crash test |
Days 8–18 |
| Phase 3 |
CDC hardening: unwrap removal, publication rebuild, TOAST hashing + tests |
Days 18–26 |
| Phase 4 |
Operational: history retention, frozen-ST detector, catalog indexes |
Days 26–31 |
| Phase 5 |
Test campaign: TEST-6/7/8 unit tests for publication, diagnostics, metrics |
Days 31–37 |
| Phase 6 |
Integration testing, documentation, upgrade script |
Days 37–42 |
v0.24.0 total: ~8–9 weeks (~42 person-days solo)
Exit criteria:
- [x] EC01-1: Part 1b arm hashes left-side PK only; TPC-H Q07 passes multi-cycle correctness
- [x] EC01-2: PH-D1 cleans up prior-cycle phantoms; no residual rows after 10 cycles
- [x] EC01-3: Q15 removed from IMMEDIATE_SKIP_ALLOWLIST; TPC-H Q15 passes IMMEDIATE mode
- [x] EC01-4: 5,000-iteration proptest passes for JOIN convergence
- [x] DUR-1: Two-phase frontier commit implemented; manual and scheduler paths unified
- [x] DUR-2: change_buffer_durability = 'logged' creates WAL-logged change buffers; 'unlogged' preserves current behaviour
- [x] DUR-3: Crash-recovery E2E: kill bgworker mid-refresh → restart → zero lost/duplicated rows
- [x] CDC-1: Zero unwrap() calls in src/cdc.rs production paths
- [x] CDC-2: Converting a source table to partitioned triggers automatic publication rebuild
- [x] CDC-3: TOAST-only column update detected and propagated in DIFFERENTIAL mode
- [x] CDC-4: jsonb + bytea TOAST E2E tests pass
- [x] OPS-1: History older than retention_days is pruned automatically; GUC documented
- [x] OPS-2: Frozen-ST detector fires alert when ST stalls with active CDC source
- [x] OPS-3: Internal catalog indexes exist; scheduler tick time reduced at 100+ STs
- [x] TEST-6: 25+ publication.rs unit tests pass (predictive model boundary cases)
- [x] TEST-7: 20+ diagnostics.rs unit tests pass
- [x] TEST-8: 10+ metrics_server.rs unit tests pass (port conflict, timeout, format)
- [x] Extension upgrade path tested (0.23.0 → 0.24.0)
- [x] just check-version-sync passes