v0.24.0 — Join Correctness and Durability Hardening

Full technical details: v0.24.0.md-full.md

Status: ✅ Released | Scope: Large (~8–9 weeks)

The final resolution of the EC-01 join correctness bug, a two-phase commit that eliminates crash-recovery data loss, configurable change buffer durability, TOAST-aware change detection, and automatic cleanup of refresh history.


What problem does this solve?

Despite multiple partial fixes since v0.12.0, the EC-01 phantom row bug in multi-table LEFT/RIGHT/FULL JOINs was not fully resolved for all scenarios. Additionally, a crash between the change buffer TRUNCATE and the frontier update could cause changes to be replayed on restart. For deployments where the change buffer absolutely must survive a crash, there was no configuration option.


EC-01: Complete Join Correctness Fix

The root cause of EC-01 was that Part 1b of the join delta algorithm (which handles the case “left side changed, right side unchanged”) was computing its row identity hash differently from Part 1a (which handles “right side changed, left side unchanged”). When both parts ran, they should cancel each other out to zero net change for unchanged rows — but different hashes meant they did not cancel.

The fix ensures both parts use the same hash algorithm (left-side primary key only). Cross-cycle phantom rows — rows from previous refresh cycles that accumulated over time — are cleaned up by the PH-D1 phantom cleanup pass.

TPC-H Q15 is now re-enabled in IMMEDIATE mode. A 5,000-iteration property test verifies that no phantom accumulation occurs across any combination of INSERT, UPDATE, and DELETE sequences.

In plain terms: LEFT JOINs, RIGHT JOINs, and FULL OUTER JOINs are now permanently correct in differential mode under all combinations of changes.


Two-Phase Frontier Commit

The “frontier” is the watermark recording how far into the change history a stream table has been refreshed. Previously, the frontier was updated after the change buffer was truncated. A crash between truncation and frontier update meant changes would be replayed from the wrong position on restart.

The two-phase commit writes a tentative frontier before truncation and finalises it after the MERGE commits. On startup after a crash, any unfinished tentative frontier is reconciled. This eliminates the crash recovery window.


Configurable Change Buffer Durability

The pg_trickle.change_buffer_durability GUC provides three options:

  • unlogged (default) — fastest, but change buffer is empty after a crash (changes are recaptured from source tables)
  • logged — WAL-logged change buffers, survive crash with no data loss
  • sync — logged plus synchronous commit for maximum durability

Deployments with strict recovery requirements can now choose the right tradeoff between performance and durability.


TOAST-Aware CDC Hashing

PostgreSQL uses TOAST (The Oversized-Attribute Storage Technique) to store large values — long text, large JSON documents, binary data — out of line. The row-identity hash used by change capture was not correctly including TOAST column data, which meant that an UPDATE changing only a TOAST column (e.g. updating a JSONB column) could be silently missed.

The fix includes the size and a hash of TOAST column values in the row identity hash, ensuring that all column changes are captured regardless of their storage mechanism.

In plain terms: if your source tables use large JSON, XML, or binary columns, changes to those columns are now reliably detected.


Refresh History Retention

pg_trickle.refresh_history_retention_days (default 7) configures how long rows are kept in pgt_refresh_history. The background worker automatically prunes rows older than the retention period in small batches during idle ticks, preventing the history table from growing without bound.


Frozen Stream Table Detector

The df_frozen_stream_tables self-monitoring view flags any stream table that has not refreshed within 5× its configured refresh interval while its source tables are showing active change capture. This detects stuck or broken stream tables that are silently falling behind, and emits a pgtrickle_alert notification.


Scope

v0.24.0 closes the last known correctness gap (EC-01 join phantoms), adds a durability guarantee against crash-recovery data loss (two-phase frontier), and provides the operational tools (retention, frozen detector) to keep pg_trickle healthy over months of production operation.