v0.21.0 — Correctness, Safety, and Test Hardening

Full technical details: v0.21.0.md-full.md

Status: ✅ Released | Scope: Large (~6–8 weeks)

Closes the last known data-correctness gap in join delta computation, enforces a zero-crash guarantee across the codebase, expands unit test coverage in previously untested modules, and introduces a shadow/canary mode for safely testing query changes before going live.


What problem does this solve?

Systematic analysis of the codebase after v0.20.0 identified several categories of risk: a remaining join correctness bug (EC-01 phantom rows in multi-table LEFT/RIGHT JOINs), .unwrap() calls that could crash the PostgreSQL backend, modules with low test coverage, and the risk of disruption when modifying a production stream table’s query.


EC-01 JOIN Delta Phantom Row Fix

The EC-01 bug — phantom rows appearing in stream tables after specific sequences of DELETE and INSERT on multi-table JOINs — was first identified in v0.12.0 and partially addressed across several releases. v0.21.0 delivers the complete fix:

  • The row identity hash for Part 1b of the join delta algorithm (the “right-side unchanged” portion) is now computed correctly, ensuring both halves of the join delta emit the same row identifier and cancel each other out properly
  • Prior-cycle phantom rows are cleaned up by the refresh process
  • TPC-H Q07 (which exercises this pattern) is validated to pass deterministically across 5,000 randomised property test iterations

In plain terms: if your stream table computes a LEFT JOIN and rows are deleted and re-inserted on either side, the results are now always correct — never showing rows that should not be there.
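The cancellation invariant behind the fix can be shown with a toy model (names and types here are hypothetical illustrations, not the extension's actual internals): both halves of the join delta must derive a row's identity from the same canonical column set, so the delete half and the insert half of the same logical row hash identically and net out to zero.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Toy joined row (hypothetical shape).
#[derive(Hash)]
struct JoinedRow<'a> {
    left_key: i64,
    right_payload: &'a str,
}

/// Canonical row identity: every delta half MUST hash the same fields in
/// the same order. The EC-01 class of bug is one half hashing a different
/// (stale) column set, so deltas never cancel and phantom rows leak.
fn row_identity(row: &JoinedRow) -> u64 {
    let mut h = DefaultHasher::new();
    row.hash(&mut h);
    h.finish()
}

/// Apply signed delta counts; identities whose counts sum to zero vanish.
fn apply_deltas(deltas: &[(u64, i64)]) -> HashMap<u64, i64> {
    let mut net: HashMap<u64, i64> = HashMap::new();
    for &(id, count) in deltas {
        *net.entry(id).or_insert(0) += count;
    }
    net.retain(|_, c| *c != 0);
    net
}

fn main() {
    let row = JoinedRow { left_key: 7, right_payload: "unchanged" };
    // DELETE then re-INSERT of the same logical row: -1 then +1.
    let id = row_identity(&row);
    let net = apply_deltas(&[(id, -1), (id, 1)]);
    assert!(net.is_empty(), "deltas must cancel; leftovers are phantom rows");
}
```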


Zero .unwrap() in Production Code

A clippy::unwrap_used lint rule was added that fails the build if any .unwrap() call appears in non-test code. Every .unwrap() in the production code path was converted to proper error handling that returns a descriptive error to the caller instead of crashing the backend.
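The conversion pattern looks roughly like this (a minimal sketch; the error type and function names are invented for illustration, not the extension's actual API):

```rust
use std::collections::HashMap;
use std::fmt;

/// Hypothetical error type standing in for the extension's real one.
#[derive(Debug)]
struct TrickleError(String);

impl fmt::Display for TrickleError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "pg_trickle: {}", self.0)
    }
}

// Before (panics, crashing the backend, if the entry is missing):
//     let schedule = schedules.get(name).unwrap();
//
// After: return a descriptive error to the caller instead of panicking.
fn lookup_schedule<'a>(
    schedules: &'a HashMap<String, String>,
    name: &str,
) -> Result<&'a String, TrickleError> {
    schedules.get(name).ok_or_else(|| {
        TrickleError(format!("no refresh schedule found for stream table {name:?}"))
    })
}

fn main() {
    let mut schedules = HashMap::new();
    schedules.insert("orders_st".to_string(), "30s".to_string());
    assert_eq!(
        lookup_schedule(&schedules, "orders_st").ok().map(String::as_str),
        Some("30s")
    );
    assert!(lookup_schedule(&schedules, "missing").is_err());
}
```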

The unsafe code surface was reduced by 40% through the introduction of safe wrapper functions around PostgreSQL C internals.


Non-Deterministic Function Rejection

At stream table creation time, pg_trickle now rejects (or warns about) queries that use non-deterministic functions such as now() and random(), or volatile user-defined functions, unless the caller supplies an explicit non_deterministic => true acknowledgement. This prevents a whole class of subtle drift bugs before they reach production.
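Conceptually, the gate works like the sketch below (hypothetical and simplified: the real check would presumably consult PostgreSQL's pg_proc.provolatile catalog flag rather than match function names in text):

```rust
/// Well-known volatile builtins, matched by name for illustration only.
const VOLATILE_BUILTINS: &[&str] = &["now", "random", "clock_timestamp"];

fn volatile_calls(query: &str) -> Vec<String> {
    let lower = query.to_lowercase();
    VOLATILE_BUILTINS
        .iter()
        .filter(|f| lower.contains(&format!("{f}(")))
        .map(|f| f.to_string())
        .collect()
}

/// Mirror of the creation-time gate: reject unless the caller
/// acknowledged non-determinism with non_deterministic => true.
fn check_determinism(query: &str, acknowledged: bool) -> Result<(), String> {
    let hits = volatile_calls(query);
    if hits.is_empty() || acknowledged {
        Ok(())
    } else {
        Err(format!(
            "query uses non-deterministic function(s) {hits:?}; \
             pass non_deterministic => true to accept possible drift"
        ))
    }
}

fn main() {
    assert!(check_determinism("SELECT id, total FROM orders", false).is_ok());
    assert!(check_determinism("SELECT now(), id FROM orders", false).is_err());
    assert!(check_determinism("SELECT now(), id FROM orders", true).is_ok());
}
```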


Test Coverage Campaign

Three large, previously under-tested modules received comprehensive unit test coverage:

  • api/helpers.rs (25+ new tests) — query validation, schema helpers, CDC orchestration utilities
  • api/diagnostics.rs (15+ new tests) — explain_st, health_summary, cache_stats formatting
  • dvm/parser/rewrites.rs (30+ new tests) — all seven SQL rewrite passes

A parser fuzz target was added that feeds random SQL into the pg_trickle query parser; a one-hour fuzzing run completes without a single panic. Any panic would indicate a code path that could crash the backend on unexpected input.
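In miniature, the harness checks an invariant like the one below (a self-contained sketch: the parser stub and fragment list are invented, and the real target would use a proper fuzzing framework rather than this tiny seeded generator):

```rust
use std::panic;

/// Stand-in for the pg_trickle query parser (hypothetical): a robust
/// parser returns Err on garbage; any panic models a backend crash.
fn parse_sql(input: &str) -> Result<usize, String> {
    if input.is_empty() {
        return Err("empty query".to_string());
    }
    Ok(input.split_whitespace().count())
}

/// Tiny deterministic LCG so the sketch needs no external RNG crate.
fn lcg(state: &mut u64) -> u64 {
    *state = state
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407);
    *state
}

fn random_sql(state: &mut u64) -> String {
    const FRAGMENTS: &[&str] = &["SELECT", "FROM", "WHERE", ";;", "((", "' OR"];
    let len = lcg(state) % 8;
    (0..len)
        .map(|_| FRAGMENTS[(lcg(state) as usize) % FRAGMENTS.len()])
        .collect::<Vec<_>>()
        .join(" ")
}

/// Run `iterations` random inputs through the parser; count panics.
fn fuzz(iterations: u64) -> u64 {
    let mut state = 0xDEAD_BEEF;
    let mut panics = 0;
    for _ in 0..iterations {
        let input = random_sql(&mut state);
        if panic::catch_unwind(|| parse_sql(&input)).is_err() {
            panics += 1;
        }
    }
    panics
}

fn main() {
    assert_eq!(fuzz(10_000), 0, "parser must never panic on arbitrary input");
}
```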

A crash-recovery test kills the background worker mid-refresh and verifies that the database is left in a consistent state: no partially-applied refreshes, WAL decoder resumed from the correct position.
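The invariant that test enforces can be modeled as all-or-nothing refresh application (a toy sketch with invented types; the real mechanism is the database's own transactional machinery):

```rust
/// Toy model of crash-safe refresh: staged changes become visible only
/// on commit, so a mid-refresh kill leaves the prior state untouched.
#[derive(Clone, Debug, PartialEq)]
struct StreamTable {
    rows: Vec<i64>,
    wal_position: u64,
}

fn refresh(
    live: &StreamTable,
    new_rows: Vec<i64>,
    new_pos: u64,
    crash_midway: bool,
) -> StreamTable {
    // Stage the full refresh off to the side.
    let staged = StreamTable { rows: new_rows, wal_position: new_pos };
    if crash_midway {
        // Worker killed before commit: staged work is discarded; the
        // visible rows and the WAL resume position are both unchanged.
        return live.clone();
    }
    staged // atomic swap on commit
}

fn main() {
    let live = StreamTable { rows: vec![1, 2, 3], wal_position: 100 };
    let after_crash = refresh(&live, vec![4, 5], 200, true);
    assert_eq!(after_crash, live); // no partially-applied refresh
    let committed = refresh(&live, vec![4, 5], 200, false);
    assert_eq!(committed.wal_position, 200); // decoder resumes correctly
}
```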


Shadow / Canary Mode for Safer Query Changes

When you need to change a production stream table’s query, doing so directly is risky — the new query might produce different results or refresh more slowly.

Calling alter_stream_table(name, dry_run_shadow => true) creates a shadow copy of the stream table (pgt_shadow_<name>) that runs the new query on the same schedule as the live table. Operators can compare the two versions with pgtrickle.canary_diff(name) before committing the change. When satisfied, pgtrickle.canary_promote(name) atomically swaps the shadow into production.

In plain terms: test your query change on real production data alongside the live version, verify the results match your expectations, then flip the switch.
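What a canary comparison must compute reduces to a multiset diff of the two result sets (a conceptual sketch with invented names; the real pgtrickle.canary_diff() runs inside the database):

```rust
use std::collections::HashMap;

/// Compare live vs. shadow rows as multisets. Positive counts are rows
/// only in the live table, negative counts only in the shadow; an empty
/// result means the two versions agree and promotion is safe.
fn canary_diff(live: &[&str], shadow: &[&str]) -> HashMap<String, i64> {
    let mut diff: HashMap<String, i64> = HashMap::new();
    for row in live {
        *diff.entry(row.to_string()).or_insert(0) += 1;
    }
    for row in shadow {
        *diff.entry(row.to_string()).or_insert(0) -= 1;
    }
    diff.retain(|_, c| *c != 0);
    diff
}

fn main() {
    // Same rows, different order: no difference.
    assert!(canary_diff(&["a", "b"], &["b", "a"]).is_empty());

    // Divergent results: report both directions.
    let d = canary_diff(&["a", "b"], &["a", "c"]);
    assert_eq!(d.get("b"), Some(&1));  // only in live
    assert_eq!(d.get("c"), Some(&-1)); // only in shadow
}
```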


New Operational Helpers

  • pgtrickle.pause_all() / resume_all() — suspend all stream tables at once during maintenance
  • pgtrickle.refresh_if_stale(name, max_age) — only trigger a refresh if the stream table is older than a specified age
  • pgtrickle.stream_table_definition(name) — returns the full definition of a stream table for auditing
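The refresh_if_stale decision is a simple staleness threshold, roughly (a sketch with hypothetical types; the real helper takes an SQL interval and reads the table's recorded last-refresh time):

```rust
use std::time::{Duration, Instant};

/// Refresh only when the last successful refresh is older than max_age;
/// otherwise skip and leave the current contents in place.
fn should_refresh(last_refresh: Instant, now: Instant, max_age: Duration) -> bool {
    now.duration_since(last_refresh) > max_age
}

fn main() {
    let refreshed = Instant::now();
    let later = refreshed + Duration::from_secs(90);
    // 90 s old vs. a 60 s budget: stale, so refresh.
    assert!(should_refresh(refreshed, later, Duration::from_secs(60)));
    // 90 s old vs. a 120 s budget: still fresh, so skip.
    assert!(!should_refresh(refreshed, later, Duration::from_secs(120)));
}
```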

Prometheus HTTP Endpoint

The background worker now serves Prometheus metrics directly over HTTP (port configurable via pg_trickle.metrics_port), removing the need for a separate Prometheus exporter process.
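The endpoint speaks the standard Prometheus text exposition format, along these lines (metric names here are invented for illustration; consult the shipped metric list for the real ones):

```rust
/// Render two sample metrics in Prometheus text exposition format:
/// a HELP comment, a TYPE comment, then the sample line.
fn render_metrics(refresh_total: u64, lag_seconds: f64) -> String {
    let mut out = String::new();
    out.push_str("# HELP pg_trickle_refresh_total Completed refresh cycles.\n");
    out.push_str("# TYPE pg_trickle_refresh_total counter\n");
    out.push_str(&format!("pg_trickle_refresh_total {refresh_total}\n"));
    out.push_str("# HELP pg_trickle_lag_seconds Staleness of the most stale stream table.\n");
    out.push_str("# TYPE pg_trickle_lag_seconds gauge\n");
    out.push_str(&format!("pg_trickle_lag_seconds {lag_seconds}\n"));
    out
}

fn main() {
    let body = render_metrics(42, 1.5);
    assert!(body.contains("pg_trickle_refresh_total 42"));
    assert!(body.contains("# TYPE pg_trickle_lag_seconds gauge"));
}
```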


Performance Tuning Cookbook

A new docs/PERFORMANCE_COOKBOOK.md document consolidates all performance tuning advice — previously scattered across FAQ, TROUBLESHOOTING, and SCALING — into a single reference: symptom → likely cause → configuration to adjust → how to measure improvement.


Scope

v0.21.0 is a comprehensive quality release. The EC-01 fix closes the last known data-correctness gap. The zero-unwrap guarantee eliminates a class of potential backend crashes. The test coverage campaign and fuzz target significantly raise the floor on correctness confidence. Shadow/canary mode makes production query changes safer.