v0.34.0 Full Technical Details — Citus: Automated Distributed CDC & Shard Recovery

Status: Released (2026-04-26).

Goals

| ID | Goal | Priority |
|----|------|----------|
| COORD-10 | Wire scheduler to call poll_worker_slot_changes() for distributed sources | Required |
| COORD-11 | Wire scheduler to call ensure_worker_slot() on first tick and after topology changes | Required |
| COORD-12 | Acquire + extend + release pgt_st_locks lease automatically in refresh cycle | Required |
| COORD-13 | Detect pg_dist_node topology changes; auto-recover slots after rebalance | Required |
| COORD-14 | Handle worker unreachability gracefully (skip + retry + alert) | Required |
| COORD-15 | Add pg_trickle.citus_worker_retry_ticks GUC (default 5) | Required |
| COORD-16 | Extend citus_status view with lease health column and last-poll timestamp | Medium |
| COORD-17 | Unit tests: topology change detection logic | Required |
| COORD-18 | Unit tests: shard rebalance detection edge cases | Medium |

Scheduler Changes

refresh_single_st Citus extension point

if source_placement == 'distributed' && is_citus_loaded():
    1. acquire_st_lock(lock_key, holder, citus_st_lock_lease_ms)
    2. for each worker in worker_nodes():
           ensure_worker_slot(worker, slot_name)
           poll_worker_slot_changes(worker, slot_name, max_changes, src)
    3. extend_st_lock() if refresh duration > lease_ms / 2
    4. [existing local refresh logic]
    5. release_st_lock()
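The extension point above can be sketched in Rust as follows. All function signatures here are illustrative stand-ins (the real scheduler functions take connection handles and return Results); the lock key, holder name, and max-changes value are hypothetical:

```rust
// Sketch of the distributed refresh cycle, assuming simplified signatures.
struct Lease {
    lease_ms: u64,
}

// Stand-ins for the real scheduler primitives.
fn acquire_st_lock(_key: &str, _holder: &str, lease_ms: u64) -> Lease {
    Lease { lease_ms }
}
fn ensure_worker_slot(_worker: &str, _slot: &str) {}
fn poll_worker_slot_changes(_worker: &str, _slot: &str, _max_changes: u32) {}
fn extend_st_lock(_lease: &Lease) {}
fn release_st_lock(_lease: Lease) {}

/// Runs one distributed refresh cycle; returns true when the lease was
/// extended because the refresh consumed more than half of it.
fn refresh_distributed_st(
    workers: &[&str],
    slot_name: &str,
    lease_ms: u64,
    elapsed_ms: u64,
) -> bool {
    // 1. Acquire the stream-table lease.
    let lease = acquire_st_lock("st:example", "scheduler-1", lease_ms);

    // 2. Ensure and poll a slot on every worker.
    for worker in workers {
        ensure_worker_slot(worker, slot_name);
        poll_worker_slot_changes(worker, slot_name, 10_000);
    }

    // 3. Extend the lease if the refresh has taken more than lease_ms / 2.
    let extended = elapsed_ms > lease.lease_ms / 2;
    if extended {
        extend_st_lock(&lease);
    }

    // 4. [existing local refresh logic would run here]

    // 5. Release the lease.
    release_st_lock(lease);
    extended
}
```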

Topology change detection

On each scheduler tick:

SELECT array_agg(nodename ORDER BY nodeid)
FROM pg_dist_node
WHERE isactive AND noderole = 'primary'

Compare the result against the recorded pgt_worker_slots.worker_name set for this stream table. If the sets differ, drop the stale pgt_worker_slots rows and set needs_reinit = true on the affected stream tables.
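The set comparison can be sketched as a pure helper; diff_topology is a hypothetical name, and the real detection logic in src/citus.rs operates on the query result and catalog rows rather than string slices:

```rust
use std::collections::BTreeSet;

/// Compare the live primary workers (from pg_dist_node) against the worker
/// set recorded in pgt_worker_slots for one stream table. Returns the stale
/// recorded workers to drop, plus whether the topology changed at all
/// (added, removed, or renamed workers), which triggers needs_reinit = true.
fn diff_topology<'a>(live: &[&'a str], recorded: &[&'a str]) -> (Vec<&'a str>, bool) {
    let live_set: BTreeSet<&str> = live.iter().copied().collect();
    let recorded_set: BTreeSet<&str> = recorded.iter().copied().collect();

    // Workers we recorded slots for that are no longer live primaries.
    let stale: Vec<&str> = recorded_set.difference(&live_set).copied().collect();

    // Any difference in either direction counts as a topology change.
    let changed = live_set != recorded_set;
    (stale, changed)
}
```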

New GUC

| GUC | Default | Description |
|-----|---------|-------------|
| pg_trickle.citus_worker_retry_ticks | 5 | Consecutive worker-poll failures before flagging in citus_status |
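The retry semantics implied by the GUC can be sketched as a per-worker counter; the WorkerRetry type and record method are hypothetical names for illustration:

```rust
/// Tracks consecutive poll failures for one worker. A successful poll resets
/// the counter; once failures reach citus_worker_retry_ticks (default 5),
/// the worker is flagged in citus_status.
struct WorkerRetry {
    consecutive_failures: u32,
    retry_ticks: u32,
}

impl WorkerRetry {
    fn new(retry_ticks: u32) -> Self {
        Self { consecutive_failures: 0, retry_ticks }
    }

    /// Record the outcome of one poll tick; returns true when the worker
    /// should be flagged as unhealthy.
    fn record(&mut self, poll_ok: bool) -> bool {
        if poll_ok {
            self.consecutive_failures = 0;
        } else {
            self.consecutive_failures += 1;
        }
        self.consecutive_failures >= self.retry_ticks
    }
}
```

Because only consecutive failures count, a transiently unreachable worker that recovers within five ticks never trips the flag.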

Migration: 0.33.0 → 0.34.0

No schema changes. The new scheduler behavior activates automatically for stream tables with source_placement = 'distributed'. Operators should remove any manual LISTEN + handle_vp_promoted() application logic — it is now redundant (though harmless to leave in place).

Testing

  • Unit tests: topology change detection logic — test_topology_no_change, test_topology_worker_added, test_topology_worker_removed, test_topology_empty_recorded_live_workers, test_topology_both_empty, test_topology_port_change (all in src/citus.rs).
  • COORD-17/18 full integration tests (3-node Citus Docker-Compose cluster) are planned for a future release when the CI Citus cluster is available.

Implementation Status

| ID | Title | Status |
|----|-------|--------|
| COORD-10 | Wire scheduler poll_worker_slot_changes() | ✅ Done |
| COORD-11 | Wire scheduler ensure_worker_slot() | ✅ Done |
| COORD-12 | Automatic pgt_st_locks lease lifecycle | ✅ Done |
| COORD-13 | Topology change detection + slot reconciliation | ✅ Done |
| COORD-14 | Worker failure handling (skip + retry + alert) | ✅ Done |
| COORD-15 | citus_worker_retry_ticks GUC | ✅ Done |
| COORD-16 | Extended citus_status view | ✅ Done |
| COORD-17 | Unit tests: topology detection | ✅ Done |
| COORD-18 | Unit tests: rebalance edge cases | ✅ Done |

Exit Criteria

  • [x] All P0 items ✅ Done
  • [x] just test-unit passes
  • [x] just check-version-sync exits 0
  • [x] CHANGELOG.md entry written
  • [x] ROADMAP.md v0.34.0 row marked ✅ Released