Contents
v0.34.0 Full Technical Details — Citus: Automated Distributed CDC & Shard Recovery
Status: Released (2026-04-26).
Goals
| ID | Goal | Priority |
|---|---|---|
| COORD-10 | Wire scheduler to call poll_worker_slot_changes() for distributed sources |
Required |
| COORD-11 | Wire scheduler to call ensure_worker_slot() on first tick and after topology changes |
Required |
| COORD-12 | Acquire + extend + release pgt_st_locks lease automatically in refresh cycle |
Required |
| COORD-13 | Detect pg_dist_node topology changes; auto-recover slots after rebalance |
Required |
| COORD-14 | Handle worker unreachability gracefully (skip + retry + alert) | Required |
| COORD-15 | Add pg_trickle.citus_worker_retry_ticks GUC (default 5) |
Required |
| COORD-16 | Extend citus_status view with lease health column and last-poll timestamp |
Medium |
| COORD-17 | Unit tests: topology change detection logic | Required |
| COORD-18 | Unit tests: shard rebalance detection edge cases | Medium |
Scheduler Changes
refresh_single_st Citus extension point
if source_placement == 'distributed' && is_citus_loaded():
1. acquire_st_lock(lock_key, holder, citus_st_lock_lease_ms)
2. for each worker in worker_nodes():
ensure_worker_slot(worker, slot_name)
poll_worker_slot_changes(worker, slot_name, max_changes, src)
3. extend_st_lock() if refresh duration > lease_ms / 2
4. [existing local refresh logic]
5. release_st_lock()
Topology change detection
On each scheduler tick:
SELECT array_agg(nodename ORDER BY nodeid)
FROM pg_dist_node
WHERE isactive AND noderole = 'primary'
Compare against pgt_worker_slots.worker_name set for this stream table. If
different: drop stale pgt_worker_slots rows, set needs_reinit = true on
affected stream tables.
New GUC
| GUC | Default | Description |
|---|---|---|
pg_trickle.citus_worker_retry_ticks |
5 |
Consecutive worker-poll failures before flagging in citus_status |
Migration: 0.33.0 → 0.34.0
No schema changes. The new scheduler behavior activates automatically for
stream tables with source_placement = 'distributed'. Operators should
remove any manual LISTEN + handle_vp_promoted() application logic — it is
now redundant (though harmless to leave in place).
Testing
- Unit tests: topology change detection logic —
test_topology_no_change,test_topology_worker_added,test_topology_worker_removed,test_topology_empty_recorded_live_workers,test_topology_both_empty,test_topology_port_change(all insrc/citus.rs). - COORD-17/18 full integration tests (3-node Citus Docker-Compose cluster) are planned for a future release when the CI Citus cluster is available.
Implementation Status
| ID | Title | Status |
|---|---|---|
| COORD-10 | Wire scheduler poll_worker_slot_changes() |
✅ Done |
| COORD-11 | Wire scheduler ensure_worker_slot() |
✅ Done |
| COORD-12 | Automatic pgt_st_locks lease lifecycle |
✅ Done |
| COORD-13 | Topology change detection + slot reconciliation | ✅ Done |
| COORD-14 | Worker failure handling (skip + retry + alert) | ✅ Done |
| COORD-15 | citus_worker_retry_ticks GUC |
✅ Done |
| COORD-16 | Extended citus_status view |
✅ Done |
| COORD-17 | Unit tests: topology detection | ✅ Done |
| COORD-18 | Unit tests: rebalance edge cases | ✅ Done |
Exit Criteria
- [x] All P0 items ✅ Done
- [x]
just test-unitpasses - [x]
just check-version-syncexits 0 - [x] CHANGELOG.md entry written
- [x] ROADMAP.md v0.34.0 row marked ✅ Released