# v0.34.0 — Citus: Automated Distributed CDC & Shard Recovery

Full technical details: v0.34.0.md-full.md

Status: Planned | Scope: Medium

Complete the Citus distributed CDC story: wire the scheduler to automatically poll per-worker WAL slots, auto-recover from shard rebalances, and extend the `pgt_st_locks` lease during long refreshes — making distributed stream tables fully hands-off to operate.
## What is this?
v0.33.0 shipped all the infrastructure for distributed Citus CDC — per-worker WAL slots, `pgt_st_locks` catalog coordination, `poll_worker_slot_changes`, and `handle_vp_promoted`. But the scheduler’s main loop did not yet call any of it; operators had to wire things up manually.

v0.34.0 closes that gap. The scheduler becomes fully aware of distributed sources and drives the per-worker slot lifecycle automatically.
## Automated scheduler integration
When a stream table source has `source_placement = 'distributed'`, the scheduler now:

- Calls `ensure_worker_slot()` on the first tick after a source is registered (or after a rebalance), creating the logical replication slot on each active Citus worker via `dblink`.
- Calls `poll_worker_slot_changes()` before the local refresh step, draining each worker slot into the coordinator-local change buffer.
- Acquires a `pgt_st_locks` lease (duration = `pg_trickle.citus_st_lock_lease_ms`) before touching any worker slot, and calls `extend_st_lock()` periodically during long refreshes so the lease does not expire mid-apply.
- Releases the lease via `release_st_lock()` after the refresh completes.
This makes the end-to-end flow — from VP table promotion through per-worker slot creation, change polling, and coordinated apply — fully automatic.
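One tick of that flow can be sketched in plpgsql. The function names come from this release's description; the `st_id` argument, the exact signatures, and the `acquire_st_lock()` helper are hypothetical, since only the lease, extend, and release operations are named here:

```sql
-- Illustrative sketch of one scheduler tick for one distributed source.
-- Signatures and 'my_stream_table' are assumptions for illustration.
DO $$
DECLARE
  lease_ms int := current_setting('pg_trickle.citus_st_lock_lease_ms')::int;
BEGIN
  -- 1. Ensure a logical replication slot exists on every active worker.
  PERFORM ensure_worker_slot('my_stream_table');

  -- 2. Take the pgt_st_locks lease before touching any worker slot
  --    (acquire_st_lock is a hypothetical name for this step).
  PERFORM acquire_st_lock('my_stream_table', lease_ms);

  -- 3. Drain each worker slot into the coordinator-local change buffer.
  PERFORM poll_worker_slot_changes('my_stream_table');

  -- 4. Run the local refresh; a long refresh would periodically call
  --    extend_st_lock('my_stream_table') so the lease cannot expire
  --    mid-apply.
  PERFORM refresh_single_st('my_stream_table');

  -- 5. Release the lease once the refresh completes.
  PERFORM release_st_lock('my_stream_table');
END $$;
```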
## Shard rebalance auto-recovery
In v0.33.0, a Citus shard rebalance invalidated per-worker WAL slots and required manual table recreation. v0.34.0 changes this:
- The scheduler detects topology changes by comparing the current `pg_dist_node` snapshot against the set stored in `pgt_worker_slots`.
- When a change is detected, it drops stale slot entries from `pgt_worker_slots` and marks affected stream tables for a full refresh.
- On the next tick, `ensure_worker_slot()` recreates slots on any new workers, and the full refresh catches up from scratch.
Operators no longer need to intervene after a `citus_rebalance_start()`.
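The topology comparison can be sketched as a set difference. `pg_dist_node` is the real Citus metadata table; the `node_name`/`node_port` columns on `pgt_worker_slots` are assumed here for illustration:

```sql
-- Sketch: workers present in the Citus topology but missing a slot entry.
-- A non-empty result (in either direction of the EXCEPT) indicates a
-- rebalance and triggers slot cleanup plus a full refresh.
SELECT n.nodename, n.nodeport
FROM pg_dist_node n
WHERE n.isactive AND n.noderole = 'primary'
EXCEPT
SELECT s.node_name, s.node_port   -- assumed column names
FROM pgt_worker_slots s;
```

Running the same query with the two sides swapped finds stale entries in `pgt_worker_slots` whose worker has left the cluster.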
## Worker failure handling
If `poll_worker_slot_changes()` fails (worker unreachable, network timeout):

- The scheduler logs the error and skips that worker’s changes for the tick.
- On the next tick it retries automatically.
- If the same worker fails for more than `pg_trickle.citus_worker_retry_ticks` consecutive ticks, the stream table is flagged in `citus_status` for operator attention, but refreshes continue against healthy workers.
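The consecutive-failure bookkeeping could be tracked per slot entry. The `failure_count` column and the exact shape of `pgt_worker_slots` are assumptions for illustration; only the GUC name comes from this release:

```sql
-- Sketch: bump the consecutive-failure counter for an unreachable worker
-- ('worker-2', 5432, and failure_count are illustrative).
UPDATE pgt_worker_slots
SET failure_count = failure_count + 1
WHERE node_name = 'worker-2' AND node_port = 5432;

-- A successful poll would reset the counter:
--   UPDATE pgt_worker_slots SET failure_count = 0 WHERE ...;

-- Flag stream tables whose worker crossed the retry threshold, so they
-- surface in citus_status for operator attention.
UPDATE pgt_worker_slots
SET needs_attention = true   -- assumed flag column
WHERE failure_count >
      current_setting('pg_trickle.citus_worker_retry_ticks')::int;
```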
## Scope
v0.34.0 is a medium-sized release. The scheduler integration touches `refresh_single_st` and the coordinator tick watermark logic. Changes are gated behind `is_citus_loaded()` so there is zero impact on non-Citus deployments.
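The gating amounts to a guard in the tick path. `is_citus_loaded()` and `source_placement` are named above; the surrounding loop variable is illustrative:

```sql
-- Sketch: the distributed code path is entered only when Citus is loaded
-- and the source is actually distributed; plain deployments skip it.
IF is_citus_loaded() AND src.source_placement = 'distributed' THEN
  PERFORM ensure_worker_slot(src.st_id);          -- src is illustrative
  PERFORM poll_worker_slot_changes(src.st_id);
END IF;
```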
Previous: v0.33.0 — Citus: World-Class Distributed Source CDC
Next: v0.35.0 — Reactive Subscriptions & Zero-Downtime Operations