Plain-language companion: v0.20.0.md
v0.20.0 — Self-Monitoring (pg_trickle Monitors Itself)
Status: Released (2026-04-15). 61 of 62 items implemented; 1 skipped (PERF-6, already shipped in v0.19.0).
Release Theme

This release implements self-monitoring: pg_trickle uses its own stream tables to maintain reactive analytics over its internal catalog and refresh-history tables. Five self-monitoring stream tables (`df_efficiency_rolling`, `df_anomaly_signals`, `df_threshold_advice`, `df_cdc_buffer_trends`, `df_scheduling_interference`) replace repeated full-scan diagnostic functions with continuously maintained incremental views, enable multi-cycle trend detection for threshold tuning, and surface anomalies reactively. An optional auto-apply policy layer can automatically adjust `auto_threshold` when confidence is high. This validates pg_trickle on its own non-trivial workload and demonstrates the incremental-analytics value proposition to users. See plans/PLAN_SELF_MONITORING.md for the full design, architecture, and risk analysis.
Phase 1 — Foundation
| Item | Description | Effort | Ref |
|---|---|---|---|
| DF-F1 | Verify CDC on pgt_refresh_history. Confirm that create_stream_table() installs INSERT triggers on pgt_refresh_history. Fix schema-exclusion logic if the pgtrickle schema is skipped. | 2–4h | PLAN_SELF_MONITORING.md §7 Phase 1 |
| DF-F2 | Create df_efficiency_rolling (DF-1). Maintained rolling-window aggregates over pgt_refresh_history. Replaces refresh_efficiency() full scans. | 2–4h | PLAN_SELF_MONITORING.md §5 DF-1 |
| DF-F3 | E2E test: DF-1 output matches refresh_efficiency(). Insert synthetic history rows, refresh DF-1, assert aggregates agree. | 2–4h | PLAN_SELF_MONITORING.md §8 |
| DF-F4 | pgtrickle.setup_self_monitoring() helper. Single SQL call that creates all five df_* stream tables. | 2–4h | PLAN_SELF_MONITORING.md §7 Phase 4 |
| DF-F5 | pgtrickle.teardown_self_monitoring() helper. Drops all df_* stream tables cleanly. | 1h | PLAN_SELF_MONITORING.md §7 Phase 4 |
Phase 2 — Anomaly Detection
| Item | Description | Effort | Ref |
|---|---|---|---|
| DF-A1 | Create df_anomaly_signals (DF-2). Detects duration spikes, error bursts, and mode oscillation by comparing recent behavior against DF-1 baselines. | 3–5h | PLAN_SELF_MONITORING.md §5 DF-2 |
| DF-A2 | Create df_threshold_advice (DF-3). Multi-cycle threshold recommendation replacing the single-step compute_adaptive_threshold() convergence. | 3–5h | PLAN_SELF_MONITORING.md §5 DF-3 |
| DF-A3 | Verify DAG ordering. DF-1 refreshes before DF-2 and DF-3. | 1–2h | PLAN_SELF_MONITORING.md §7 Phase 2 |
| DF-A4 | E2E test: threshold spike detection. Inject synthetic history making DIFF consistently fast; assert DF-3 recommends raising the threshold. | 2–4h | PLAN_SELF_MONITORING.md §8 |
| DF-A5 | E2E test: anomaly duration spike. Inject a 3× duration spike; assert DF-2 detects it. | 2–4h | PLAN_SELF_MONITORING.md §8 |
Phase 3 — CDC Buffer & Interference
| Item | Description | Effort | Ref |
|---|---|---|---|
| DF-C1 | Create df_cdc_buffer_trends (DF-4). Tracks change-buffer growth rates per source table. May require a pgtrickle.cdc_buffer_row_counts() helper for dynamic table names. | 4–8h | PLAN_SELF_MONITORING.md §5 DF-4 |
| DF-C2 | Create df_scheduling_interference (DF-5). Detects concurrent refresh overlap. FULL-refresh mode initially (bounded 1-hour window). | 3–5h | PLAN_SELF_MONITORING.md §5 DF-5 |
| DF-C3 | E2E test: scheduling overlap detection. Create 3 STs with overlapping schedules; verify DF-5 detects overlap. | 2–4h | PLAN_SELF_MONITORING.md §8 |
Phase 4 — GUC & Auto-Apply
| Item | Description | Effort | Ref |
|---|---|---|---|
| DF-G1 | pg_trickle.self_monitoring_auto_apply GUC. Values: off (default) / threshold_only / full. Registered in src/config.rs. | 1–2h | PLAN_SELF_MONITORING.md §6.2 |
| DF-G2 | Auto-apply worker (threshold_only). Post-tick hook reads df_threshold_advice; applies ALTER STREAM TABLE ... SET auto_threshold = <recommended> when confidence is HIGH and delta > 5%. Rate-limited to 1 change per ST per 10 minutes. | 4–8h | PLAN_SELF_MONITORING.md §7 Phase 5 |
| DF-G3 | initiated_by = 'SELF_MONITOR' audit trail. Log auto-apply changes to pgt_refresh_history. | 1–2h | PLAN_SELF_MONITORING.md §7 Phase 5 |
| DF-G4 | E2E test: auto-apply threshold. Enable threshold_only, inject history making DIFF consistently faster, verify the threshold increases automatically. | 2–4h | PLAN_SELF_MONITORING.md §8 |
| DF-G5 | E2E test: rate limiting. Verify no more than 1 threshold change per ST per 10 minutes. | 1–2h | PLAN_SELF_MONITORING.md §8 |
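The DF-G2/DF-G5 rate limit (at most one auto-applied change per stream table per 10 minutes) is easy to get subtly wrong at window boundaries. A minimal Rust sketch of the per-ST window check, assuming a caller-supplied monotonic millisecond clock; the type and method names are illustrative, not the actual worker code:

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Illustrative rate limiter: at most one auto-applied threshold change
/// per stream table per `window`. Not the shipped pg_trickle code.
pub struct AutoApplyLimiter {
    window: Duration,
    // ST name -> timestamp (ms on the caller's monotonic clock) of the
    // last change that was actually applied.
    last_applied: HashMap<String, u64>,
}

impl AutoApplyLimiter {
    pub fn new(window: Duration) -> Self {
        Self { window, last_applied: HashMap::new() }
    }

    /// Returns true (and records the change) if `st` may be altered at `now_ms`.
    pub fn try_apply(&mut self, st: &str, now_ms: u64) -> bool {
        let window_ms = self.window.as_millis() as u64;
        match self.last_applied.get(st) {
            // Inside the window since the last applied change: refuse.
            Some(&last) if now_ms.saturating_sub(last) < window_ms => false,
            _ => {
                self.last_applied.insert(st.to_string(), now_ms);
                true
            }
        }
    }
}
```

Note the limiter is keyed per ST, so a change on one stream table never starves another.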
Phase 5 — Operational Diagnostics
| Item | Description | Effort | Ref |
|---|---|---|---|
| OPS-1 | pgtrickle.recommend_refresh_mode(st_name). Reads df_threshold_advice to return a structured recommendation { mode, confidence, reason } rather than computing on demand. | 2–4h | PLAN_SELF_MONITORING.md §10.6 |
| OPS-2 | check_cdc_health() spill-risk enrichment. Query the df_cdc_buffer_trends growth rate; emit a spill_risk alert when buffer growth will breach spill_threshold_blocks within 2 cycles. | 2–4h | PLAN_SELF_MONITORING.md §10.3 |
| OPS-3 | pgtrickle.scheduler_overhead() diagnostic function. Returns busy-time ratio, queue depth, avg dispatch latency, and the fraction of CPU spent on DF STs vs user STs. | 2–4h | — |
| OPS-4 | pgtrickle.explain_dag() — Mermaid/DOT output. Returns the DAG as Mermaid markdown with node colours: user = blue, self-monitoring = green, suspended = red. | 3–4h | — |
| OPS-5 | sql/self_monitoring_setup.sql quick-start template. Runnable script: call setup_self_monitoring(), set self_monitoring_auto_apply = 'threshold_only', configure LISTEN, query initial recommendations. | 1h | — |
| OPS-6 | Workload-aware poll intervals via the DF-5 signal. Replace compute_adaptive_poll_ms() exponential backoff with pre-emptive dispatch-interval widening when df_scheduling_interference detects contention. | 2–4h | PLAN_SELF_MONITORING.md §10.2 |
| DASH-1 | Grafana Self-Monitoring Dashboard. New monitoring/grafana/dashboards/pg_trickle_self_monitoring.json — 5 panels reading from DF-1 through DF-5. | 4–6h | PLAN_SELF_MONITORING.md §10.5 |
| DBT-1 | dbt pgtrickle_enable_monitoring post-hook macro. Calls setup_self_monitoring() automatically after a successful dbt run; documented in dbt-pgtrickle/. | 2h | — |
OPS-1 — pgtrickle.recommend_refresh_mode(st_name text)
Reads directly from `df_threshold_advice` instead of computing a single-cycle cost comparison on demand (PLAN_SELF_MONITORING.md §10.6). Returns `TABLE(mode text, confidence text, reason text)`. When confidence is LOW (< 10 history rows), emits a fallback with `mode = 'AUTO'` and a reason explaining insufficient data. Integrates with `explain_st()` output.

Verify: call on an ST with ≥ 20 history cycles; assert `mode ∈ {'DIFFERENTIAL','FULL','AUTO'}` and `confidence ∈ {'HIGH','MEDIUM','LOW'}`. Dependencies: DF-A2. Schema change: No.
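The fallback logic above can be sketched in Rust. The 10-row LOW cut-off follows the item text; the 20-row HIGH cut-off and the cost comparison are assumptions about how DF-3 ranks modes, not the shipped implementation:

```rust
/// Illustrative sketch of the OPS-1 recommendation shape.
/// Returns (mode, confidence, reason).
pub fn recommend_refresh_mode(
    history_rows: u32,
    avg_diff_ms: f64,
    avg_full_ms: f64,
) -> (&'static str, &'static str, String) {
    // Fewer than 10 history rows: LOW confidence, fall back to AUTO.
    if history_rows < 10 {
        return (
            "AUTO",
            "LOW",
            format!("insufficient data: {history_rows} history rows (< 10)"),
        );
    }
    // Assumed cut-off: 20+ rows earns HIGH confidence.
    let confidence = if history_rows >= 20 { "HIGH" } else { "MEDIUM" };
    if avg_diff_ms < avg_full_ms {
        ("DIFFERENTIAL", confidence, "DIFF cheaper than FULL on recent history".to_string())
    } else {
        ("FULL", confidence, "FULL cheaper than DIFF on recent history".to_string())
    }
}
```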
OPS-2 — check_cdc_health() spill-risk enrichment
Currently `check_cdc_health()` performs full-table scans to detect anomalies. When DF-C1 is active, query the `df_cdc_buffer_trends` growth rate instead. Emit a `spill_risk = 'IMMINENT'` row when the 1-cycle growth rate extrapolated 2 cycles ahead exceeds `spill_threshold_blocks`. Falls back to a full scan when self-monitoring is not set up.

Verify: inject 80% of `spill_threshold_blocks` worth of buffer rows with a steep growth rate; assert `check_cdc_health()` returns a spill-risk alert. Dependencies: DF-C1. Schema change: No.
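The two-cycle extrapolation reduces to a small arithmetic check. A sketch assuming linear per-cycle growth; the function shape and names are illustrative, not the shipped check_cdc_health() code:

```rust
/// Illustrative OPS-2 spill-risk check: project the current buffer size
/// two cycles ahead at the observed per-cycle growth rate and flag
/// IMMINENT if the projection breaches the spill threshold.
pub fn spill_risk(
    current_blocks: i64,
    growth_blocks_per_cycle: i64,
    spill_threshold_blocks: i64,
) -> &'static str {
    // A shrinking buffer (negative growth) never projects upward.
    let projected = current_blocks + 2 * growth_blocks_per_cycle.max(0);
    if projected > spill_threshold_blocks { "IMMINENT" } else { "OK" }
}
```

For example, a buffer at 80% of the threshold with steep growth (the Verify scenario above) projects past the threshold within two cycles.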
OPS-3 — pgtrickle.scheduler_overhead() diagnostic function
Returns a snapshot of scheduler efficiency: `scheduler_busy_ratio` (fraction of wall-clock time spent executing refreshes), `queue_depth` (STs waiting to be dispatched), `avg_dispatch_latency_ms`, and `df_refresh_fraction` (fraction of busy time attributable to DF STs). This makes PERF-3's < 1% CPU target observable in production without custom monitoring.

Verify: function returns non-NULL values after 5+ refresh cycles; assert `df_refresh_fraction < 0.01` in the soak-test context. Dependencies: DF-D4. Schema change: No (new function only).
OPS-4 — pgtrickle.explain_dag() — Mermaid / DOT graph output
Returns the full refresh DAG as a Mermaid markdown string (default) or Graphviz DOT (via the `format => 'dot'` argument). Node labels show ST name, current mode, and refresh interval. Node colours: user STs = blue, self-monitoring STs = green, suspended = red, fused = orange. Edges show dependency direction. Validates that DF-1 → DF-2 → DF-3 ordering is correct post-setup.

Verify: `SELECT pgtrickle.explain_dag()` after `setup_self_monitoring()` returns a string containing all five `df_` nodes in green with correct edges. Dependencies: None. Schema change: No (new function only).
OPS-5 — sql/self_monitoring_setup.sql quick-start template
A standalone SQL script in `sql/` that an operator can run with `psql -f sql/self_monitoring_setup.sql`. Contents: calls `setup_self_monitoring()`, sets `pg_trickle.self_monitoring_auto_apply = 'threshold_only'`, runs `LISTEN pg_trickle_alert`, queries `self_monitoring_status()` for a status summary, and queries `df_threshold_advice` for initial recommendations with a warm-up note. Referenced from the GETTING_STARTED.md Day 2 operations section (UX-4).

Verify: script executes without errors on a fresh install; produces visible output showing 5 active DF STs. Dependencies: DF-F4, DF-G1, UX-4. Schema change: No.
OPS-6 — Workload-aware poll intervals via DF-5 signal
Currently
compute_adaptive_poll_ms()uses pure exponential backoff that reacts to contention only after it occurs. Replace this with a pre-emptive signal: after each scheduler tick, read the latestoverlap_countfromdf_scheduling_interference; ifoverlap_count >= 2, increase the dispatch interval for the next tick by 20% before dispatching (capped atpg_trickle.max_poll_interval_ms). This closes the self-monitoring feedback loop by letting the analytics directly influence scheduling policy, reducing contention on write-heavy deployments without waiting for timeouts.Verify: soak test with known-contending STs shows lower
overlap_countin DF-5 with signal enabled vs disabled.scheduler_overhead()shows reduced busy-time ratio. Dependencies: DF-C2, OPS-3. Schema change: No.
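The widening rule above can be sketched in a few lines of Rust. The 20% step, the `overlap_count >= 2` trigger, and the cap follow the item text; the function name is illustrative:

```rust
/// Illustrative OPS-6 pre-emptive widening: when DF-5 reports two or
/// more overlapping refreshes, widen the next dispatch interval by 20%,
/// capped at pg_trickle.max_poll_interval_ms.
pub fn next_poll_interval_ms(
    current_ms: u64,
    overlap_count: u32,
    max_poll_interval_ms: u64,
) -> u64 {
    if overlap_count >= 2 {
        // Widen by 20%, never past the configured ceiling.
        ((current_ms as f64 * 1.2) as u64).min(max_poll_interval_ms)
    } else {
        // No contention signal: leave the interval alone.
        current_ms
    }
}
```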
DASH-1 — Grafana Self-Monitoring Dashboard
Add
monitoring/grafana/dashboards/pg_trickle_self_monitoring.jsonalongside the existingpg_trickle_overview.json. Five panels: (1) Refresh throughput timeline (DF-1avg_diff_msover time), (2) Anomaly heatmap (DF-2 per-ST anomaly type grid), (3) Threshold calibration scatter (DF-3 current vs recommended threshold), (4) CDC buffer growth sparklines (DF-4 per-source growth rate), (5) Interference matrix (DF-5 overlap heatmap). Provisioned automatically inmonitoring/grafana/provisioning/.Verify:
docker compose upinmonitoring/loads both dashboards; all five panels resolve withoutNo dataerrors using the postgres-exporter queries. Dependencies: DF-F2, DF-A1, DF-A2, DF-C1, DF-C2. Schema change: No.
DBT-1 — pgtrickle_enable_monitoring dbt post-hook macro
Add a
pgtrickle_enable_monitoringmacro todbt-pgtrickle/macros/that calls{{ pgtrickle.setup_self_monitoring() }}and emits alog()message confirming activation. Documented indbt-pgtrickle/README.md. Users add+post-hook: "{{ pgtrickle_enable_monitoring() }}"todbt_project.ymlto auto-enable monitoring after anydbt run. Idempotent — safe to call on every run becausesetup_self_monitoring()is already idempotent (STAB-1).Verify:
just test-dbtincludes a test case that runs the macro twice; assertsself_monitoring_status()shows 5 active STs after both calls. Dependencies: DF-F4, STAB-1. Schema change: No.
Documentation & Safety
| Item | Description | Effort | Ref |
|---|---|---|---|
| DF-D1 | SQL_REFERENCE.md: self-monitoring quick start. Document setup_self_monitoring(), teardown_self_monitoring(), all five df_* stream tables, and the auto-apply GUC. | 2–4h | — |
| DF-D2 | CONFIGURATION.md: pg_trickle.self_monitoring_auto_apply GUC. | 1h | — |
| DF-D3 | E2E test: control plane survives DF ST suspension. Drop or suspend all df_* STs; verify the scheduler and refresh logic operate identically. | 2–4h | PLAN_SELF_MONITORING.md §8 |
| DF-D4 | Soak test addition. Add self-monitoring STs to the existing soak test; verify no memory growth or scheduler stalls under 1-hour sustained load. | 2–4h | PLAN_SELF_MONITORING.md §8 |
Correctness
| ID | Title | Effort | Priority |
|---|---|---|---|
| CORR-1 | df_threshold_advice output always within [0.01, 0.80] | S | P0 |
| CORR-2 | DF-2 suppresses false-positive spike on first-ever refresh | S | P0 |
| CORR-3 | avg_change_ratio never NaN/Inf on zero-delta streams | S | P0 |
| CORR-4 | CDC INSERT-only invariant verified on pgt_refresh_history | XS | P1 |
| CORR-5 | DF-1 historical window boundary is exclusive, not inclusive | XS | P1 |
CORR-1 — df_threshold_advice output always within [0.01, 0.80]
The `LEAST(0.80, GREATEST(0.01, …))` expression in DF-3 must hold for all input combinations, including NULL `avg_diff_ms`, zero `avg_full_ms`, and extreme ratios. Add a property-based test (proptest) that generates random `(avg_diff_ms, avg_full_ms, current_threshold)` triples and asserts the output is always in the valid range. Any value outside [0.01, 0.80] that reaches auto-apply would corrupt stream table configuration.

Verify: proptest with 10,000 iterations; zero out-of-range results. Dependencies: DF-A2. Schema change: No.
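A deterministic stand-in for the property check, assuming the DF-3 expression reduces to a ratio-scaled current threshold with NULLs modeled as Option (the exact CASE arms live in the plan and are not reproduced here):

```rust
/// Illustrative DF-3 clamp: LEAST(0.80, GREATEST(0.01, raw)), with NULL
/// inputs or a zero FULL baseline falling back to the current threshold.
pub fn clamp_threshold(
    avg_diff_ms: Option<f64>,
    avg_full_ms: Option<f64>,
    current: f64,
) -> f64 {
    let raw = match (avg_diff_ms, avg_full_ms) {
        // Both sides known and FULL baseline non-zero: scale by the ratio.
        (Some(d), Some(f)) if f > 0.0 => current * (d / f),
        // NULL input or zero divisor: keep the current threshold.
        _ => current,
    };
    // Clamp to [0.01, 0.80]; a NaN (should be unreachable) coerces to the floor.
    if raw.is_nan() { 0.01 } else { raw.max(0.01).min(0.80) }
}
```

A cheap sweep over generated triples plays the role of the proptest harness described above.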
CORR-2 — DF-2 suppresses false-positive spike on first-ever refresh
`df_anomaly_signals` compares `latest.duration_ms` against `eff.avg_diff_ms`. On the very first refresh of a stream table there is no rolling average yet (`eff.avg_diff_ms IS NULL`), so the `CASE WHEN` would produce no anomaly. Confirm the LATERAL subquery returns NULL (not 0) when history is empty, and that the `CASE` guard is `> 3.0 * NULLIF(eff.avg_diff_ms, 0)` so a NULL baseline never triggers a spike.

Verify: E2E test creating a brand-new ST; assert `duration_anomaly IS NULL` on the first DF-2 refresh. Dependencies: DF-A1. Schema change: No.
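The NULL-baseline guard, modeled with Option in Rust; this is a sketch of the semantics (the 3.0× factor follows the item text), not the SQL itself:

```rust
/// Illustrative CORR-2 guard: a spike fires only when a non-NULL,
/// non-zero baseline exists and the latest duration exceeds 3x it.
pub fn duration_anomaly(
    latest_duration_ms: f64,
    baseline_avg_diff_ms: Option<f64>,
) -> Option<&'static str> {
    match baseline_avg_diff_ms {
        Some(avg) if avg > 0.0 && latest_duration_ms > 3.0 * avg => Some("DURATION_SPIKE"),
        // No baseline yet (first-ever refresh), zero baseline, or within range.
        _ => None,
    }
}
```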
CORR-3 — avg_change_ratio never NaN/Inf on zero-delta streams
DF-1 computes `avg(h.delta_row_count::float / NULLIF(h.rows_inserted + h.rows_deleted, 0))`. If a stream table runs only FULL refreshes (no DIFF cycles), the divisor is always NULL and `avg()` returns NULL — correct. But if DIFF runs with exactly zero rows inserted and zero deleted (the CDC buffer was empty), `NULLIF` must prevent a divide-by-zero NaN. Verify the guard holds and that `avg_change_ratio` is either a valid float in [0, 1] or NULL.

Verify: E2E test triggering a DIFF refresh on a quiescent source; assert `avg_change_ratio IS NULL OR avg_change_ratio BETWEEN 0 AND 1`. Dependencies: DF-F2. Schema change: No.
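The NULLIF guard modeled in Rust, where None plays the role of SQL NULL (a sketch of the invariant, not the DF-1 query):

```rust
/// Illustrative CORR-3 guard: a zero-row delta yields None (SQL NULL),
/// never NaN or Inf, because the division is gated on a non-zero divisor.
pub fn change_ratio(
    delta_row_count: i64,
    rows_inserted: i64,
    rows_deleted: i64,
) -> Option<f64> {
    let denom = rows_inserted + rows_deleted;
    if denom == 0 {
        None // NULLIF(denom, 0) -> NULL; avg() then skips the row
    } else {
        Some(delta_row_count as f64 / denom as f64)
    }
}
```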
CORR-4 — CDC INSERT-only invariant verified on pgt_refresh_history
`pgt_refresh_history` is semantically append-only: rows are only ever INSERTed (one per refresh). The CDC trigger installed by DF-F1 must be an INSERT-only trigger (no UPDATE/DELETE triggers). If the trigger were registered as `FOR EACH ROW AFTER INSERT OR UPDATE`, a future catalog UPDATE would generate spurious change-buffer rows and corrupt DF-1 aggregates. Inspect `pg_trigger` to confirm only an `INSERT` trigger exists.

Verify: `SELECT tgtype FROM pg_trigger WHERE tgrelid = 'pgtrickle.pgt_refresh_history'::regclass` returns only INSERT-event triggers. Dependencies: DF-F1. Schema change: No.
CORR-5 — DF-1 historical window boundary is exclusive, not inclusive
The `WHERE h.start_time > now() - interval '1 hour'` clause uses a strict `>` comparison. This ensures a row with `start_time` exactly equal to the boundary is excluded on each pass, preventing double-counting in rolling aggregates. Confirm the query plan uses the index on `(pgt_id, start_time)` (see PERF-1) and that the boundary is consistent across DF-1, DF-2, and DF-4 (all use the same 1-hour lookback).

Verify: unit test comparing aggregate output with a row at the exact boundary; assert it is excluded. Dependencies: DF-F2. Schema change: No.
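The exclusive-boundary semantics in miniature, with timestamps modeled as integer milliseconds (an illustrative sketch, not the DF-1 query):

```rust
/// Illustrative CORR-5 window filter: a strict `>` excludes a row whose
/// start_time equals the window boundary exactly.
pub fn rows_in_window(start_times_ms: &[i64], now_ms: i64, window_ms: i64) -> usize {
    let boundary = now_ms - window_ms;
    // Strict comparison: the boundary row itself is NOT counted.
    start_times_ms.iter().filter(|&&t| t > boundary).count()
}
```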
Stability
| ID | Title | Effort | Priority |
|---|---|---|---|
| STAB-1 | setup_self_monitoring() is fully idempotent | S | P0 |
| STAB-2 | Auto-apply handles ALTER STREAM TABLE failure gracefully | S | P0 |
| STAB-3 | DF STs survive DROP EXTENSION + CREATE EXTENSION cycle | S | P1 |
| STAB-4 | Auto-apply worker checks ST still exists before applying | XS | P1 |
| STAB-5 | teardown_self_monitoring() is safe when some DF STs already removed | XS | P1 |
STAB-1 — setup_self_monitoring() is fully idempotent
Calling `setup_self_monitoring()` a second time while DF STs already exist must not raise an error. Use `IF NOT EXISTS` semantics internally (or check the catalog before creating). The function must also be safe to call concurrently from two sessions. Idempotency is critical for upgrade scripts and Terraform-style declarative deployment workflows.

Verify: call `setup_self_monitoring()` three times in a row; no errors, no duplicate stream tables. Dependencies: DF-F4. Schema change: No.
STAB-2 — Auto-apply handles ALTER STREAM TABLE failure gracefully
The auto-apply post-tick hook reads `df_threshold_advice` and issues `ALTER STREAM TABLE … SET auto_threshold = <recommended>`. If the stream table was dropped between the advice read and the apply (a TOCTOU race), the ALTER will error. Catch SQL errors in the post-tick hook with an appropriate `match` on `PgTrickleError` and log a WARNING rather than crashing the background worker.

Verify: unit test with a mocked `ALTER` that returns `ERROR: relation does not exist`; assert the worker logs a warning and continues to the next advice row. Dependencies: DF-G2. Schema change: No.
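The warn-and-continue shape of the hook, sketched with a pluggable apply callback standing in for the real ALTER execution (function and parameter names are illustrative):

```rust
/// Illustrative STAB-2 loop: apply each advice row through a fallible
/// callback; on error, log a warning and keep going instead of
/// crashing the worker. Returns (applied, warned) counts.
pub fn apply_advice_rows<F>(advice: &[(&str, f64)], mut apply: F) -> (usize, usize)
where
    F: FnMut(&str, f64) -> Result<(), String>,
{
    let (mut applied, mut warned) = (0, 0);
    for &(st, threshold) in advice {
        match apply(st, threshold) {
            Ok(()) => applied += 1,
            Err(e) => {
                // In the extension this would surface as a Postgres WARNING.
                eprintln!("WARNING: auto-apply skipped for {st}: {e}");
                warned += 1;
            }
        }
    }
    (applied, warned)
}
```

A dropped ST mid-iteration thus costs one warning, not a worker restart.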
STAB-3 — DF STs survive DROP EXTENSION + CREATE EXTENSION cycle
`DROP EXTENSION pg_trickle CASCADE` drops all extension-owned objects. After `CREATE EXTENSION pg_trickle`, `setup_self_monitoring()` should recreate the DF STs cleanly. There must be no leftover triggers, orphaned change-buffer tables, or stale catalog rows from the previous installation. This is the most likely failure mode after an emergency rollback + reinstall.

Verify: E2E test: `setup_self_monitoring()` → `DROP EXTENSION CASCADE` → `CREATE EXTENSION` → `setup_self_monitoring()` → insert history → refresh DF-1; assert correct aggregates. Dependencies: DF-F4, DF-F5. Schema change: No.
STAB-4 — Auto-apply worker checks ST still exists before applying
Before issuing `ALTER STREAM TABLE`, the worker should confirm the ST is still in `pgt_stream_tables` and is not in SUSPENDED or FUSED state. Applying a threshold change to a SUSPENDED ST is harmless but wasteful; applying to a FUSED ST is wrong (the fuse exists for a reason). Add a pre-apply guard in the Rust post-tick hook.

Verify: E2E test suspending an ST manually while auto-apply is enabled; assert no threshold change is applied to the suspended stream table. Dependencies: DF-G2. Schema change: No.
STAB-5 — teardown_self_monitoring() is safe when some DF STs already removed
If a user manually drops `df_anomaly_signals` before calling `teardown_self_monitoring()`, the teardown function must not error on `DROP STREAM TABLE df_anomaly_signals`. Use `drop_stream_table(name, if_exists => true)` semantics for each DF table in the teardown. Otherwise a partial teardown leaves the system in an inconsistent state.

Verify: drop two DF STs manually, then call `teardown_self_monitoring()`; assert no errors and the remaining DF STs are gone. Dependencies: DF-F5. Schema change: No.
Performance
| ID | Title | Effort | Priority |
|---|---|---|---|
| PERF-1 | Index on pgt_refresh_history(pgt_id, start_time) for DF queries | XS | P0 |
| PERF-2 | Benchmark DF-1 vs refresh_efficiency() on 10 K history rows | S | P0 |
| PERF-3 | Dog-feeding scheduler overhead target: < 1% of total CPU | S | P1 |
| PERF-4 | DF-5 self-join uses bounded index scan, not seq-scan | S | P1 |
| PERF-5 | History pruning batch-DELETE with short transactions (no CDC lock contention) | S | P1 |
| PERF-6 | Columnar change tracking Phase 1 — CDC bitmask (deferred from v0.17/v0.18) | M | P1 |
PERF-1 — Index on pgt_refresh_history(pgt_id, start_time) for DF queries
All five DF stream tables filter `pgt_refresh_history` on `(pgt_id, start_time)`. Without a composite index on these columns, the rolling-window WHERE clause forces a sequential scan of the growing history table. Verify the index was created during extension install (check the upgrade migration); if missing, add it as part of the 0.19.0 → 0.20.0 migration script.

Verify: `EXPLAIN (FORMAT TEXT) SELECT … FROM pgtrickle.pgt_refresh_history WHERE pgt_id = 1 AND start_time > now() - interval '1 hour'` shows an index scan. Schema change: Yes (index addition in migration script).
PERF-2 — Benchmark DF-1 vs refresh_efficiency() on 10 K history rows
The primary performance claim of self-monitoring is that a maintained DIFFERENTIAL stream table is cheaper than scanning the full history table on every diagnostic call. Establish a Criterion micro-benchmark that seeds 10 K history rows, then compares (a) a full `SELECT * FROM pgtrickle.refresh_efficiency()` call vs (b) a `SELECT * FROM pgtrickle.df_efficiency_rolling` read after one incremental refresh. The benchmark documents the win concretely.

Verify: Criterion benchmark shows the DF-1 read is at least 5× faster than `refresh_efficiency()` at 10 K rows. Included in `benches/` and run in CI. Dependencies: DF-F2. Schema change: No.
PERF-3 — Dog-feeding scheduler overhead target: < 1% of total CPU
Five DF STs at 48–96 s schedules add background refresh work. Under a realistic load (20 user STs, 10 K history rows), the total time spent refreshing DF STs should be < 1% of total scheduler CPU. Measure in the E2E soak test by comparing scheduler-loop busy time with and without DF STs. If overhead exceeds 1%, relax schedules to 120 s or move DF STs to `refresh_tier = 'cold'`.

Verify: soak test reports DF refresh overhead as a fraction of total scheduler CPU; assert < 1%. Dependencies: DF-D4. Schema change: No.
PERF-4 — DF-5 self-join uses bounded index scan, not seq-scan
`df_scheduling_interference` joins `pgt_refresh_history` to itself on an overlap condition with a 1-hour bound. Without the index from PERF-1 this double scan is O(N²) in history rows. Verify EXPLAIN shows nested-loop index scans (not a hash or merge join over the full table) for both sides of the self-join. If the planner chooses a seq-scan, set `enable_seqscan = off` for the DF-5 query or restructure with a CTE.

Verify: EXPLAIN of the DF-5 query shows index scans on both sides of the JOIN. Dependencies: PERF-1, DF-C2. Schema change: No.
PERF-5 — History pruning batch-DELETE with short transactions
`pg_trickle.history_retention_days` cleanup (shipped in v0.19.0) currently deletes rows in a single long transaction. Under self-monitoring, that transaction holds a lock on `pgt_refresh_history` that can delay CDC trigger INSERTs. Rewrite the purge as batched DELETEs: delete at most 500 rows per transaction, commit between batches, and sleep 50 ms between batches. The index from PERF-1 ensures each batch is an index-range scan, not a seq-scan.

Verify: soak test running the history purge concurrently with DF CDC trigger INSERTs; no lock-wait timeout observed. Batch size configurable via the `pg_trickle.history_purge_batch_size` GUC (default 500). Dependencies: PERF-1. Schema change: No.
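The batch loop can be sketched with a callback standing in for one short-transaction indexed DELETE; the 50 ms sleep between batches is elided here because the batch accounting is the point:

```rust
/// Illustrative PERF-5 purge loop: delete at most `batch_size` rows per
/// short transaction (one `delete_batch` call), committing between
/// batches, until a partial batch signals nothing is left.
/// `delete_batch(limit)` returns the number of rows it removed.
pub fn purge_in_batches<F>(batch_size: usize, mut delete_batch: F) -> usize
where
    F: FnMut(usize) -> usize,
{
    let mut total = 0;
    loop {
        let n = delete_batch(batch_size); // DELETE ... LIMIT batch_size; COMMIT
        total += n;
        if n < batch_size {
            break; // last, partial batch: nothing left to purge
        }
        // The real worker sleeps ~50 ms here to yield locks to CDC INSERTs.
    }
    total
}
```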
PERF-6 — Columnar change tracking Phase 1 — CDC bitmask
Deferred from v0.17.0 (twice) and v0.18.0. Self-monitoring now provides concrete internal workload data that justifies the schema change. Phase 1 only: compute the `changed_columns` bitmask (`old.col IS DISTINCT FROM new.col`) in the CDC trigger for UPDATE rows; store as `int8` in the change buffer. Phase 2 (delta-scan filtering using the bitmask) is deferred to v0.22.0. Gate behind the `pg_trickle.columnar_tracking` GUC (default `off`). This is the foundation for a 50–90% delta-volume reduction on wide-table UPDATE workloads.

Verify: UPDATE a 20-column row, changing 2 columns; assert the `changed_columns` bitmask has exactly 2 bits set. `just check-upgrade-all` passes. Dependencies: None. Schema change: Yes (change-buffer schema addition + migration script).
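The Phase 1 bitmask computation in miniature. Columns are modeled as `Option<&str>` so the NULL-aware `IS DISTINCT FROM` maps onto Option equality; this is an illustrative sketch, not the trigger code:

```rust
/// Illustrative PERF-6 Phase 1: one bit per column position, set when
/// old IS DISTINCT FROM new, packed into an i64 like the int8
/// change-buffer column. Up to 64 columns fit in the mask.
pub fn changed_columns_bitmask(old: &[Option<&str>], new: &[Option<&str>]) -> i64 {
    let mut mask: i64 = 0;
    for (i, (o, n)) in old.iter().zip(new.iter()).enumerate().take(64) {
        // IS DISTINCT FROM: NULL vs non-NULL counts as changed,
        // NULL vs NULL does not — Option equality matches this exactly.
        if o != n {
            mask |= 1i64 << i;
        }
    }
    mask
}
```

The Verify scenario above (20 columns, 2 changed) then corresponds to a mask with exactly two bits set.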
Scalability
| ID | Title | Effort | Priority |
|---|---|---|---|
| SCAL-1 | DF STs refresh within window at 100 user stream tables | S | P1 |
| SCAL-2 | pgt_refresh_history retention interacts correctly with self-monitoring | S | P1 |
| SCAL-3 | 1-hour rolling window doesn’t over-aggregate when history is sparse | XS | P2 |
SCAL-1 — DF STs refresh within window at 100 user stream tables
With 100 user STs generating up to 100 history rows per 48 s window, DF-1 processes up to ~7,500 rows/hour. Verify that the DIFFERENTIAL refresh of DF-1 completes within its 48 s schedule interval at this load, leaving margin for DF-2 and DF-3. If DF-1 duration exceeds 10 s, investigate query plan and index usage. Run as part of the soak-test at high table count.
Verify: soak test with 100 STs; DF-1 refresh duration < 10 s throughout. Dependencies: PERF-1. Schema change: No.
SCAL-2 — pgt_refresh_history retention interacts correctly with self-monitoring
`pg_trickle.history_retention_days` (shipped in v0.19.0, default 90 days) purges old history rows. DF-1 only looks back 1 hour, so retention does not affect correctness. However, the purge job must not hold a long-running lock that delays CDC trigger firing on concurrent INSERTs into the history table. Verify that the cleanup job uses a `DELETE … RETURNING` batch strategy with short transactions to avoid blocking DF CDC triggers.

Verify: E2E test running the history purge job while DF-1 is being refreshed; no lock-wait timeout, no CDC trigger delay. Dependencies: DF-F1. Schema change: No.
SCAL-3 — 1-hour rolling window doesn’t over-aggregate when history is sparse
For a stream table that refreshes every 30 minutes (2 refreshes/hour), the DF-1 1-hour window contains at most 2 rows. The `AVG()` aggregate is still meaningful, but `percentile_cont(0.95)` over 2 rows is misleading. Document the minimum sample size (in the `confidence` column of DF-3) and add a note in SQL_REFERENCE.md that DF stats are most meaningful for STs refreshing every 60 s or faster.

Verify: SQL_REFERENCE.md updated; `confidence = 'LOW'` for STs with `total_refreshes < 10`. Dependencies: DF-A2. Schema change: No.
Ease of Use
| ID | Title | Effort | Priority |
|---|---|---|---|
| UX-1 | pgtrickle.self_monitoring_status() diagnostic function | S | P0 |
| UX-2 | setup_self_monitoring() warm-up hint when history is sparse | XS | P1 |
| UX-3 | NOTIFY on anomaly via pg_trickle_alert channel | S | P1 |
| UX-4 | GETTING_STARTED.md: “Day 2 operations” section | S | P1 |
| UX-5 | explain_st() shows if a DF ST covers the queried stream table | XS | P2 |
| UX-6 | recommend_refresh_mode() exposed in explain_st() JSON output | XS | P2 |
| UX-7 | scheduler_overhead() output included in TUI diagnostics panel | XS | P2 |
| UX-8 | df_threshold_advice extended with SLA headroom column | S | P2 |
UX-1 — pgtrickle.self_monitoring_status() diagnostic function
A single-query overview of the self-monitoring analytics plane: name, last refresh timestamp, row count, and whether the DF ST is ACTIVE / SUSPENDED / NOT_CREATED. Calling this function is the first thing an operator should run to check that self-monitoring is working. Return type: `TABLE(df_name text, status text, last_refresh timestamptz, row_count bigint, note text)`.

Verify: function returns 5 rows when all DF STs are active; returns rows with `status = 'NOT_CREATED'` when `setup_self_monitoring()` has not been called. Schema change: No (new function only).
UX-2 — setup_self_monitoring() warm-up hint when history is sparse
If `pgt_refresh_history` has fewer than 50 rows when `setup_self_monitoring()` is called, emit a NOTICE: "Dog-feeding stream tables created. DF analytics will populate as refresh history accumulates (currently N rows; recommend ≥ 50 before consulting df_threshold_advice)." This prevents operators from acting on meaningless LOW-confidence advice immediately after setup.

Verify: call `setup_self_monitoring()` on a fresh install; assert the NOTICE contains the row count and the ≥ 50 recommendation. Dependencies: DF-F4. Schema change: No.
UX-3 — NOTIFY on anomaly via pg_trickle_alert channel
When `df_anomaly_signals` detects `duration_anomaly IS NOT NULL` or `recent_failures >= 2` after a refresh, emit a `pg_notify('pg_trickle_alert', payload::text)` with `event = 'self_monitor_anomaly'`, the stream table name, anomaly type, last duration, baseline, and a plain-English recommendation. This integrates with existing alert pipelines without requiring a new channel. Fires from a post-refresh trigger on `df_anomaly_signals` or from the auto-apply post-tick hook.

Verify: E2E test LISTENing on `pg_trickle_alert`; inject a 3× duration spike; assert the NOTIFY payload arrives with the correct anomaly type. Dependencies: DF-A1. Schema change: No.
UX-4 — GETTING_STARTED.md: “Day 2 operations” section
Add a new section to `docs/GETTING_STARTED.md` covering the first steps after initial deployment: (1) enable self-monitoring with `setup_self_monitoring()`, (2) check status with `self_monitoring_status()`, (3) query `df_threshold_advice` to tune thresholds, (4) set up anomaly alerting via LISTEN. This gives new users a clear post-install checklist and demonstrates the self-monitoring value proposition immediately.

Verify: documentation PR reviewed; code examples in GETTING_STARTED.md execute without modification. Dependencies: UX-1, UX-2. Schema change: No.
UX-5 — explain_st() shows if a DF ST covers the queried stream table
When a user calls `pgtrickle.explain_st('my_table')`, append a line "Dog-feeding coverage: df_efficiency_rolling ✓, df_threshold_advice ✓" (or "Not set up — run setup_self_monitoring()") to the output. This surfaces the analytics plane to users who might not know self-monitoring exists, without requiring a separate function call.

Verify: `SELECT explain_st('any_table')` output includes a `self_monitoring` field in the JSON output. Dependencies: UX-1. Schema change: No.
UX-8 — df_threshold_advice extended with SLA headroom column
Extend the DF-3 defining query to include a computed `sla_headroom_ms` column: `freshness_deadline_ms - avg_diff_ms` from `pgt_refresh_history`. When `sla_headroom_ms < 0`, add a boolean `sla_breach_risk = true` flag so operators can see at a glance which STs risk missing their freshness SLA on the next DIFFERENTIAL cycle. The `freshness_deadline` column already exists in `pgt_refresh_history` (since v0.2.3). No schema change required.

Verify: create an ST with a tight `freshness_deadline`; run slow synthetic refreshes; assert `df_threshold_advice.sla_breach_risk = true`. Dependencies: DF-A2. Schema change: No (view column addition only).
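The derived columns reduce to one subtraction and a sign check. A sketch using the names from the item description (the tuple return is illustrative; the real columns live in the DF-3 view):

```rust
/// Illustrative UX-8 derivation: headroom is the freshness deadline
/// minus the rolling average DIFF duration; negative headroom flags a
/// breach risk. Returns (sla_headroom_ms, sla_breach_risk).
pub fn sla_headroom(freshness_deadline_ms: f64, avg_diff_ms: f64) -> (f64, bool) {
    let headroom = freshness_deadline_ms - avg_diff_ms;
    (headroom, headroom < 0.0)
}
```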
UX-6 — recommend_refresh_mode() exposed in explain_st() JSON output
`explain_st()` already shows self-monitoring coverage (UX-5). Extend its JSON output with a `recommended_mode` field reading from `df_threshold_advice` (OPS-1). If OPS-1 is not available (no DF setup), fall back to `null` with a `setup_self_monitoring()` hint. This keeps the single-function diagnostic surface comprehensive without requiring separate calls.

Verify: `SELECT explain_st('any_table')` JSON includes `recommended_mode` and `mode_confidence` fields. Dependencies: OPS-1. Schema change: No.
UX-7 — scheduler_overhead() output included in TUI diagnostics panel
The TUI (`pgtrickle-tui`) already shows refresh-latency sparklines and ST status. Add a diagnostics panel (toggle key `D`) showing the fields from `scheduler_overhead()`: busy ratio, queue depth, and DF fraction as a percentage. This gives operators observability without needing psql.

Verify: TUI diagnostics panel shows all three scheduler-overhead fields; `df_refresh_fraction` updates after each DF refresh cycle. Dependencies: OPS-3. Schema change: No.
Test Coverage
| ID | Title | Effort | Priority |
|---|---|---|---|
| TEST-1 | Property test: DF-3 recommended threshold always ∈ [0.01, 0.80] | S | P0 |
| TEST-2 | Light E2E: self-monitoring create/refresh/teardown full cycle | S | P0 |
| TEST-3 | Upgrade test: pgt_refresh_history rows survive 0.19.0 → 0.20.0 | S | P0 |
| TEST-4 | Regression test: DF STs absent from check_cdc_health() anomaly list | XS | P1 |
| TEST-5 | Stability test: self-monitoring under 1-h soak with 50 user STs | M | P1 |
| TEST-6 | Light E2E: setup_self_monitoring() idempotency (3× call) | XS | P1 |
TEST-1 — Property test: DF-3 recommended threshold always ∈ [0.01, 0.80]
Implements CORR-1 as a `proptest` unit test. Generate random `(avg_diff_ms: 0.0–100_000.0, avg_full_ms: 0.0–100_000.0, current: 0.01–0.80)` triples, compute the DF-3 CASE expression in Rust, assert output ∈ [0.01, 0.80]. Can be a pure Rust unit test in `src/refresh.rs` alongside the existing `compute_adaptive_threshold` tests — no database required. Verify: `just test-unit` passes; 10,000 proptest iterations with zero failures. Dependencies: CORR-1. Schema change: No.
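A dependency-free sketch of the property follows. The DF-3 CASE expression is not reproduced here, so `recommend_threshold` uses an assumed placeholder heuristic; only the clamp to [0.01, 0.80] is the property under test, and a tiny LCG stands in for `proptest`'s generators:

```rust
// Hypothetical stand-in for the DF-3 CASE expression: the real formula
// lives in the DF-3 view. Here we only model "adjust, then clamp".
fn recommend_threshold(avg_diff_ms: f64, avg_full_ms: f64, current: f64) -> f64 {
    // Assumed heuristic: scale the current threshold by the diff/full cost ratio.
    let ratio = if avg_full_ms > 0.0 { avg_diff_ms / avg_full_ms } else { 1.0 };
    let raw = current * ratio;
    // The property under test (CORR-1): output always lands in [0.01, 0.80].
    raw.clamp(0.01, 0.80)
}

fn main() {
    // Small linear congruential generator, so no proptest dependency is needed.
    let mut seed: u64 = 0x9E37_79B9_7F4A_7C15;
    let mut next = move || {
        seed = seed
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (seed >> 11) as f64 / (1u64 << 53) as f64 // uniform in [0, 1)
    };
    for _ in 0..10_000 {
        let avg_diff = next() * 100_000.0; // 0.0–100_000.0
        let avg_full = next() * 100_000.0; // 0.0–100_000.0
        let current = 0.01 + next() * 0.79; // 0.01–0.80
        let out = recommend_threshold(avg_diff, avg_full, current);
        assert!((0.01..=0.80).contains(&out), "threshold {out} out of bounds");
    }
    println!("10000 iterations: all outputs in [0.01, 0.80]");
}
```

In the real test the loop body would be a `proptest!` block and `recommend_threshold` would be the actual DF-3 expression ported to Rust.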
TEST-2 — Light E2E: self-monitoring create/refresh/teardown full cycle
A light E2E test (stock `postgres:18.3` container) that: (1) installs the extension, (2) creates 3 user STs, (3) runs 5 refresh cycles to populate history, (4) calls `setup_self_monitoring()`, (5) refreshes all DF STs once, (6) asserts `self_monitoring_status()` shows 5 active STs, (7) calls `teardown_self_monitoring()`, (8) asserts all DF STs are gone. Verify: test passes in `just test-light-e2e` with zero assertions failed. Schema change: No.
TEST-3 — Upgrade test: pgt_refresh_history rows survive 0.19.0 → 0.20.0
The 0.19.0 → 0.20.0 migration adds an index to `pgt_refresh_history` (PERF-1). The upgrade must not truncate, reorder, or modify existing history rows. Write an upgrade E2E test: deploy 0.19.0, run 10 refreshes, `ALTER EXTENSION pg_trickle UPDATE`, assert all 10 history rows are intact and the new index exists. Verify: upgrade E2E test passes; `SELECT count(*) FROM pgt_refresh_history` unchanged after upgrade. Schema change: Yes (index).
TEST-4 — Regression test: DF STs absent from check_cdc_health() anomaly list
`pgtrickle.check_cdc_health()` scans all stream tables for CDC anomalies. After `setup_self_monitoring()`, DF STs must not appear in the anomaly list just because they are refreshed at longer intervals (48–96 s). Their schedules must be recognised as intentionally relaxed, not “falling behind”. Verify: E2E test: `setup_self_monitoring()` → wait one full DF cycle → assert `check_cdc_health()` returns no anomalies for any `df_*` table. Dependencies: DF-F4. Schema change: No.
TEST-5 — Stability test: self-monitoring under 1-h soak with 50 user STs
Extends DF-D4. Runs 50 user STs + 5 DF STs for 1 hour under steady insert load (1 000 rows/min across all sources). Assertions: (a) all DF STs remain ACTIVE, (b) no OOM or background worker crash, (c) DF-1 avg refresh duration < 5 s throughout, (d) `pgtrickle.self_monitoring_status()` shows 5 active STs at end of run. Verify: soak test passes with all four assertions. Dependencies: DF-D4, SCAL-1. Schema change: No.
TEST-6 — Light E2E: setup_self_monitoring() idempotency (3× call)
Implements STAB-1 as a light E2E test. Call `setup_self_monitoring()` three consecutive times in the same session. Assert: no errors, exactly five `df_` stream tables in `pgt_stream_tables`, no duplicate triggers in `pg_trigger` for the history table. Verify: test passes in `just test-light-e2e`; `SELECT count(*) FROM pgtrickle.pgt_stream_tables WHERE pgt_name LIKE 'df_%'` returns 5 after all three calls. Dependencies: STAB-1. Schema change: No.
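The idempotency contract can be modelled in a few lines; this is a toy in-memory stand-in for the catalog (the real helper is SQL, and the set-based model is an assumption), showing why repeated calls cannot produce duplicates:

```rust
use std::collections::BTreeSet;

// The five DF stream tables created by setup_self_monitoring() (names from the plan).
const DF_TABLES: [&str; 5] = [
    "df_efficiency_rolling",
    "df_anomaly_signals",
    "df_threshold_advice",
    "df_cdc_buffer_trends",
    "df_scheduling_interference",
];

/// Toy model of the STAB-1 contract: re-running setup only creates
/// objects that do not already exist, so the catalog count is stable.
fn setup_self_monitoring(catalog: &mut BTreeSet<String>) -> usize {
    for name in DF_TABLES {
        catalog.insert(name.to_string()); // no-op when already present
    }
    catalog.len()
}

fn main() {
    let mut catalog = BTreeSet::new();
    // Three consecutive calls, as in TEST-6: the count stays at exactly five.
    for _ in 0..3 {
        assert_eq!(setup_self_monitoring(&mut catalog), 5);
    }
    println!("{} df_ tables after 3 calls", catalog.len());
}
```

The E2E test checks the same invariant against the real catalog, plus the absence of duplicate triggers in `pg_trigger`.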
Conflicts & Risks
- PERF-1 (index addition) requires a migration script change. Adding `CREATE INDEX CONCURRENTLY` to the 0.19.0 → 0.20.0 migration must be tested with `just check-upgrade-all`. `CONCURRENTLY` cannot run inside a transaction block — the migration must issue it outside the default single-transaction DDL wrapper.
- UX-3 (NOTIFY on anomaly) fires from a post-refresh path. If the `pg_notify()` call fails (e.g., payload too large), it must not roll back the DF-2 refresh. Wrap the notify in a `BEGIN … EXCEPTION WHEN OTHERS THEN NULL END` block, or fire it from a deferred trigger.
- STAB-3 (DROP EXTENSION cycle) requires DF STs to be extension-owned or cleanly unregistered. If DF STs are not extension-owned objects, `DROP EXTENSION CASCADE` will not drop them. Either register them as extension members or document that `teardown_self_monitoring()` must be called before `DROP EXTENSION`.
- TEST-5 (soak test) overlaps with the existing soak test in CI. Add it to the daily `stability-tests.yml` workflow rather than `ci.yml` to avoid extending PR CI time. Mark with `#[ignore]` and trigger via `just test-soak`.
- CORR-5 / PERF-4 interaction. The `start_time > now() - interval '1 hour'` boundary and the index depend on the planner choosing an index range scan. On very busy deployments where the cardinality estimate is off, the planner may prefer a seq-scan. Consider adding `SET enable_seqscan = off` inside the DF stream table queries if plan stability is a concern.
- PERF-6 (columnar tracking) is a schema change — deferred twice already. The `changed_columns` column addition to all change buffer tables requires a migration script. Gate strictly behind a `pg_trickle.columnar_tracking = off` default. If capacity is tight, PERF-6 can be cut from v0.20.0 without affecting any other item — it shares no code paths with the DF pipeline.
- OPS-2 (`check_cdc_health()` enrichment) has a fallback requirement. When `setup_self_monitoring()` has not been called, the function must fall back to the old full-scan path without error. Guard with a catalog check for `df_cdc_buffer_trends` existence before querying it.
- OPS-4 (`explain_dag()`) output size. At 100+ user STs the Mermaid output may exceed typical terminal width. Offer `format => 'dot'` and `limit => N` arguments to constrain output. Default to `format => 'mermaid'` with a `NOTICE` when the DAG has > 20 nodes.
- OPS-6 (workload-aware poll) writes to the scheduler hot path. The `compute_adaptive_poll_ms()` function is called on every scheduler tick. The DF-5 read must be a single O(1) catalog lookup (latest row only), not a full table scan. Guard with `ORDER BY collected_at DESC LIMIT 1`. If the DF-5 table does not exist (self-monitoring not set up), fall back to the old backoff logic without error.
- DASH-1 (Grafana) depends on postgres-exporter SQL queries. The dashboard panels use custom SQL collectors in the postgres-exporter config. Verify that `monitoring/docker-compose` already mounts the query config; if not, add a `pg_trickle_df_queries.yaml` collector file alongside the existing exporter config.
- DBT-1 macro idempotency. The `pgtrickle_enable_monitoring` macro calls `setup_self_monitoring()` on every `dbt run`. Document that this is intentionally safe (STAB-1) and adds < 5 ms overhead per run.
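The OPS-6 hot-path guard described above can be sketched in Rust. The `DfSnapshot` struct, its `overlap_count` field, and the direction of the adjustment are illustrative assumptions; the source only specifies the O(1) latest-row read and the error-free fallback:

```rust
/// Illustrative stand-in for one row of the DF-5 (df_scheduling_interference)
/// stream table; the `overlap_count` field is an assumption, not the real schema.
struct DfSnapshot {
    overlap_count: u32,
}

/// Sketch of the OPS-6 rule: the caller passes the latest DF-5 row
/// (fetched via `ORDER BY collected_at DESC LIMIT 1`), or `None` when
/// self-monitoring is not set up, in which case the old fixed backoff
/// applies without raising an error.
fn compute_adaptive_poll_ms(latest: Option<&DfSnapshot>, backoff_ms: u64) -> u64 {
    match latest {
        // Assumed direction: stretch the poll interval when recent
        // cycles overlapped, to spread refreshes apart.
        Some(s) if s.overlap_count > 0 => backoff_ms * 2,
        // No recent interference, or no DF-5 table: plain backoff.
        _ => backoff_ms,
    }
}

fn main() {
    let contended = DfSnapshot { overlap_count: 3 };
    println!("{}", compute_adaptive_poll_ms(Some(&contended), 1000)); // prints 2000
    println!("{}", compute_adaptive_poll_ms(None, 1000)); // prints 1000
}
```

Keeping the DF-5 read to one indexed row is what makes this safe to call on every scheduler tick.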
v0.20.0 total: ~3–4 weeks
Exit criteria:
- [x] DF-F1: pgt_refresh_history receives CDC INSERT triggers when create_stream_table() is called
- [x] DF-F2: df_efficiency_rolling created and refreshes correctly in DIFFERENTIAL mode
- [x] DF-F3: DF-1 output matches refresh_efficiency() results on synthetic history
- [x] DF-F4: setup_self_monitoring() creates all five df_* stream tables in one call
- [x] DF-F5: teardown_self_monitoring() drops all df_* tables cleanly with no orphaned triggers
- [x] DF-A1: df_anomaly_signals created and detects 3× duration spikes
- [x] DF-A2: df_threshold_advice provides HIGH-confidence recommendations after ≥ 20 refresh cycles
- [x] DF-A3: DAG ensures DF-1 refreshes before DF-2 and DF-3 in every scheduler tick
- [x] DF-C1: df_cdc_buffer_trends created (FULL or DIFFERENTIAL mode)
- [x] DF-C2: df_scheduling_interference detects overlapping concurrent refreshes
- [x] DF-G1: pg_trickle.self_monitoring_auto_apply GUC registered with default off
- [x] DF-G2: Auto-apply adjusts threshold with ≥ 1 confirmed change in E2E test
- [x] DF-G5: Rate limiting verified — no more than 1 change per ST per 10 minutes
- [x] DF-D3: Suspending all df_* STs does not affect control-plane operation
- [x] CORR-1: df_threshold_advice output always within [0.01, 0.80] (property test)
- [x] CORR-2: No false-positive DURATION_SPIKE on first-ever refresh of a new ST
- [x] CORR-3: avg_change_ratio is NULL or in [0, 1] for zero-delta sources
- [x] CORR-4: Only INSERT triggers (no UPDATE/DELETE) on pgt_refresh_history
- [x] STAB-1: setup_self_monitoring() called 3× produces no errors and no duplicates
- [x] STAB-2: Auto-apply worker logs WARNING (not panic) when ALTER target disappears
- [x] STAB-3: DROP EXTENSION + CREATE EXTENSION + setup_self_monitoring() cycle works cleanly
- [x] PERF-1: pgt_refresh_history(pgt_id, start_time) index exists and is used by DF queries
- [x] PERF-2: DF-1 read ≥ 5× faster than refresh_efficiency() at 10 K history rows
- [x] UX-1: pgtrickle.self_monitoring_status() returns correct status for all five DF STs
- [x] UX-2: setup_self_monitoring() emits warm-up NOTICE when history has < 50 rows
- [x] UX-3: pg_trickle_alert NOTIFY received within one DF cycle after a 3× duration spike
- [x] TEST-1: Proptest for DF-3 threshold bounds passes 10,000 iterations
- [x] TEST-2: Light E2E full cycle test passes
- [x] TEST-3: Upgrade E2E: history rows intact and index present after 0.19.0 → 0.20.0
- [x] TEST-4: check_cdc_health() reports no anomalies for df_* tables after setup
- [x] OPS-1: recommend_refresh_mode() returns mode ∈ {'DIFFERENTIAL','FULL','AUTO'} and confidence ∈ {'HIGH','MEDIUM','LOW'}
- [x] OPS-2: check_cdc_health() returns spill-risk alert when buffer growth rate extrapolates to breach threshold within 2 cycles
- [x] OPS-3: scheduler_overhead() returns non-NULL fields after ≥ 5 refresh cycles; df_refresh_fraction < 0.01 in soak test
- [x] OPS-4: explain_dag() output contains all five df_* nodes after setup_self_monitoring()
- [x] OPS-5: sql/self_monitoring_setup.sql executes without errors on a fresh install
- [x] PERF-5: Concurrent history purge + DF CDC INSERT produces no lock wait timeouts in soak test
- [x] PERF-6: changed_columns bitmask stored in change buffer for UPDATE rows when columnar_tracking = on (if included)
- [x] OPS-6: Soak test shows lower overlap_count in DF-5 with workload-aware poll enabled vs disabled
- [x] DASH-1: docker compose up in monitoring/ loads pg_trickle_self_monitoring dashboard; all 5 panels show data
- [x] DBT-1: pgtrickle_enable_monitoring macro runs twice without error; self_monitoring_status() shows 5 active STs after both calls
- [x] UX-8: df_threshold_advice.sla_breach_risk = true when avg_diff_ms > freshness_deadline_ms on synthetic data
- [x] Extension upgrade path tested (0.19.0 → 0.20.0)
- [x] just check-version-sync passes