Plain-language companion: v0.20.0.md

v0.20.0 — Self-Monitoring (pg_trickle Monitors Itself)

Status: Released (2026-04-15). 62 items total: 61 implemented, 1 skipped (PERF-6 — already shipped in v0.19.0).

Release Theme

This release implements self-monitoring: pg_trickle uses its own stream tables to maintain reactive analytics over its internal catalog and refresh-history tables. Five self-monitoring stream tables (df_efficiency_rolling, df_anomaly_signals, df_threshold_advice, df_cdc_buffer_trends, df_scheduling_interference) replace repeated full-scan diagnostic functions with continuously maintained incremental views, enable multi-cycle trend detection for threshold tuning, and surface anomalies reactively. An optional auto-apply policy layer can automatically adjust auto_threshold when confidence is high. This validates pg_trickle on its own non-trivial workload and demonstrates the incremental-analytics value proposition to users.

See plans/PLAN_SELF_MONITORING.md for the full design, architecture, and risk analysis.

Phase 1 — Foundation

| Item | Description | Effort | Ref |
| --- | --- | --- | --- |
| DF-F1 | Verify CDC on pgt_refresh_history. Confirm that create_stream_table() installs INSERT triggers on pgt_refresh_history. Fix schema-exclusion logic if the pgtrickle schema is skipped. | 2–4h | PLAN_SELF_MONITORING.md §7 Phase 1 |
| DF-F2 | Create df_efficiency_rolling (DF-1). Maintained rolling-window aggregates over pgt_refresh_history. Replaces refresh_efficiency() full scans. | 2–4h | PLAN_SELF_MONITORING.md §5 DF-1 |
| DF-F3 | E2E test: DF-1 output matches refresh_efficiency(). Insert synthetic history rows, refresh DF-1, assert aggregates agree. | 2–4h | PLAN_SELF_MONITORING.md §8 |
| DF-F4 | pgtrickle.setup_self_monitoring() helper. Single SQL call that creates all five df_* stream tables. | 2–4h | PLAN_SELF_MONITORING.md §7 Phase 4 |
| DF-F5 | pgtrickle.teardown_self_monitoring() helper. Drops all df_* stream tables cleanly. | 1h | PLAN_SELF_MONITORING.md §7 Phase 4 |

Phase 2 — Anomaly Detection

| Item | Description | Effort | Ref |
| --- | --- | --- | --- |
| DF-A1 | Create df_anomaly_signals (DF-2). Detects duration spikes, error bursts, and mode oscillation by comparing recent behavior against DF-1 baselines. | 3–5h | PLAN_SELF_MONITORING.md §5 DF-2 |
| DF-A2 | Create df_threshold_advice (DF-3). Multi-cycle threshold recommendation replacing the single-step compute_adaptive_threshold() convergence. | 3–5h | PLAN_SELF_MONITORING.md §5 DF-3 |
| DF-A3 | Verify DAG ordering. DF-1 refreshes before DF-2 and DF-3. | 1–2h | PLAN_SELF_MONITORING.md §7 Phase 2 |
| DF-A4 | E2E test: threshold spike detection. Inject synthetic history making DIFF consistently fast; assert DF-3 recommends raising the threshold. | 2–4h | PLAN_SELF_MONITORING.md §8 |
| DF-A5 | E2E test: anomaly duration spike. Inject a 3× duration spike; assert DF-2 detects it. | 2–4h | PLAN_SELF_MONITORING.md §8 |

Phase 3 — CDC Buffer & Interference

| Item | Description | Effort | Ref |
| --- | --- | --- | --- |
| DF-C1 | Create df_cdc_buffer_trends (DF-4). Tracks change-buffer growth rates per source table. May require pgtrickle.cdc_buffer_row_counts() helper for dynamic table names. | 4–8h | PLAN_SELF_MONITORING.md §5 DF-4 |
| DF-C2 | Create df_scheduling_interference (DF-5). Detects concurrent refresh overlap. FULL-refresh mode initially (bounded 1-hour window). | 3–5h | PLAN_SELF_MONITORING.md §5 DF-5 |
| DF-C3 | E2E test: scheduling overlap detection. Create 3 STs with overlapping schedules; verify DF-5 detects overlap. | 2–4h | PLAN_SELF_MONITORING.md §8 |

Phase 4 — GUC & Auto-Apply

| Item | Description | Effort | Ref |
| --- | --- | --- | --- |
| DF-G1 | pg_trickle.self_monitoring_auto_apply GUC. Values: off (default) / threshold_only / full. Registered in src/config.rs. | 1–2h | PLAN_SELF_MONITORING.md §6.2 |
| DF-G2 | Auto-apply worker (threshold_only). Post-tick hook reads df_threshold_advice; applies ALTER STREAM TABLE ... SET auto_threshold = <recommended> when confidence is HIGH and delta > 5%. Rate-limited to 1 change per ST per 10 minutes. (Sketched below.) | 4–8h | PLAN_SELF_MONITORING.md §7 Phase 5 |
| DF-G3 | initiated_by = 'SELF_MONITOR' audit trail. Log auto-apply changes to pgt_refresh_history. | 1–2h | PLAN_SELF_MONITORING.md §7 Phase 5 |
| DF-G4 | E2E test: auto-apply threshold. Enable threshold_only, inject history making DIFF consistently faster, verify threshold increases automatically. | 2–4h | PLAN_SELF_MONITORING.md §8 |
| DF-G5 | E2E test: rate limiting. Verify no more than 1 threshold change per ST per 10 minutes. | 1–2h | PLAN_SELF_MONITORING.md §8 |
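
A rough SQL illustration of the Phase 4 knobs. The stream table name sales_summary and the value 0.35 are invented for the example; the statement shapes follow the item descriptions above, not a verified API.

```sql
-- Enable threshold-only auto-apply (DF-G1); default is 'off'.
SET pg_trickle.self_monitoring_auto_apply = 'threshold_only';

-- What the post-tick hook effectively issues (DF-G2) when df_threshold_advice
-- reports HIGH confidence and a delta > 5% -- at most once per ST per 10 minutes:
ALTER STREAM TABLE sales_summary SET auto_threshold = 0.35;
```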

Phase 5 — Operational Diagnostics

| Item | Description | Effort | Ref |
| --- | --- | --- | --- |
| OPS-1 | pgtrickle.recommend_refresh_mode(st_name). Reads df_threshold_advice to return a structured recommendation { mode, confidence, reason } rather than computing on demand. | 2–4h | PLAN_SELF_MONITORING.md §10.6 |
| OPS-2 | check_cdc_health() spill-risk enrichment. Query df_cdc_buffer_trends growth rate; emit a spill_risk alert when buffer growth will breach spill_threshold_blocks within 2 cycles. | 2–4h | PLAN_SELF_MONITORING.md §10.3 |
| OPS-3 | pgtrickle.scheduler_overhead() diagnostic function. Returns busy-time ratio, queue depth, avg dispatch latency, and fraction of CPU spent on DF STs vs user STs. | 2–4h | — |
| OPS-4 | pgtrickle.explain_dag() — Mermaid/DOT output. Returns DAG as Mermaid markdown with node colours: user=blue, self-monitoring=green, suspended=red. | 3–4h | — |
| OPS-5 | sql/self_monitoring_setup.sql quick-start template. Runnable script: call setup_self_monitoring(), set self_monitoring_auto_apply = 'threshold_only', configure LISTEN, query initial recommendations. | 1h | — |
| OPS-6 | Workload-aware poll intervals via DF-5 signal. Replace compute_adaptive_poll_ms() exponential backoff with pre-emptive dispatch interval widening when df_scheduling_interference detects contention. | 2–4h | PLAN_SELF_MONITORING.md §10.2 |
| DASH-1 | Grafana Self-Monitoring Dashboard. New monitoring/grafana/dashboards/pg_trickle_self_monitoring.json — 5 panels reading from DF-1 through DF-5. | 4–6h | PLAN_SELF_MONITORING.md §10.5 |
| DBT-1 | dbt pgtrickle_enable_monitoring post-hook macro. Calls setup_self_monitoring() automatically after a successful dbt run; documented in dbt-pgtrickle/. | 2h | — |

OPS-1 — pgtrickle.recommend_refresh_mode(st_name text)

Reads directly from df_threshold_advice instead of computing a single-cycle cost comparison on demand (PLAN_SELF_MONITORING.md §10.6). Returns TABLE(mode text, confidence text, reason text). When confidence is LOW (< 10 history rows), emits a fallback with mode='AUTO' and a reason explaining insufficient data. Integrates with explain_st() output.

Verify: call on an ST with ≥ 20 history cycles; assert mode ∈ {'DIFFERENTIAL','FULL','AUTO'} and confidence ∈ {'HIGH','MEDIUM','LOW'}. Dependencies: DF-A2. Schema change: No.
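
A sketch of the intended call shape; the return type is quoted from the item, while 'sales_summary' and the output row are illustrative:

```sql
SELECT mode, confidence, reason
FROM pgtrickle.recommend_refresh_mode('sales_summary');
-- mode           | confidence | reason
-- 'DIFFERENTIAL' | 'HIGH'     | e.g. 'DIFF averaged 12 ms vs 480 ms FULL over 32 cycles'
```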

OPS-2 — check_cdc_health() spill-risk enrichment

Currently check_cdc_health() performs full-table scans to detect anomalies. When DF-C1 is active, query df_cdc_buffer_trends growth rate instead. Emit a spill_risk = 'IMMINENT' row when the 1-cycle growth rate extrapolated 2 cycles ahead exceeds spill_threshold_blocks. Falls back to full scan when self-monitoring is not set up.

Verify: inject 80% of spill_threshold_blocks worth of buffer rows with a steep growth rate; assert check_cdc_health() returns a spill-risk alert. Dependencies: DF-C1. Schema change: No.
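
The extrapolation is a simple linear projection. A sketch with assumed column names (buffer_blocks, growth_blocks_per_cycle — the real names come from the DF-C1 definition) and assuming spill_threshold_blocks is readable via current_setting():

```sql
-- Any row returned => spill_risk = 'IMMINENT' within 2 cycles.
SELECT source_table
FROM pgtrickle.df_cdc_buffer_trends
WHERE buffer_blocks + 2 * growth_blocks_per_cycle >=
      current_setting('pg_trickle.spill_threshold_blocks')::bigint;
```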

OPS-3 — pgtrickle.scheduler_overhead() diagnostic function

Returns a snapshot of scheduler efficiency: scheduler_busy_ratio (fraction of wall-clock time spent executing refreshes), queue_depth (STs waiting to be dispatched), avg_dispatch_latency_ms, df_refresh_fraction (fraction of busy time attributable to DF STs). This makes PERF-3’s < 1% CPU target observable in production without custom monitoring.

Verify: function returns non-NULL values after 5+ refresh cycles; assert df_refresh_fraction < 0.01 in the soak test context. Dependencies: DF-D4. Schema change: No (new function only).

OPS-4 — pgtrickle.explain_dag() — Mermaid / DOT graph output

Returns the full refresh DAG as a Mermaid markdown string (default) or Graphviz DOT (via format => 'dot' argument). Node labels show ST name, current mode, and refresh interval. Node colours: user STs = blue, self-monitoring STs = green, suspended = red, fused = orange. Edges show dependency direction. Validates that DF-1 → DF-2 → DF-3 ordering is correct post-setup.

Verify: SELECT pgtrickle.explain_dag() after setup_self_monitoring() returns a string containing all five df_ nodes in green with correct edges. Dependencies: None. Schema change: No (new function only).

OPS-5 — sql/self_monitoring_setup.sql quick-start template

A standalone SQL script in sql/ that an operator can run with psql -f sql/self_monitoring_setup.sql. Contents: calls setup_self_monitoring(), sets pg_trickle.self_monitoring_auto_apply = 'threshold_only', runs LISTEN pg_trickle_alert, queries self_monitoring_status() for a status summary, and queries df_threshold_advice for initial recommendations with a warm-up note. Referenced from GETTING_STARTED.md Day 2 operations section (UX-4).

Verify: script executes without errors on a fresh install; produces visible output showing 5 active DF STs. Dependencies: DF-F4, DF-G1, UX-4. Schema change: No.
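
Per the contents listed above, the script body is roughly:

```sql
-- sql/self_monitoring_setup.sql (sketch of the listed contents)
SELECT pgtrickle.setup_self_monitoring();
-- Session-level here; a production install may prefer ALTER SYSTEM.
SET pg_trickle.self_monitoring_auto_apply = 'threshold_only';
LISTEN pg_trickle_alert;
SELECT * FROM pgtrickle.self_monitoring_status();
-- Warm-up note: advice stays LOW-confidence until history accumulates.
SELECT * FROM pgtrickle.df_threshold_advice;
```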

OPS-6 — Workload-aware poll intervals via DF-5 signal

Currently compute_adaptive_poll_ms() uses pure exponential backoff that reacts to contention only after it occurs. Replace this with a pre-emptive signal: after each scheduler tick, read the latest overlap_count from df_scheduling_interference; if overlap_count >= 2, increase the dispatch interval for the next tick by 20% before dispatching (capped at pg_trickle.max_poll_interval_ms). This closes the self-monitoring feedback loop by letting the analytics directly influence scheduling policy, reducing contention on write-heavy deployments without waiting for timeouts.

Verify: soak test with known-contending STs shows lower overlap_count in DF-5 with signal enabled vs disabled. scheduler_overhead() shows reduced busy-time ratio. Dependencies: DF-C2, OPS-3. Schema change: No.
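
Risk #9 under Conflicts & Risks requires the per-tick read to be O(1). In SQL terms (column names per that risk item; otherwise assumed):

```sql
-- Latest interference sample only -- never a full table scan on the hot path.
SELECT overlap_count
FROM pgtrickle.df_scheduling_interference
ORDER BY collected_at DESC
LIMIT 1;
-- overlap_count >= 2 => widen the next dispatch interval by 20%,
-- capped at pg_trickle.max_poll_interval_ms.
```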

DASH-1 — Grafana Self-Monitoring Dashboard

Add monitoring/grafana/dashboards/pg_trickle_self_monitoring.json alongside the existing pg_trickle_overview.json. Five panels: (1) Refresh throughput timeline (DF-1 avg_diff_ms over time), (2) Anomaly heatmap (DF-2 per-ST anomaly type grid), (3) Threshold calibration scatter (DF-3 current vs recommended threshold), (4) CDC buffer growth sparklines (DF-4 per-source growth rate), (5) Interference matrix (DF-5 overlap heatmap). Provisioned automatically in monitoring/grafana/provisioning/.

Verify: docker compose up in monitoring/ loads both dashboards; all five panels resolve without No data errors using the postgres-exporter queries. Dependencies: DF-F2, DF-A1, DF-A2, DF-C1, DF-C2. Schema change: No.

DBT-1 — pgtrickle_enable_monitoring dbt post-hook macro

Add a pgtrickle_enable_monitoring macro to dbt-pgtrickle/macros/ that calls {{ pgtrickle.setup_self_monitoring() }} and emits a log() message confirming activation. Documented in dbt-pgtrickle/README.md. Users add +post-hook: "{{ pgtrickle_enable_monitoring() }}" to dbt_project.yml to auto-enable monitoring after any dbt run. Idempotent — safe to call on every run because setup_self_monitoring() is already idempotent (STAB-1).

Verify: just test-dbt includes a test case that runs the macro twice; asserts self_monitoring_status() shows 5 active STs after both calls. Dependencies: DF-F4, STAB-1. Schema change: No.
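
A minimal sketch of the macro in dbt-Jinja (run_query and log are standard dbt context functions; the exact body ships in dbt-pgtrickle/macros/):

```sql
{% macro pgtrickle_enable_monitoring() %}
  {% if execute %}
    {% do run_query("SELECT pgtrickle.setup_self_monitoring()") %}
    {% do log("pg_trickle self-monitoring enabled", info=True) %}
  {% endif %}
{% endmacro %}
```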

Documentation & Safety

| Item | Description | Effort | Ref |
| --- | --- | --- | --- |
| DF-D1 | SQL_REFERENCE.md: self-monitoring quick start. Document setup_self_monitoring(), teardown_self_monitoring(), all five df_* stream tables, and the auto-apply GUC. | 2–4h | — |
| DF-D2 | CONFIGURATION.md: pg_trickle.self_monitoring_auto_apply GUC. | 1h | — |
| DF-D3 | E2E test: control plane survives DF ST suspension. Drop or suspend all df_* STs; verify the scheduler and refresh logic operate identically. | 2–4h | PLAN_SELF_MONITORING.md §8 |
| DF-D4 | Soak test addition. Add self-monitoring STs to the existing soak test; verify no memory growth or scheduler stalls under 1-hour sustained load. | 2–4h | PLAN_SELF_MONITORING.md §8 |

Correctness

| ID | Title | Effort | Priority |
| --- | --- | --- | --- |
| CORR-1 | df_threshold_advice output always within [0.01, 0.80] | S | P0 |
| CORR-2 | DF-2 suppresses false-positive spike on first-ever refresh | S | P0 |
| CORR-3 | avg_change_ratio never NaN/Inf on zero-delta streams | S | P0 |
| CORR-4 | CDC INSERT-only invariant verified on pgt_refresh_history | XS | P1 |
| CORR-5 | DF-1 historical window boundary is exclusive, not inclusive | XS | P1 |

CORR-1 — df_threshold_advice output always within [0.01, 0.80]

The LEAST(0.80, GREATEST(0.01, …)) expression in DF-3 must hold for all input combinations including NULL avg_diff_ms, zero avg_full_ms, and extreme ratios. Add a property-based test (proptest) that generates random (avg_diff_ms, avg_full_ms, current_threshold) triples and asserts the output is always in the valid range. Any value outside [0.01, 0.80] that reaches auto-apply would corrupt stream table configuration.

Verify: proptest with 10,000 iterations; zero out-of-range results. Dependencies: DF-A2. Schema change: No.
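
The clamp's edge cases are worth spelling out, because PostgreSQL's LEAST/GREATEST ignore NULL arguments and NaN compares greater than every finite float. A self-contained probe, with the inner DF-3 expression replaced by a bare value x:

```sql
SELECT x, LEAST(0.80, GREATEST(0.01, x)) AS clamped
FROM (VALUES (NULL::float8), ('NaN'::float8), (-1.0), (0.5), (1e9)) AS t(x);
-- NULL -> 0.01  (NULL arguments are ignored, leaving only the bounds)
-- NaN  -> 0.80  (NaN sorts above every finite float, so LEAST picks 0.80)
-- -1.0 -> 0.01, 0.5 -> 0.5, 1e9 -> 0.80
```

So the clamp itself already maps NULL and NaN into [0.01, 0.80]; the property test confirms no surrounding expression re-introduces an unclamped path.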

CORR-2 — DF-2 suppresses false-positive spike on first-ever refresh

df_anomaly_signals compares latest.duration_ms against eff.avg_diff_ms. On the very first refresh of a stream table there is no rolling average yet (eff.avg_diff_ms IS NULL), so the CASE WHEN would produce no anomaly. Confirm the LATERAL subquery returns NULL (not 0) when history is empty, and that the CASE guard is > 3.0 * NULLIF(eff.avg_diff_ms, 0) so a NULL baseline never triggers a spike.

Verify: E2E test creating a brand-new ST; assert duration_anomaly IS NULL on first DF-2 refresh. Dependencies: DF-A1. Schema change: No.
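
The guard's behaviour on an empty baseline can be checked in isolation — a NULL or zero avg_diff_ms makes the comparison NULL, so the CASE yields no anomaly:

```sql
-- 250.0 stands in for latest.duration_ms; baseline for eff.avg_diff_ms.
SELECT baseline,
       CASE WHEN 250.0 > 3.0 * NULLIF(baseline, 0)
            THEN 'DURATION_SPIKE' END AS duration_anomaly
FROM (VALUES (NULL::float8), (0.0), (50.0)) AS t(baseline);
-- NULL -> NULL (no history), 0.0 -> NULL (NULLIF), 50.0 -> 'DURATION_SPIKE'
```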

CORR-3 — avg_change_ratio never NaN/Inf on zero-delta streams

DF-1 computes avg(h.delta_row_count::float / NULLIF(h.rows_inserted + h.rows_deleted, 0)). If a stream table runs only FULL refreshes (no DIFF cycles) the divisor is always NULL and avg() returns NULL — correct. But if DIFF runs with exactly zero rows inserted and zero deleted (CDC buffer was empty), NULLIF must prevent a divide-by-zero NaN. Verify the guard holds and that avg_change_ratio is either a valid float in [0, 1] or NULL.

Verify: E2E test triggering a DIFF refresh on a quiescent source; assert avg_change_ratio IS NULL OR avg_change_ratio BETWEEN 0 AND 1. Dependencies: DF-F2. Schema change: No.

CORR-4 — CDC INSERT-only invariant verified on pgt_refresh_history

pgt_refresh_history is semantically append-only: rows are only ever INSERTed (one per refresh). The CDC trigger installed by DF-F1 must be an INSERT-only trigger (no UPDATE/DELETE triggers). If the trigger were registered as FOR EACH ROW AFTER INSERT OR UPDATE, a future catalog UPDATE would generate spurious change-buffer rows and corrupt DF-1 aggregates. Inspect pg_trigger to confirm only an INSERT trigger exists.

Verify: SELECT tgtype FROM pg_trigger WHERE tgrelid = 'pgtrickle.pgt_refresh_history'::regclass returns only INSERT-event triggers. Dependencies: DF-F1. Schema change: No.
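
tgtype is a bitmask; per the pg_trigger catalog documentation, value 4 is INSERT, 8 is DELETE, and 16 is UPDATE. A decoded version of the Verify query:

```sql
SELECT tgname,
       (tgtype & 4)  <> 0 AS on_insert,
       (tgtype & 8)  <> 0 AS on_delete,
       (tgtype & 16) <> 0 AS on_update
FROM pg_trigger
WHERE tgrelid = 'pgtrickle.pgt_refresh_history'::regclass
  AND NOT tgisinternal;
-- Expect on_insert = t and on_delete = on_update = f for the CDC trigger.
```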

CORR-5 — DF-1 historical window boundary is exclusive, not inclusive

The WHERE h.start_time > now() - interval '1 hour' clause uses a strict > comparison. This ensures a row with start_time exactly equal to the boundary is excluded on each pass, preventing double-counting in rolling aggregates. Confirm the query plan uses the index on (pgt_id, start_time) (see PERF-1) and that the boundary is consistent across DF-1, DF-2, and DF-4 (all use the same 1-hour lookback).

Verify: unit test comparing aggregate output with a row at the exact boundary; assert it is excluded. Dependencies: DF-F2. Schema change: No.


Stability

| ID | Title | Effort | Priority |
| --- | --- | --- | --- |
| STAB-1 | setup_self_monitoring() is fully idempotent | S | P0 |
| STAB-2 | Auto-apply handles ALTER STREAM TABLE failure gracefully | S | P0 |
| STAB-3 | DF STs survive DROP EXTENSION + CREATE EXTENSION cycle | S | P1 |
| STAB-4 | Auto-apply worker checks ST still exists before applying | XS | P1 |
| STAB-5 | teardown_self_monitoring() is safe when some DF STs already removed | XS | P1 |

STAB-1 — setup_self_monitoring() is fully idempotent

Calling setup_self_monitoring() a second time while DF STs already exist must not raise an error. Use IF NOT EXISTS semantics internally (or check catalog before creating). The function must also be safe to call concurrently from two sessions. Idempotency is critical for upgrade scripts and Terraform-style declarative deployment workflows.

Verify: call setup_self_monitoring() three times in a row; no errors, no duplicate stream tables. Dependencies: DF-F4. Schema change: No.
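
One plausible shape for the guard — a transaction-scoped advisory lock to serialise concurrent callers plus a catalog existence check. A sketch only, not the shipped implementation:

```sql
DO $$
BEGIN
  -- Serialise concurrent callers for the duration of the transaction.
  PERFORM pg_advisory_xact_lock(hashtext('pgtrickle.setup_self_monitoring'));
  IF NOT EXISTS (SELECT 1 FROM pgtrickle.pgt_stream_tables
                 WHERE pgt_name = 'df_efficiency_rolling') THEN
    NULL;  -- create_stream_table(...) calls for the five df_* tables go here
  END IF;
END $$;
```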

STAB-2 — Auto-apply handles ALTER STREAM TABLE failure gracefully

The auto-apply post-tick hook reads df_threshold_advice and issues ALTER STREAM TABLE … SET auto_threshold = <recommended>. If the stream table was dropped between the advice read and the apply (a TOCTOU race), the ALTER will error. Catch SQL errors in the post-tick hook with an appropriate match on PgTrickleError and log a WARNING rather than crashing the background worker.

Verify: unit test with a mocked ALTER that returns ERROR: relation does not exist; assert the worker logs a warning and continues to the next advice row. Dependencies: DF-G2. Schema change: No.

STAB-3 — DF STs survive DROP EXTENSION + CREATE EXTENSION cycle

DROP EXTENSION pg_trickle CASCADE drops all extension-owned objects. After CREATE EXTENSION pg_trickle, setup_self_monitoring() should recreate the DF STs cleanly. There must be no leftover triggers, orphaned change buffer tables, or stale catalog rows from the previous installation. This is the most likely failure mode after an emergency rollback + reinstall.

Verify: E2E test: setup_self_monitoring() → DROP EXTENSION CASCADE → CREATE EXTENSION → setup_self_monitoring() → insert history → refresh DF-1; assert correct aggregates. Dependencies: DF-F4, DF-F5. Schema change: No.

STAB-4 — Auto-apply worker checks ST still exists before applying

Before issuing ALTER STREAM TABLE, the worker should confirm the ST is still in pgt_stream_tables and is not in SUSPENDED or FUSED state. Applying a threshold change to a SUSPENDED ST is harmless but wasteful; applying to a FUSED ST is wrong (the fuse exists for a reason). Add a pre-apply guard in the Rust post-tick hook.

Verify: E2E test suspending an ST manually while auto-apply is enabled; assert no threshold change is applied to the suspended stream table. Dependencies: DF-G2. Schema change: No.

STAB-5 — teardown_self_monitoring() is safe when some DF STs already removed

If a user manually drops df_anomaly_signals before calling teardown_self_monitoring(), the teardown function must not error on DROP STREAM TABLE df_anomaly_signals. Use drop_stream_table(name, if_exists => true) semantics for each DF table in the teardown. Otherwise a partial teardown leaves the system in an inconsistent state.

Verify: drop two DF STs manually, then call teardown_self_monitoring(); assert no errors and remaining DF STs are gone. Dependencies: DF-F5. Schema change: No.


Performance

| ID | Title | Effort | Priority |
| --- | --- | --- | --- |
| PERF-1 | Index on pgt_refresh_history(pgt_id, start_time) for DF queries | XS | P0 |
| PERF-2 | Benchmark DF-1 vs refresh_efficiency() on 10 K history rows | S | P0 |
| PERF-3 | Dogfooding scheduler overhead target: < 1% of total CPU | S | P1 |
| PERF-4 | DF-5 self-join uses bounded index scan, not seq-scan | S | P1 |
| PERF-5 | History pruning batch-DELETE with short transactions (no CDC lock contention) | S | P1 |
| PERF-6 | Columnar change tracking Phase 1 — CDC bitmask (deferred from v0.17/v0.18) | M | P1 |

PERF-1 — Index on pgt_refresh_history(pgt_id, start_time) for DF queries

All five DF stream tables filter pgt_refresh_history on (pgt_id, start_time). Without a composite index on these columns the rolling-window WHERE clause forces a sequential scan of the growing history table. Verify the index was created during extension install (check the upgrade migration); if missing, add it as part of the 0.19.0 → 0.20.0 migration script.

Verify: EXPLAIN (FORMAT TEXT) SELECT … FROM pgtrickle.pgt_refresh_history WHERE pgt_id = 1 AND start_time > now() - interval '1 hour' shows an index scan. Schema change: Yes (index addition in migration script).
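
The migration statement, sketched (the index name is an assumption; CONCURRENTLY must be issued outside a transaction block, per risk #1 under Conflicts & Risks):

```sql
CREATE INDEX CONCURRENTLY IF NOT EXISTS pgt_refresh_history_pgt_id_start_time_idx
    ON pgtrickle.pgt_refresh_history (pgt_id, start_time);

-- The Verify step's plan check:
EXPLAIN (FORMAT TEXT)
SELECT * FROM pgtrickle.pgt_refresh_history
WHERE pgt_id = 1 AND start_time > now() - interval '1 hour';
```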

PERF-2 — Benchmark DF-1 vs refresh_efficiency() on 10 K history rows

The primary performance claim of self-monitoring is that a maintained DIFFERENTIAL stream table is cheaper than scanning the full history table on every diagnostic call. Establish a Criterion micro-benchmark that seeds 10 K history rows, then compares: (a) a full SELECT * FROM pgtrickle.refresh_efficiency() call vs (b) a SELECT * FROM pgtrickle.df_efficiency_rolling read after one incremental refresh. The benchmark documents the win concretely.

Verify: Criterion benchmark shows DF-1 read is at least 5× faster than refresh_efficiency() at 10 K rows. Included in benches/ and run in CI. Dependencies: DF-F2. Schema change: No.

PERF-3 — Dogfooding scheduler overhead target: < 1% of total CPU

Five DF STs at 48–96 s schedules add background refresh work. Under a realistic load (20 user STs, 10 K history rows), the total time spent refreshing DF STs should be < 1% of total scheduler CPU. Measure in the E2E soak test by comparing scheduler loop busy-time with and without DF STs. If overhead exceeds 1%, relax schedules to 120 s or move DF STs to refresh_tier = 'cold'.

Verify: soak test reports DF refresh overhead as a fraction of total scheduler CPU; assert < 1%. Dependencies: DF-D4. Schema change: No.

PERF-4 — DF-5 self-join uses bounded index scan, not seq-scan

df_scheduling_interference joins pgt_refresh_history to itself on an overlap condition with a 1-hour bound. Without the index from PERF-1 this double-scan is O(N²) in history rows. Verify EXPLAIN shows nested-loop index scans (not hash or merge join over full table) for both sides of the self-join. If the planner chooses a seq-scan, add enable_seqscan = off for the DF-5 query or restructure with a CTE.

Verify: EXPLAIN of DF-5 query shows index scans on both sides of the JOIN. Dependencies: PERF-1, DF-C2. Schema change: No.

PERF-5 — History pruning batch-DELETE with short transactions

pg_trickle.history_retention_days cleanup (shipped in v0.19.0) currently deletes rows in a single long transaction. Under self-monitoring, that transaction holds a lock on pgt_refresh_history that can delay CDC trigger INSERTs. Rewrite the purge as batched DELETEs: delete at most 500 rows per transaction, commit between batches, sleep 50 ms between batches. The index from PERF-1 ensures each batch is an index-range scan, not a seq-scan.

Verify: soak test running history purge concurrently with DF CDC trigger INSERTs; no lock wait timeout observed. Batch size configurable via pg_trickle.history_purge_batch_size GUC (default 500). Dependencies: PERF-1. Schema change: No.
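
A SQL sketch of the batching strategy; the shipped purge runs in the Rust background worker, so the procedure name and shape here are illustrative:

```sql
CREATE OR REPLACE PROCEDURE prune_history_batched(cutoff     timestamptz,
                                                  batch_size int DEFAULT 500)
LANGUAGE plpgsql AS $$
DECLARE deleted int;
BEGIN
  LOOP
    DELETE FROM pgtrickle.pgt_refresh_history
    WHERE ctid IN (SELECT ctid FROM pgtrickle.pgt_refresh_history
                   WHERE start_time < cutoff
                   LIMIT batch_size);
    GET DIAGNOSTICS deleted = ROW_COUNT;
    EXIT WHEN deleted = 0;
    COMMIT;                  -- short transaction: locks released per batch
    PERFORM pg_sleep(0.05);  -- 50 ms breather between batches
  END LOOP;
END $$;
```

CALL must be issued outside an explicit transaction block so the in-procedure COMMIT is legal.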

PERF-6 — Columnar change tracking Phase 1 — CDC bitmask

Deferred from v0.17.0 (twice) and v0.18.0. Dog-feeding now provides concrete internal workload data that justifies the schema change. Phase 1 only: compute changed_columns bitmask (old.col IS DISTINCT FROM new.col) in the CDC trigger for UPDATE rows; store as int8 in the change buffer. Phase 2 (delta-scan filtering using the bitmask) deferred to v0.22.0. Gate behind pg_trickle.columnar_tracking GUC (default off). This is the foundation for 50–90% delta volume reduction on wide-table UPDATE workloads.

Verify: UPDATE a 20-column row, changing 2 columns; assert changed_columns bitmask has exactly 2 bits set. just check-upgrade-all passes. Dependencies: None. Schema change: Yes (change buffer schema addition + migration script).
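
To make the bitmask concrete: for a 3-column table, the trigger-generated expression composes one IS DISTINCT FROM term per column, each mapped to its own bit. A self-contained illustration (the demo table and function are invented for the example; the shipped CDC trigger generates the per-column terms at create_stream_table() time):

```sql
CREATE TABLE demo (a int, b text, c date);

CREATE FUNCTION demo_changed_columns(o demo, n demo) RETURNS int8
LANGUAGE sql IMMUTABLE AS $$
  SELECT ((o.a IS DISTINCT FROM n.a)::int * 1
        | (o.b IS DISTINCT FROM n.b)::int * 2
        | (o.c IS DISTINCT FROM n.c)::int * 4)::int8
$$;

SELECT demo_changed_columns(ROW(1, 'x', '2026-01-01')::demo,
                            ROW(2, 'x', '2026-01-02')::demo);
-- = 5: bits 0 and 2 set (columns a and c changed)
```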


Scalability

| ID | Title | Effort | Priority |
| --- | --- | --- | --- |
| SCAL-1 | DF STs refresh within window at 100 user stream tables | S | P1 |
| SCAL-2 | pgt_refresh_history retention interacts correctly with self-monitoring | S | P1 |
| SCAL-3 | 1-hour rolling window doesn’t over-aggregate when history is sparse | XS | P2 |

SCAL-1 — DF STs refresh within window at 100 user stream tables

With 100 user STs generating up to 100 history rows per 48 s window, DF-1 processes up to ~7,500 rows/hour. Verify that the DIFFERENTIAL refresh of DF-1 completes within its 48 s schedule interval at this load, leaving margin for DF-2 and DF-3. If DF-1 duration exceeds 10 s, investigate query plan and index usage. Run as part of the soak-test at high table count.

Verify: soak test with 100 STs; DF-1 refresh duration < 10 s throughout. Dependencies: PERF-1. Schema change: No.

SCAL-2 — pgt_refresh_history retention interacts correctly with self-monitoring

pg_trickle.history_retention_days (shipped in v0.19.0, default 90 days) purges old history rows. DF-1 only looks back 1 hour, so retention does not affect correctness. However the purge job must not hold a long-running lock that delays CDC trigger firing on concurrent INSERT into the history table. Verify that the cleanup job uses a DELETE … RETURNING batch strategy with short transactions to avoid blocking DF CDC triggers.

Verify: E2E test running the history purge job while DF-1 is being refreshed; no lock wait timeout, no CDC trigger delay. Dependencies: DF-F1. Schema change: No.

SCAL-3 — 1-hour rolling window doesn’t over-aggregate when history is sparse

For a stream table that refreshes every 30 minutes (2 refreshes/hour), the DF-1 1-hour window contains at most 2 rows. The AVG() aggregate is still meaningful, but percentile_cont(0.95) over 2 rows is misleading. Document the minimum sample size (in the confidence column of DF-3) and add a note in SQL_REFERENCE.md that DF stats are most meaningful for STs refreshing every 60 s or faster.

Verify: SQL_REFERENCE.md updated; confidence = 'LOW' for STs with total_refreshes < 10. Dependencies: DF-A2. Schema change: No.


Ease of Use

| ID | Title | Effort | Priority |
| --- | --- | --- | --- |
| UX-1 | pgtrickle.self_monitoring_status() diagnostic function | S | P0 |
| UX-2 | setup_self_monitoring() warm-up hint when history is sparse | XS | P1 |
| UX-3 | NOTIFY on anomaly via pg_trickle_alert channel | S | P1 |
| UX-4 | GETTING_STARTED.md: “Day 2 operations” section | S | P1 |
| UX-5 | explain_st() shows if a DF ST covers the queried stream table | XS | P2 |
| UX-6 | recommend_refresh_mode() exposed in explain_st() JSON output | XS | P2 |
| UX-7 | scheduler_overhead() output included in TUI diagnostics panel | XS | P2 |
| UX-8 | df_threshold_advice extended with SLA headroom column | S | P2 |

UX-1 — pgtrickle.self_monitoring_status() diagnostic function

A single-query overview of the self-monitoring analytics plane: name, last refresh timestamp, row count, and whether the DF ST is ACTIVE / SUSPENDED / NOT_CREATED. This is the first function an operator should call to confirm that self-monitoring is working. Return type: TABLE(df_name text, status text, last_refresh timestamptz, row_count bigint, note text).

Verify: function returns 5 rows when all DF STs are active; returns rows with status = 'NOT_CREATED' when setup_self_monitoring() has not been called. Schema change: No (new function only).

UX-2 — setup_self_monitoring() warm-up hint when history is sparse

If pgt_refresh_history has fewer than 50 rows when setup_self_monitoring() is called, emit a NOTICE: "Self-monitoring stream tables created. DF analytics will populate as refresh history accumulates (currently N rows; recommend ≥ 50 before consulting df_threshold_advice)." This prevents operators from acting on meaningless LOW-confidence advice immediately after setup.

Verify: call setup_self_monitoring() on a fresh install; assert NOTICE contains the row count and the ≥ 50 recommendation. Dependencies: DF-F4. Schema change: No.

UX-3 — NOTIFY on anomaly via pg_trickle_alert channel

When df_anomaly_signals detects a duration_anomaly IS NOT NULL or recent_failures >= 2 after a refresh, emit a pg_notify('pg_trickle_alert', payload::text) with event = 'self_monitor_anomaly', the stream table name, anomaly type, last duration, baseline, and a plain-English recommendation. This integrates with existing alert pipelines without requiring a new channel. Fires from a post-refresh trigger on df_anomaly_signals or from the auto-apply post-tick hook.

Verify: E2E test LISTEN on pg_trickle_alert; inject a 3× duration spike; assert NOTIFY payload arrives with correct anomaly type. Dependencies: DF-A1. Schema change: No.
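
Payload shape, sketched with json_build_object; field names follow the description above and the values are illustrative:

```sql
SELECT pg_notify('pg_trickle_alert', json_build_object(
  'event',            'self_monitor_anomaly',
  'stream_table',     'sales_summary',
  'anomaly',          'DURATION_SPIKE',
  'last_duration_ms', 930,
  'baseline_ms',      280,
  'recommendation',   'Check recent change volume; consider a manual FULL refresh.'
)::text);
```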

UX-4 — GETTING_STARTED.md: “Day 2 operations” section

Add a new section to docs/GETTING_STARTED.md covering the first steps after initial deployment: (1) enable self-monitoring with setup_self_monitoring(), (2) check status with self_monitoring_status(), (3) query df_threshold_advice to tune thresholds, (4) set up anomaly alerting via LISTEN. This gives new users a clear post-install checklist and demonstrates the self-monitoring value proposition immediately.

Verify: documentation PR reviewed; code examples in GETTING_STARTED.md execute without modification. Dependencies: UX-1, UX-2. Schema change: No.

UX-5 — explain_st() shows if a DF ST covers the queried stream table

When a user calls pgtrickle.explain_st('my_table'), append a line "Self-monitoring coverage: df_efficiency_rolling ✓, df_threshold_advice ✓" (or "Not set up — run setup_self_monitoring()") to the output. This surfaces the analytics plane to users who might not know self-monitoring exists, without requiring a separate function call.

Verify: SELECT explain_st('any_table') output includes a self_monitoring field in the JSON output. Dependencies: UX-1. Schema change: No.

UX-8 — df_threshold_advice extended with SLA headroom column

Extend the DF-3 defining query to include a computed sla_headroom_ms column: freshness_deadline_ms - avg_diff_ms from pgt_refresh_history. When sla_headroom_ms < 0, add a boolean sla_breach_risk = true flag so operators can see at a glance which STs risk missing their freshness SLA on the next DIFFERENTIAL cycle. The freshness_deadline column already exists in pgt_refresh_history (since v0.2.3). No schema change required.

Verify: create an ST with a tight freshness_deadline; run slow synthetic refreshes; assert df_threshold_advice.sla_breach_risk = true. Dependencies: DF-A2. Schema change: No (view column addition only).
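
The arithmetic, isolated — a synthetic row stands in for the DF-3 join:

```sql
SELECT freshness_deadline_ms - avg_diff_ms       AS sla_headroom_ms,
       (freshness_deadline_ms - avg_diff_ms) < 0 AS sla_breach_risk
FROM (VALUES (500.0, 620.0)) AS t(freshness_deadline_ms, avg_diff_ms);
-- sla_headroom_ms = -120, sla_breach_risk = true
```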

UX-6 — recommend_refresh_mode() exposed in explain_st() JSON output

explain_st() already shows self-monitoring coverage (UX-5). Extend its JSON output with a recommended_mode field reading from df_threshold_advice (OPS-1). If OPS-1 is not available (no DF setup), fall back to null with a setup_self_monitoring() hint. Keeps the single-function diagnostic surface comprehensive without requiring separate calls.

Verify: SELECT explain_st('any_table') JSON includes recommended_mode and mode_confidence fields. Dependencies: OPS-1. Schema change: No.

UX-7 — scheduler_overhead() output included in TUI diagnostics panel

The TUI (pgtrickle-tui) already shows refresh latency sparklines and ST status. Add a diagnostics panel (toggle key D) showing the fields from scheduler_overhead(): busy ratio, queue depth, and DF fraction as a percentage. Gives operators hands-on observability without needing psql.

Verify: TUI diagnostics panel shows all three scheduler overhead fields; df_refresh_fraction updates after each DF refresh cycle. Dependencies: OPS-3. Schema change: No.


Test Coverage

| ID | Title | Effort | Priority |
| --- | --- | --- | --- |
| TEST-1 | Property test: DF-3 recommended threshold always ∈ [0.01, 0.80] | S | P0 |
| TEST-2 | Light E2E: self-monitoring create/refresh/teardown full cycle | S | P0 |
| TEST-3 | Upgrade test: pgt_refresh_history rows survive 0.19.0 → 0.20.0 | S | P0 |
| TEST-4 | Regression test: DF STs absent from check_cdc_health() anomaly list | XS | P1 |
| TEST-5 | Stability test: self-monitoring under 1-h soak with 50 user STs | M | P1 |
| TEST-6 | Light E2E: setup_self_monitoring() idempotency (3× call) | XS | P1 |

TEST-1 — Property test: DF-3 recommended threshold always ∈ [0.01, 0.80]

Implements CORR-1 as a proptest unit test. Generate random (avg_diff_ms: 0.0–100_000.0, avg_full_ms: 0.0–100_000.0, current: 0.01–0.80) triples, compute the DF-3 CASE expression in Rust, assert output ∈ [0.01, 0.80]. Can be a pure Rust unit test in src/refresh.rs alongside the existing compute_adaptive_threshold tests — no database required.

Verify: just test-unit passes; 10,000 proptest iterations with zero failures. Dependencies: CORR-1. Schema change: No.

TEST-2 — Light E2E: self-monitoring create/refresh/teardown full cycle

A light E2E test (stock postgres:18.3 container) that: (1) installs the extension, (2) creates 3 user STs, (3) runs 5 refresh cycles to populate history, (4) calls setup_self_monitoring(), (5) refreshes all DF STs once, (6) asserts self_monitoring_status() shows 5 active STs, (7) calls teardown_self_monitoring(), (8) asserts all DF STs are gone.

Verify: test passes in just test-light-e2e with zero assertions failed. Schema change: No.

TEST-3 — Upgrade test: pgt_refresh_history rows survive 0.19.0 → 0.20.0

The 0.19.0 → 0.20.0 migration adds an index to pgt_refresh_history (PERF-1). The upgrade must not truncate, reorder, or modify existing history rows. Write an upgrade E2E test: deploy 0.19.0, run 10 refreshes, ALTER EXTENSION pg_trickle UPDATE, assert all 10 history rows are intact and the new index exists.

Verify: upgrade E2E test passes; SELECT count(*) FROM pgt_refresh_history unchanged after upgrade. Schema change: Yes (index).

TEST-4 — Regression test: DF STs absent from check_cdc_health() anomaly list

pgtrickle.check_cdc_health() scans all stream tables for CDC anomalies. After setup_self_monitoring(), DF STs must not appear in the anomaly list just because they are refreshed at longer intervals (48–96 s). Their schedules must be recognised as intentionally relaxed, not “falling behind”.

Verify: E2E test: setup_self_monitoring() → wait one full DF cycle → assert check_cdc_health() returns no anomalies for any df_ table. Dependencies: DF-F4. Schema change: No.

TEST-5 — Stability test: self-monitoring under 1-h soak with 50 user STs

Extends DF-D4. Runs 50 user STs + 5 DF STs for 1 hour under steady insert load (1,000 rows/min across all sources). Assertions: (a) all DF STs remain ACTIVE, (b) no OOM or background worker crash, (c) DF-1 avg refresh duration < 5 s throughout, (d) pgtrickle.self_monitoring_status() shows 5 active STs at end of run.

Verify: soak test passes with all four assertions. Dependencies: DF-D4, SCAL-1. Schema change: No.

TEST-6 — Light E2E: setup_self_monitoring() idempotency (3× call)

Implements STAB-1 as a light E2E test. Call setup_self_monitoring() three consecutive times in the same session. Assert: no errors, exactly five df_ stream tables in pgt_stream_tables, no duplicate triggers in pg_trigger for history table.

Verify: test passes in just test-light-e2e; SELECT count(*) FROM pgtrickle.pgt_stream_tables WHERE pgt_name LIKE 'df_%' returns 5 after all three calls. Dependencies: STAB-1. Schema change: No.


Conflicts & Risks

  1. PERF-1 (index addition) requires a migration script change. Adding CREATE INDEX CONCURRENTLY to the 0.19.0 → 0.20.0 migration must be tested with just check-upgrade-all. CONCURRENTLY cannot run inside a transaction block — the migration must issue it outside the default single-transaction DDL wrapper.

  2. UX-3 (NOTIFY on anomaly) fires from a post-refresh path. If the pg_notify() call fails (e.g., payload too large), it must not roll back the DF-2 refresh. Wrap the notify in a BEGIN … EXCEPTION WHEN OTHERS THEN NULL END block, or fire it from a deferred trigger.

  3. STAB-3 (DROP EXTENSION cycle) requires DF STs to be extension-owned or cleanly unregistered. If DF STs are not extension-owned objects, DROP EXTENSION CASCADE will not drop them. Either register them as extension members or document that teardown_self_monitoring() must be called before DROP EXTENSION.

  4. TEST-5 (soak test) overlaps with the existing soak test in CI. Add it to the daily stability-tests.yml workflow rather than ci.yml to avoid extending PR CI time. Mark with #[ignore] and trigger via just test-soak.

  5. CORR-5 / PERF-4 interaction. The start_time > now() - interval '1 hour' boundary and the index depend on the planner choosing an index range scan. On very busy deployments where the cardinality estimate is off, the planner may prefer a seq-scan. Consider adding SET enable_seqscan = off inside the DF stream table queries if plan stability is a concern.

  6. PERF-6 (columnar tracking) is a schema change — deferred twice already. The changed_columns column addition to all change buffer tables requires a migration script. Gate strictly behind pg_trickle.columnar_tracking = off default. If capacity is tight, PERF-6 can be cut from v0.20.0 without affecting any other item — it shares no code paths with the DF pipeline.

  7. OPS-2 (check_cdc_health() enrichment) has a fallback requirement. When setup_self_monitoring() has not been called, the function must fall back to the old full-scan path without error. Guard with a catalog check for df_cdc_buffer_trends existence before querying it.

  8. OPS-4 (explain_dag()) output size. At 100+ user STs the Mermaid output may exceed typical terminal width. Offer format => 'dot' and limit => N arguments to constrain output. Default format => 'mermaid' with a NOTICE when DAG has > 20 nodes.

  9. OPS-6 (workload-aware poll) writes to the scheduler hot path. The compute_adaptive_poll_ms() function is called on every scheduler tick. The DF-5 read must be a single O(1) catalog lookup (latest row only), not a full table scan. Guard with LIMIT 1 ORDER BY collected_at DESC. If the DF-5 table does not exist (self-monitoring not set up), fall back to the old backoff logic without error.

  10. DASH-1 (Grafana) depends on postgres-exporter SQL queries. The dashboard panels use custom SQL collectors in the postgres-exporter config. Verify that monitoring/ docker-compose already mounts query config; if not, add a pg_trickle_df_queries.yaml collector file alongside the existing exporter config.

  11. DBT-1 macro idempotency. The pgtrickle_enable_monitoring macro calls setup_self_monitoring() on every dbt run. Document that this is intentionally safe (STAB-1) and adds < 5 ms overhead per run.

v0.20.0 total: ~3–4 weeks

Exit criteria:

- [x] DF-F1: pgt_refresh_history receives CDC INSERT triggers when create_stream_table() is called
- [x] DF-F2: df_efficiency_rolling created and refreshes correctly in DIFFERENTIAL mode
- [x] DF-F3: DF-1 output matches refresh_efficiency() results on synthetic history
- [x] DF-F4: setup_self_monitoring() creates all five df_* stream tables in one call
- [x] DF-F5: teardown_self_monitoring() drops all df_* tables cleanly with no orphaned triggers
- [x] DF-A1: df_anomaly_signals created and detects 3× duration spikes
- [x] DF-A2: df_threshold_advice provides HIGH-confidence recommendations after ≥ 20 refresh cycles
- [x] DF-A3: DAG ensures DF-1 refreshes before DF-2 and DF-3 in every scheduler tick
- [x] DF-C1: df_cdc_buffer_trends created (FULL or DIFFERENTIAL mode)
- [x] DF-C2: df_scheduling_interference detects overlapping concurrent refreshes
- [x] DF-G1: pg_trickle.self_monitoring_auto_apply GUC registered with default off
- [x] DF-G2: Auto-apply adjusts threshold with ≥ 1 confirmed change in E2E test
- [x] DF-G5: Rate limiting verified — no more than 1 change per ST per 10 minutes
- [x] DF-D3: Suspending all df_* STs does not affect control-plane operation
- [x] CORR-1: df_threshold_advice output always within [0.01, 0.80] (property test)
- [x] CORR-2: No false-positive DURATION_SPIKE on first-ever refresh of a new ST
- [x] CORR-3: avg_change_ratio is NULL or in [0, 1] for zero-delta sources
- [x] CORR-4: Only INSERT triggers (no UPDATE/DELETE) on pgt_refresh_history
- [x] STAB-1: setup_self_monitoring() called 3× produces no errors and no duplicates
- [x] STAB-2: Auto-apply worker logs WARNING (not panic) when ALTER target disappears
- [x] STAB-3: DROP EXTENSION + CREATE EXTENSION + setup_self_monitoring() cycle works cleanly
- [x] PERF-1: pgt_refresh_history(pgt_id, start_time) index exists and is used by DF queries
- [x] PERF-2: DF-1 read ≥ 5× faster than refresh_efficiency() at 10 K history rows
- [x] UX-1: pgtrickle.self_monitoring_status() returns correct status for all five DF STs
- [x] UX-2: setup_self_monitoring() emits warm-up NOTICE when history has < 50 rows
- [x] UX-3: pg_trickle_alert NOTIFY received within one DF cycle after a 3× duration spike
- [x] TEST-1: Proptest for DF-3 threshold bounds passes 10,000 iterations
- [x] TEST-2: Light E2E full cycle test passes
- [x] TEST-3: Upgrade E2E: history rows intact and index present after 0.19.0 → 0.20.0
- [x] TEST-4: check_cdc_health() reports no anomalies for df_* tables after setup
- [x] OPS-1: recommend_refresh_mode() returns mode ∈ {'DIFFERENTIAL','FULL','AUTO'} and confidence ∈ {'HIGH','MEDIUM','LOW'}
- [x] OPS-2: check_cdc_health() returns spill-risk alert when buffer growth rate extrapolates to breach threshold within 2 cycles
- [x] OPS-3: scheduler_overhead() returns non-NULL fields after ≥ 5 refresh cycles; df_refresh_fraction < 0.01 in soak test
- [x] OPS-4: explain_dag() output contains all five df_* nodes after setup_self_monitoring()
- [x] OPS-5: sql/self_monitoring_setup.sql executes without errors on a fresh install
- [x] PERF-5: Concurrent history purge + DF CDC INSERT produces no lock wait timeouts in soak test
- [x] PERF-6: changed_columns bitmask stored in change buffer for UPDATE rows when columnar_tracking = on (if included)
- [x] OPS-6: Soak test shows lower overlap_count in DF-5 with workload-aware poll enabled vs disabled
- [x] DASH-1: docker compose up in monitoring/ loads pg_trickle_self_monitoring dashboard; all 5 panels show data
- [x] DBT-1: pgtrickle_enable_monitoring macro runs twice without error; self_monitoring_status() shows 5 active STs after both calls
- [x] UX-8: df_threshold_advice.sla_breach_risk = true when avg_diff_ms > freshness_deadline_ms on synthetic data
- [x] Extension upgrade path tested (0.19.0 → 0.20.0)
- [x] just check-version-sync passes