Plain-language companion: v0.27.0.md

# v0.27.0 — Operability, Observability & DR

**Status:** Planned. Sourced from PLAN_OVERALL_ASSESSMENT_2.md §4, §7, §9 — the remaining actionable items not addressed in v0.24.0–v0.26.0.

## Release Theme

This release closes the final pre-1.0 operability gaps identified in the v0.23.0 deep-analysis report. It spans four complementary themes: (1) a snapshot and PITR API so fresh replicas can bootstrap from a point-in-time export rather than re-running the full defining query; (2) a predictive maintenance-window planner that turns the v0.22 cost model into actionable schedule recommendations; (3) a cluster-wide observability layer exposing per-database worker allocation from the postmaster and adding per-DB Prometheus metric labels; and (4) OpenMetrics conformance hardening for the metrics endpoint, including cluster-wide aggregation and a conformance test. Together these items leave pg_trickle well positioned for the v1.0 stable release.

## Stream-Table Snapshot & Point-in-Time Restore

**In plain terms:** `snapshot_stream_table()` exports the current content of a stream table — its frontier, content hash, and all rows — into an archival companion table. `restore_from_snapshot()` rehydrates that state on a fresh replica in seconds, skipping the full defining-query re-execution and aligning stream-table state with a wall-clock point in time for PITR workflows.

| Item | Description | Effort | Ref |
|------|-------------|--------|-----|
| SNAP-1 | `snapshot_stream_table(name, target)` SQL function. Exports (`pgt_id`, frontier, `content_hash`, rows) to an archival table `pgtrickle.snapshot_<name>_<timestamp>`, created via `CREATE TABLE … AS SELECT`. The snapshot includes the frontier LSN and the current `pgt_stream_tables` metadata row. Raises the `SnapshotAlreadyExists` error variant if an explicit target is given and already occupied. | 2d | PLAN_OVERALL_ASSESSMENT_2.md §7 |
| SNAP-2 | `restore_from_snapshot(name, source)` SQL function. Rehydrates a stream table from a snapshot table created by SNAP-1: replays the archived frontier into `pgt_stream_tables`, bulk-inserts rows, and skips the initial full-refresh cycle. `SnapshotSourceNotFound` and `SnapshotSchemaVersionMismatch` error variants. | 2d | PLAN_OVERALL_ASSESSMENT_2.md §7 |
| SNAP-3 | `list_snapshots(name)` + `drop_snapshot(snapshot_table)`. Monitoring function returning all snapshots for a given ST (name, creation time, row count, frontier, `size_bytes`). `drop_snapshot` drops the archival table and removes it from the metadata catalog. | 0.5d | PLAN_OVERALL_ASSESSMENT_2.md §7 |
| SNAP-4 | Tests. Integration: snapshot → drop ST → restore → verify rows and frontier match; schema-version mismatch returns an error; snapshot on an IMMEDIATE-mode ST. E2E: fresh-replica bootstrap via snapshot completes in < 5 s for a 1M-row ST. | 1d | PLAN_OVERALL_ASSESSMENT_2.md §7 |
| SNAP-5 | Documentation. SQL_REFERENCE.md: snapshot/restore API. PATTERNS.md: "Replica Bootstrap & PITR Alignment" section. BACKUP_AND_RESTORE.md: updated to cover the snapshot path alongside `pg_dump`. | 0.5d | PLAN_OVERALL_ASSESSMENT_2.md §7 |
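
A hypothetical usage sketch of the planned API (function signatures per SNAP-1–SNAP-3; the stream-table name `daily_sales` and the explicit target-table name are illustrative assumptions):

```sql
-- Snapshot the stream table into an explicitly named archival table (SNAP-1).
SELECT pgtrickle.snapshot_stream_table('daily_sales', 'snap_daily_sales_pre_upgrade');

-- On a fresh replica: replay the archived frontier and bulk-insert rows,
-- skipping the initial full-refresh cycle (SNAP-2).
SELECT pgtrickle.restore_from_snapshot('daily_sales', 'snap_daily_sales_pre_upgrade');

-- Inspect and clean up (SNAP-3).
SELECT * FROM pgtrickle.list_snapshots('daily_sales');
SELECT pgtrickle.drop_snapshot('snap_daily_sales_pre_upgrade');
```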

Snapshot/PITR subtotal: ~6 days

## Predictive Maintenance Window Planner

**In plain terms:** `recommend_schedule(name)` analyses the per-ST cost-model history (accumulated since v0.22) and returns a recommended `refresh_interval`, a peak-window cron expression, and a confidence score. A spike-forecast alert flags expected cost spikes in advance so operators can act before SLA breaches occur.

| Item | Description | Effort | Ref |
|------|-------------|--------|-----|
| PLAN-1 | `pgtrickle.recommend_schedule(name)` SQL function. Returns a single JSONB row with `recommended_interval_seconds`, `peak_window_cron`, `confidence` (0–1), and `reasoning` (text). Uses the per-ST `last_full_ms`/`last_diff_ms` history and the v0.25.0 median+MAD model. Confidence is 0.0 if fewer than `pg_trickle.schedule_recommendation_min_samples` (default 20) observations are available. | 2d | PLAN_OVERALL_ASSESSMENT_2.md §7 |
| PLAN-2 | `pgtrickle.schedule_recommendations()` set-returning function. Returns one row per registered ST with `name`, `current_interval_seconds`, `recommended_interval_seconds`, `delta_pct`, `confidence`, `reasoning`. Sortable by `delta_pct DESC` so operators can quickly find the most mis-tuned STs. | 1d | PLAN_OVERALL_ASSESSMENT_2.md §7 |
| PLAN-3 | Spike-forecast alert. Post-tick hook: if the cost model predicts the next refresh will exceed the ST's SLA by > 20 %, emit a `pg_trickle_alert` event `predicted_sla_breach` with `stream_table`, `predicted_ms`, and `sla_ms`. Alert is debounced: at most one per `pg_trickle.schedule_alert_cooldown_seconds` (default 300 s). | 1.5d | PLAN_OVERALL_ASSESSMENT_2.md §7 |
| PLAN-4 | Tests + documentation. Unit: `recommend_schedule` returns confidence = 0.0 before min_samples; returns non-trivial recommendation after synthetic history injection; spike-forecast alert fires exactly once per cooldown window. SQL_REFERENCE.md: `recommend_schedule` + `schedule_recommendations` API. CONFIGURATION.md: two new GUCs. | 1d | PLAN_OVERALL_ASSESSMENT_2.md §7 |
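
As a sketch of how an operator might consume these functions (column and JSONB key names per PLAN-1/PLAN-2; the example output values are invented):

```sql
-- Single-ST recommendation (PLAN-1). Returns JSONB, e.g.:
-- {"recommended_interval_seconds": 120, "peak_window_cron": "0 2 * * *",
--  "confidence": 0.8, "reasoning": "..."}
SELECT pgtrickle.recommend_schedule('daily_sales');

-- Rank the most mis-tuned stream tables (PLAN-2).
SELECT name, current_interval_seconds, recommended_interval_seconds, delta_pct
FROM pgtrickle.schedule_recommendations()
ORDER BY delta_pct DESC;
```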

Predictive planner subtotal: ~5.5 days

## Cluster-Wide Observability

**In plain terms:** The v0.25.0 `worker_allocation_status()` view covers per-database quota usage, but only from within a single database connection. This release adds a postmaster-level cluster summary visible from any database, tags every Prometheus metric with `db_oid` (enabling per-DB Grafana panels across a cluster), and publishes a multi-tenant deployment guide.

| Item | Description | Effort | Ref |
|------|-------------|--------|-----|
| CLUS-1 | `pgtrickle.cluster_worker_summary()` SQL function. Reads the shared-memory worker-pool block (already populated by all DB bgworkers) and returns one row per database: `db_oid`, `db_name`, `workers_active`, `workers_queued`, `quota`, `quota_utilization_pct`. Accessible from any database in the cluster without cross-DB SPI. | 2d | PLAN_OVERALL_ASSESSMENT_2.md §4 |
| CLUS-2 | Per-DB Prometheus metric labels. Tag all metrics emitted by `src/metrics_server.rs` with `db_oid=<oid>` and `db_name=<name>` labels. Enables per-DB Grafana panels and per-DB alerting rules without separate endpoints. | 1d | PLAN_OVERALL_ASSESSMENT_2.md §4, §9 |
| CLUS-3 | docs/integrations/multi-tenant.md (new page). Documents recommended multi-DB deployment patterns: quota allocation formula (`ceil(total_workers / N_databases)`), GUC configuration, Grafana dashboard snippets using `db_name` labels, and `cluster_worker_summary()` usage. | 0.5d | PLAN_OVERALL_ASSESSMENT_2.md §4 |
| CLUS-4 | docs/SCALING.md update. Add a "Cluster-wide worker fairness" section cross-referencing `cluster_worker_summary()`, the new quota GUC documentation, and the multi-tenant integration page. | 0.5d | PLAN_OVERALL_ASSESSMENT_2.md §4 |
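
A sketch of the intended operator workflow (column names per CLUS-1; the three-database quota example simply applies the formula from CLUS-3):

```sql
-- From any database in the cluster: per-DB worker usage (CLUS-1).
SELECT db_name, workers_active, workers_queued, quota, quota_utilization_pct
FROM pgtrickle.cluster_worker_summary()
ORDER BY quota_utilization_pct DESC;

-- Quota allocation formula from the multi-tenant guide (CLUS-3):
-- quota = ceil(total_workers / N_databases), e.g. ceil(8 / 3) = 3 workers per DB.
```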

Cluster observability subtotal: ~4 days

## OpenMetrics Conformance & Metrics Hardening

**In plain terms:** The `src/metrics_server.rs` endpoint introduced in v0.20 has zero unit tests and no validation that its output conforms to the OpenMetrics text format. These items add a conformance test, port-conflict and timeout handling, and a cluster-wide aggregation view.

| Item | Description | Effort | Ref |
|------|-------------|--------|-----|
| METR-1 | OpenMetrics conformance test. Parse the `/metrics` output with the `openmetrics_parser` crate (or equivalent) and assert no validation errors. Run as a unit test in `src/metrics_server.rs` using a mock request. | 1d | PLAN_OVERALL_ASSESSMENT_2.md §9 |
| METR-2 | Port-conflict and timeout unit tests. Test that `metrics_server::start()` returns a typed `MetricsServerError::PortInUse` when the port is occupied, and `MetricsServerError::Timeout` when the request handler exceeds `pg_trickle.metrics_request_timeout_ms` (new GUC, default 5000 ms). | 1d | PLAN_OVERALL_ASSESSMENT_2.md §9 |
| METR-3 | `pgtrickle.metrics_summary()` cluster-wide aggregation view. Set-returning function that aggregates key counters across all databases visible in `pg_stat_activity` (refresh count, error count, worker utilisation). Feeds the cluster-level Grafana overview dashboard. | 1.5d | PLAN_OVERALL_ASSESSMENT_2.md §9 |
| METR-4 | Malformed-HTTP handler. Catch malformed HTTP requests to the metrics endpoint; return `400 Bad Request` with a plain-text error body rather than panicking. Add unit test. | 0.5d | PLAN_OVERALL_ASSESSMENT_2.md §9 |
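
For illustration, the aggregation view (METR-3) and the new timeout GUC (METR-2) might be exercised as below; the `SHOW` syntax is standard PostgreSQL, but the GUC name is the one proposed in METR-2 and does not exist yet:

```sql
-- Aggregated counters across all databases (METR-3).
SELECT * FROM pgtrickle.metrics_summary();

-- Inspect the proposed request-timeout GUC (METR-2); default 5000 ms.
SHOW pg_trickle.metrics_request_timeout_ms;
```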

Metrics hardening subtotal: ~4 days

## Dependency Upgrades

**In plain terms:** pgrx 0.18.0 updates the proc-macro and SPI interfaces. These items upgrade the dependency, audit all `pg_sys::*` call sites for breaking changes, and validate the full test suite under the new version.

| Item | Description | Effort | Ref |
|------|-------------|--------|-----|
| DEP-1 | pgrx 0.17.0 → 0.18.0 upgrade. Bump `pgrx` and `pgrx-tests` in Cargo.toml; run `cargo pgrx init` for the target PG 18 version; resolve any API breakage in proc-macro annotations, SPI call sites, memory-context helpers, and `pg_sys::*` usages. | 1–2d | — |
| DEP-2 | Full test suite validation. Run `just test-all` under pgrx 0.18.0; fix any regressions. Update AGENTS.md pgrx version reference. | 0.5d | — |

Dependency upgrades subtotal: ~1.5–2.5 days

## Implementation Phases

| Phase | Description | Duration |
|-------|-------------|----------|
| Phase 1 | Snapshot/PITR: catalog, SQL functions, tests, documentation | Days 1–6 |
| Phase 2 | Predictive planner: `recommend_schedule`, `schedule_recommendations`, spike-forecast alert, tests | Days 6–11.5 |
| Phase 3 | Cluster observability: `cluster_worker_summary`, per-DB labels, multi-tenant docs, SCALING.md | Days 11.5–15.5 |
| Phase 4 | Metrics hardening: OpenMetrics conformance, port-conflict tests, aggregation view, malformed-HTTP handler | Days 15.5–19.5 |
| Phase 5 | Dependency upgrades: pgrx 0.18.0, full test-suite validation | Days 19.5–21.5 |
| Phase 6 | Integration testing, upgrade script, documentation review | Days 21.5–24 |

v0.27.0 total: ~3–4 weeks (~24 person-days solo)

## Exit Criteria

- [ ] SNAP-1: `snapshot_stream_table()` creates archival table with correct frontier and row data
- [ ] SNAP-2: `restore_from_snapshot()` rehydrates ST; first refresh cycle after restore is DIFFERENTIAL (not FULL)
- [ ] SNAP-3: `list_snapshots()` lists all snapshots for an ST; `drop_snapshot()` removes archival table and catalog row
- [ ] SNAP-4: fresh-replica bootstrap via snapshot completes in < 5 s for a 1M-row ST
- [ ] SNAP-5: BACKUP_AND_RESTORE.md updated; PATTERNS.md "Replica Bootstrap & PITR Alignment" section added
- [ ] PLAN-1: `recommend_schedule()` returns confidence = 0.0 before min_samples; returns non-trivial recommendation with synthetic history
- [ ] PLAN-2: `schedule_recommendations()` returns one row per ST; sortable by `delta_pct`
- [ ] PLAN-3: `predicted_sla_breach` alert fires once per cooldown window; no duplicate alerts
- [ ] PLAN-4: all unit tests for planner pass; two new GUCs documented
- [ ] CLUS-1: `cluster_worker_summary()` returns accurate per-DB worker counts from any database in the cluster
- [ ] CLUS-2: all Prometheus metrics carry `db_oid` and `db_name` labels; existing Grafana dashboard templates updated
- [ ] CLUS-3: docs/integrations/multi-tenant.md published with quota formula and Grafana snippets
- [ ] CLUS-4: docs/SCALING.md cluster-wide fairness section added
- [ ] METR-1: OpenMetrics conformance test passes; zero parse errors on live `/metrics` output
- [ ] METR-2: port-conflict test returns `MetricsServerError::PortInUse`; timeout test returns `MetricsServerError::Timeout`
- [ ] METR-3: `metrics_summary()` returns aggregated counters; Grafana cluster-overview query documented
- [ ] METR-4: malformed HTTP request returns 400 Bad Request; no panic
- [ ] DEP-1: pgrx bumped to 0.18.0; all API breakage resolved; extension builds clean
- [ ] DEP-2: `just test-all` passes under pgrx 0.18.0; AGENTS.md pgrx version reference updated
- [ ] Extension upgrade path tested (0.26.0 → 0.27.0)
- [ ] `just check-version-sync` passes