
Error Budgets for Stream Tables

SRE-style freshness monitoring with p50/p99 latency, staleness tracking, and budget consumption


Your team has an SLA: “the dashboard must reflect data no older than 30 seconds.” You’ve set the stream table schedule to 5 seconds. That should leave plenty of headroom. But does it?

What if the stream table occasionally takes 20 seconds to refresh because of a complex join? What if it fails 3 times in a row and the scheduler suspends it? What if the change buffer grows large enough to cause a FULL fallback?

You won’t know unless you measure. pg_trickle’s sla_summary() function provides SRE-style metrics: percentile latencies, freshness lag, error counts, and an error budget that tells you how much headroom you have before your SLA is violated.


The SLA Summary API

SELECT * FROM pgtrickle.sla_summary('dashboard_metrics');
 stream_table      | p50_refresh_ms | p99_refresh_ms | freshness_lag_s | error_count | error_budget_pct | window
-------------------+----------------+----------------+-----------------+-------------+------------------+---------
 dashboard_metrics | 4.8            | 23.1           | 2.3             | 1           | 94.2             | 1h

What Each Metric Means

p50_refresh_ms: The median refresh duration over the measurement window. If this approaches your schedule interval, each refresh is taking nearly as long as the gap between refreshes, leaving no headroom for slow cycles.

p99_refresh_ms: The 99th percentile refresh duration. This is your worst-case performance (excluding true outliers). If your schedule is 5 seconds and p99 is 23 seconds, 1% of your refreshes are taking nearly 5× longer than expected.

freshness_lag_s: The current staleness — time since the last successful refresh completed. If this exceeds your SLA, the data is stale right now.

error_count: Number of failed refreshes in the measurement window. Each failure means one missed refresh cycle.

error_budget_pct: The remaining error budget as a percentage. This is the key metric:

$$\text{error\_budget\_pct} = 100 \times \left(1 - \frac{\text{actual\_staleness\_violations}}{\text{allowed\_staleness\_violations}}\right)$$

At 100%, you’ve had zero violations. At 0%, you’ve exhausted your budget. Below 0%, you’re in breach.
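
For example, if the SLA allows 20 staleness violations in the window and 4 have occurred (numbers purely illustrative):

$$100 \times \left(1 - \frac{4}{20}\right) = 80\%$$

so 80% of the budget remains.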


Defining SLAs

pg_trickle doesn’t enforce SLAs — it measures compliance. The SLA is defined by your team’s freshness requirement.

The sla_summary() function accepts an optional window parameter:

-- Last hour (default)
SELECT * FROM pgtrickle.sla_summary('dashboard_metrics');

-- Last 24 hours (window is a reserved word, so it must be quoted in named notation)
SELECT * FROM pgtrickle.sla_summary('dashboard_metrics', "window" => '24h');

-- Last 7 days
SELECT * FROM pgtrickle.sla_summary('dashboard_metrics', "window" => '7d');

For a 30-second freshness SLA measured over 1 hour:

- There are 120 expected refresh opportunities (one every 30 seconds).
- If 6 of those opportunities were missed (the refresh took too long or failed), that’s a 5% violation rate → 95% error budget remaining.
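
You can cross-check this arithmetic against raw refresh history. A minimal sketch, assuming get_refresh_history exposes the status, duration_ms, and created_at columns used later in this post, and treating any failed or over-30-second refresh as a missed opportunity:

-- Sketch: approximate the violation rate for a 30s freshness SLA over 1 hour
SELECT
  missed,
  120 AS opportunities,  -- one every 30 seconds over 1 hour
  round(100 * (1 - missed / 120.0), 1) AS approx_budget_pct
FROM (
  SELECT COUNT(*) FILTER (WHERE status != 'success' OR duration_ms > 30000) AS missed
  FROM pgtrickle.get_refresh_history('dashboard_metrics')
  WHERE created_at > NOW() - INTERVAL '1 hour'
) t;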


Alerting on Error Budget

Combine sla_summary() with reactive subscriptions or pg_cron to alert when the budget is low:

-- pg_cron: check error budget every 5 minutes
SELECT cron.schedule('sla-check', '*/5 * * * *', $job$
  DO $do$
  DECLARE
    budget NUMERIC;
  BEGIN
    SELECT error_budget_pct INTO budget
    FROM pgtrickle.sla_summary('dashboard_metrics');

    -- Check the critical threshold first so only one notification fires
    IF budget < 10 THEN
      PERFORM pg_notify('sla_critical',
        format('dashboard_metrics error budget at %s%%', budget));
    ELSIF budget < 50 THEN
      PERFORM pg_notify('sla_warning',
        format('dashboard_metrics error budget at %s%%', budget));
    END IF;
  END $do$;
$job$);

A listener picks up the NOTIFY and routes it to PagerDuty, Slack, or your alerting system.
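
The listening side is ordinary LISTEN/NOTIFY. A minimal sketch of the long-lived session such a forwarding daemon would hold open:

-- In the forwarding daemon's long-lived session
LISTEN sla_warning;
LISTEN sla_critical;
-- The driver's async-notification API then delivers each payload,
-- which the daemon routes to PagerDuty, Slack, etc.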


Setting Meaningful SLAs

Different stream tables serve different purposes. A real-time fraud detection table and a weekly summary report shouldn’t have the same SLA.

Recommended tiers:

 Tier       | Freshness Target | Example
------------+------------------+-------------------------------------------------
 Critical   | < 5 seconds      | Fraud detection, inventory levels, live pricing
 Standard   | < 30 seconds     | Dashboards, operational metrics, search indexes
 Background | < 5 minutes      | Daily reports, batch analytics, archival

For each tier, set the stream table schedule to ~⅓ of the freshness target. This gives 3 refresh attempts before the SLA is breached:

 Tier       | Freshness Target | Schedule
------------+------------------+----------
 Critical   | 5s               | 1–2s
 Standard   | 30s              | 10s
 Background | 5m               | 1–2m
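
A related health check flags tables whose p99 refresh duration exceeds their schedule interval, meaning the retry cushion from the ⅓ rule is gone. A sketch, assuming the schedule column from pgt_stream_tables (used in the system-wide query later in this post) is an interval:

-- Tables whose worst-case refresh takes longer than the schedule itself
-- (assumes schedule is an interval; cast with s.schedule::interval if stored as text)
SELECT s.name, s.schedule, sla.p99_refresh_ms
FROM pgtrickle.pgt_stream_tables s
CROSS JOIN LATERAL pgtrickle.sla_summary(s.name) sla
WHERE sla.p99_refresh_ms > EXTRACT(EPOCH FROM s.schedule) * 1000;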

Diagnosing Budget Consumption

When the error budget is dropping, diagnose the cause:

1. Check refresh history

SELECT
  refresh_id,
  refresh_mode,
  duration_ms,
  rows_changed,
  status,
  error_message
FROM pgtrickle.get_refresh_history('dashboard_metrics')
WHERE status != 'success' OR duration_ms > 10000
ORDER BY refresh_id DESC
LIMIT 20;

Look for:

- status = 'error': failed refreshes. Check error_message.
- duration_ms spikes: slow refreshes that may exceed the SLA window.
- refresh_mode = 'FULL': fallbacks from DIFFERENTIAL to FULL, which are usually slower.

2. Check for mode fallbacks

SELECT
  COUNT(*) FILTER (WHERE refresh_mode = 'DIFFERENTIAL') AS differential_count,
  COUNT(*) FILTER (WHERE refresh_mode = 'FULL') AS full_count,
  AVG(duration_ms) FILTER (WHERE refresh_mode = 'DIFFERENTIAL') AS avg_diff_ms,
  AVG(duration_ms) FILTER (WHERE refresh_mode = 'FULL') AS avg_full_ms
FROM pgtrickle.get_refresh_history('dashboard_metrics')
WHERE created_at > NOW() - INTERVAL '1 hour';

If FULL refreshes are frequent and slow, investigate why DIFFERENTIAL is falling back (change ratio too high, spilling, etc.).

3. Check change buffer sizes

SELECT * FROM pgtrickle.pgt_cdc_status
WHERE pgt_name = 'dashboard_metrics';

Large change buffers → larger deltas → slower DIFFERENTIAL refreshes → potential SLA violations.


Error Budget Across All Stream Tables

For a system-wide view:

SELECT
  s.name,
  s.schedule,
  sla.p50_refresh_ms,
  sla.p99_refresh_ms,
  sla.freshness_lag_s,
  sla.error_budget_pct,
  CASE
    WHEN sla.error_budget_pct >= 95 THEN 'healthy'
    WHEN sla.error_budget_pct >= 50 THEN 'warning'
    ELSE 'critical'
  END AS status
FROM pgtrickle.pgt_stream_tables s
CROSS JOIN LATERAL pgtrickle.sla_summary(s.name) sla
ORDER BY sla.error_budget_pct ASC;

This ranks all stream tables by error budget health. Tables at the top of the list need attention.
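
If you query this ranking often, it can be wrapped in a view (a convenience sketch; the view name is arbitrary):

-- Convenience view over the ranking query above
CREATE VIEW pgtrickle_sla_overview AS
SELECT s.name, s.schedule, sla.*
FROM pgtrickle.pgt_stream_tables s
CROSS JOIN LATERAL pgtrickle.sla_summary(s.name) sla;

-- e.g., list tables currently burning budget:
SELECT name FROM pgtrickle_sla_overview WHERE error_budget_pct < 50;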


Prometheus Integration

pg_trickle exports SLA metrics as Prometheus gauges:

pg_trickle_refresh_p50_ms{stream_table="dashboard_metrics"} 4.8
pg_trickle_refresh_p99_ms{stream_table="dashboard_metrics"} 23.1
pg_trickle_freshness_lag_seconds{stream_table="dashboard_metrics"} 2.3
pg_trickle_error_budget_pct{stream_table="dashboard_metrics"} 94.2

Use these in Grafana dashboards and alerting rules:

# Prometheus alert rule
- alert: StreamTableSLABreach
  expr: pg_trickle_error_budget_pct < 20
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Stream table {{ $labels.stream_table }} error budget below 20%"

Summary

sla_summary() gives you SRE-style visibility into stream table health: percentile latencies, freshness lag, error counts, and error budget.

Set SLAs per tier (critical/standard/background). Monitor the error budget. Alert when it drops below a threshold. Diagnose with refresh history, mode fallback analysis, and change buffer checks.

The goal: know you’re meeting your freshness SLA before a customer asks why the dashboard is stale. The error budget tells you exactly how much headroom you have left.