RStats Type Documentation

Overview

The RStats type implements numerically stable running statistics using Welford’s algorithm. This document describes the design decisions, implementation details, and rationale for the sentinel value approach used to represent empty state.

Sentinel Value Design

Design Choice: Zero Sentinels

The RStats type uses count=0 with all other fields set to 0.0 as the canonical empty state representation.

Rationale

Why Count=0 is Primary

  • Unambiguous: count=0 is the natural definition of “no data accumulated”
  • Simple: Single boolean check (count == 0) for emptiness
  • No conflicts: Cannot collide with legitimate statistics (any real data has count >= 1)

Why All Fields Must Be Zero When Empty

While count=0 alone is sufficient to identify empty state, we enforce that all other fields (mean, m2, min, max) must also be 0.0 when count=0. This provides:

  1. Canonical Representation

    • Empty state has exactly one valid serialization: (count:0,mean:0,min:0,max:0,stddev:0)
    • Enables byte-for-byte comparison of empty values
    • Simplifies debugging and testing
  2. Clean Display

    • Text output shows intuitive zeros: (count:0,mean:0,min:0,max:0,stddev:0)
    • No confusing sentinel values like -1 that might suggest data
  3. Corruption Detection

    • If count=0 but other fields are non-zero, we know data is corrupted
    • Catches bugs in serialization/deserialization
    • Detects memory corruption or improper initialization
  4. Consistency Across Formats

    • Binary and text formats produce identical empty representations
    • No format-specific edge cases

Implementation: Defense in Depth

The canonical empty state (count=0, all zeros) is enforced at four validation points:

1. Text Input Validation (rstats_in)

Purpose: Reject malformed text input from users or external systems When: Parsing SQL input like SELECT '(count:0,mean:1,...)'::rstats Error Code: ERRCODE_INVALID_TEXT_REPRESENTATION

2. Binary Input Validation (rstats_recv)

Purpose: Catch corruption in network protocol or disk storage When: Receiving binary data from client or reading from disk Error Code: ERRCODE_DATA_CORRUPTED

3. Binary Output Validation (rstats_send)

Purpose: Catch bugs in C code that corrupts RStats in memory When: Before sending binary data to client Error Code: ERRCODE_DATA_CORRUPTED Note: Includes hint that this is an internal bug

4. Runtime Emptiness Check (rstats_is_empty)

Purpose: Guard any code path checking if statistics are empty When: Called by rstats_add_value() to detect lazy initialization Error Code: ERRCODE_DATA_CORRUPTED

Trade-offs and Limitations

Cannot Distinguish Empty vs. Init(0)

The current design cannot distinguish between: - Truly empty statistics: rstats_set_empty() - Statistics initialized with single zero value: rstats_init_internal(&stats, 0.0)

Both result in: count=1 for init(0), but count=0 for empty.

This is acceptable because: 1. The count field correctly reflects the difference (0 vs. 1) 2. Both are valid states with clear semantics 3. The distinction is not operationally important 4. User can check count if they need to distinguish

Memory Layout

typedef struct RStats {
    int64   count;  /* 8 bytes - Primary empty indicator */
    double  mean;   /* 8 bytes - Must be 0.0 when count=0 */
    double  m2;     /* 8 bytes - Must be 0.0 when count=0 */
    double  min;    /* 8 bytes - Must be 0.0 when count=0 */
    double  max;    /* 8 bytes - Must be 0.0 when count=0 */
} RStats;
/* Total: 40 bytes, fixed-size, no varlena header */

Performance Considerations

The sentinel validation has minimal performance impact:

  • Validation Cost: 4 floating-point equality comparisons
  • Compiler Optimization: Often optimized to SIMD comparison
  • Zero Overhead in Happy Path: If count > 0, skip validation entirely
  • Only on Boundaries: Validation only at I/O and emptiness checks, not during computation

Compatibility and Migration

Future Changes

If sentinel approach needs modification in the future:

  1. Update this documentation with rationale
  2. Increment type version (create rstats--0.1--0.2.sql migration script)
  3. Add compatibility layer if needed
  4. Update all four validation points consistently

RStats Operators and Functions

This section documents the available operators and their semantics.

Type Casts

Numeric to RStats

Implicit casts from numeric types initialize RStats with a single value:

-- From double precision
SELECT 42.5::rstats;
-- Result: (count:1,mean:42.5,min:42.5,max:42.5,stddev:0)

-- From integer
SELECT 100::rstats;
-- Result: (count:1,mean:100,min:100,max:100,stddev:0)

-- From numeric
SELECT 3.14159::numeric::rstats;
-- Result: (count:1,mean:3.14159,min:3.14159,max:3.14159,stddev:0)

Semantics: Creates RStats initialized with a single data point.

Binary Serialization Casts

Assignment casts for binary serialization (backup/restore):

-- RStats to bytea (serialization)
SELECT rstats()::bytea;
-- Result: \x0000000000000000...

-- Bytea to RStats (deserialization)
SELECT '\x0000000000000000...'::bytea::rstats;
-- Result: (count:0,mean:0,min:0,max:0,stddev:0)

-- Round-trip preserves state
SELECT (10::rstats + 20 + 30)::bytea::rstats;
-- Result: (count:3,mean:20,min:10,max:30,stddev:10)

Semantics: Enables binary backup/restore and external storage.

Addition Operator (+)

Accumulates a new value into running statistics:

-- Add double precision
SELECT 10.0::rstats + 20.0 + 30.0;
-- Result: (count:3,mean:20,min:10,max:30,stddev:10)

-- Add integer
SELECT 5::rstats + 10 + 15;
-- Result: (count:3,mean:10,min:5,max:15,stddev:5)

-- Chain multiple additions
SELECT ((10.0::rstats + 20.0) + 30.0) + 40.0;
-- Result: (count:4,mean:25,min:10,max:40,stddev:12.91)

Semantics: - Updates mean, stddev (via m2), min, max using Welford’s algorithm - Order-dependent: accumulates values sequentially - Deterministic: same sequence → same result - Handles NULL gracefully: stats + NULL returns NULL

Equality Operator (=)

Tests for exact identity (bit-identical state):

-- Exact match: same sequence
SELECT (10::rstats + 20 + 30) = (10::rstats + 20 + 30);
-- Result: true

-- Different sequence, same statistics
SELECT (10::rstats + 20) = (15::rstats + 15);
-- Result: false (mean is same, but internal state differs)

-- Empty state comparison
SELECT rstats() = rstats();
-- Result: true

Semantics: - Identity, not numerical equivalence: Tests if two RStats accumulated the exact same sequence of values - Deterministic algorithm: Welford’s algorithm guarantees same inputs → bit-identical state - Exact float comparison: No epsilon tolerance (by design) - Use cases: Caching, deduplication, identity checks in hash tables

For numerical similarity, compare extracted statistics: sql -- Check if means are approximately equal SELECT ABS((stats1 -> 'mean') - (stats2 -> 'mean')) < 0.001;

Distance Operator (<->)

Calculates the Mahalanobis distance between two statistical distributions:

-- Compare two distributions
SELECT (10::rstats + 20 + 30) <-> (15::rstats + 25 + 35);
-- Result: small value (similar distributions)

SELECT (10::rstats + 20 + 30) <-> (100::rstats + 200 + 300);
-- Result: larger value (dissimilar distributions)

-- Find queries with statistics most different from a reference
SELECT queryid, stats <-> reference_stats AS distance
FROM pg_track_optimizer, (SELECT 0::rstats + 1 + 2 AS reference_stats) ref
ORDER BY distance DESC;

Semantics: - Returns double precision representing statistical distance - Lower values indicate more similar distributions - Commutative: a <-> b equals b <-> a - Useful for clustering queries by statistical similarity or detecting outliers

Field Accessor Operator (->)

Extracts statistical properties as double precision:

SELECT
    stats -> 'count' AS count,
    stats -> 'mean' AS mean,
    stats -> 'stddev' AS stddev,
    stats -> 'min' AS min,
    stats -> 'max' AS max
FROM (SELECT (10::rstats + 20 + 30) AS stats) t;

Available Fields: - count: Number of accumulated values (int64 → float8) - mean: Arithmetic mean - stddev: Sample standard deviation (n-1 denominator) - min: Minimum value observed - max: Maximum value observed

Semantics: - Returns double precision (cast as needed) - Stddev uses sample formula: sqrt(m2 / (count - 1)) - Returns 0 stddev for count ≤ 1 - Works with expression indexes: CREATE INDEX ON table ((stats -> 'mean'))

Constructor Functions

Empty Constructor

SELECT rstats();
-- Result: (count:0,mean:0,min:0,max:0,stddev:0)

Semantics: Creates canonical empty state.

Polymorphic Constructor

-- Accepts any numeric-convertible type
SELECT rstats(42::int2);
SELECT rstats(3.14::float4);
SELECT rstats(100::bigint);

Semantics: Initializes with a single value (delegates to type-specific casts).

Aggregate Function

rstats_agg(double precision)

Aggregates multiple values into a single RStats object:

-- Aggregate values from a table column
SELECT rstats_agg(value) FROM measurements;
-- Result: (count:N,mean:...,min:...,max:...,stddev:...)

-- Aggregate with grouping
SELECT category, rstats_agg(price) AS price_stats
FROM products
GROUP BY category;

-- Combine with field accessor
SELECT
    category,
    rstats_agg(price) -> 'mean' AS avg_price,
    rstats_agg(price) -> 'stddev' AS price_stddev
FROM products
GROUP BY category;

Semantics: - Collects all non-NULL values in the group into running statistics - Returns canonical empty state if no values are aggregated - Uses Welford’s algorithm internally for numerical stability - Order of aggregation may affect floating-point rounding (typically negligible)

References

  • Welford’s Algorithm: https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford’s_online_algorithm
  • PostgreSQL Type System: https://www.postgresql.org/docs/current/xtypes.html
  • Error Codes: https://www.postgresql.org/docs/current/errcodes-appendix.html

Last Updated: 2026-01-10 Author: Andrei Lepikhov Reviewers: Claude Code Review