RStats Type Documentation
Overview
The RStats type implements numerically stable running statistics using Welford’s algorithm. This document describes the design decisions, implementation details, and rationale for the sentinel value approach used to represent empty state.
Sentinel Value Design
Design Choice: Zero Sentinels
The RStats type uses count=0 with all other fields set to 0.0 as the canonical empty state representation.
Rationale
Why Count=0 is Primary
- Unambiguous: count=0 is the natural definition of “no data accumulated”
- Simple: Single boolean check
(count == 0)for emptiness - No conflicts: Cannot collide with legitimate statistics (any real data has count >= 1)
Why All Fields Must Be Zero When Empty
While count=0 alone is sufficient to identify empty state, we enforce that all other fields (mean, m2, min, max) must also be 0.0 when count=0. This provides:
Canonical Representation
- Empty state has exactly one valid serialization:
(count:0,mean:0,min:0,max:0,stddev:0) - Enables byte-for-byte comparison of empty values
- Simplifies debugging and testing
- Empty state has exactly one valid serialization:
Clean Display
- Text output shows intuitive zeros:
(count:0,mean:0,min:0,max:0,stddev:0) - No confusing sentinel values like -1 that might suggest data
- Text output shows intuitive zeros:
Corruption Detection
- If count=0 but other fields are non-zero, we know data is corrupted
- Catches bugs in serialization/deserialization
- Detects memory corruption or improper initialization
Consistency Across Formats
- Binary and text formats produce identical empty representations
- No format-specific edge cases
Implementation: Defense in Depth
The canonical empty state (count=0, all zeros) is enforced at four validation points:
1. Text Input Validation (rstats_in)
Purpose: Reject malformed text input from users or external systems
When: Parsing SQL input like SELECT '(count:0,mean:1,...)'::rstats
Error Code: ERRCODE_INVALID_TEXT_REPRESENTATION
2. Binary Input Validation (rstats_recv)
Purpose: Catch corruption in network protocol or disk storage
When: Receiving binary data from client or reading from disk
Error Code: ERRCODE_DATA_CORRUPTED
3. Binary Output Validation (rstats_send)
Purpose: Catch bugs in C code that corrupts RStats in memory
When: Before sending binary data to client
Error Code: ERRCODE_DATA_CORRUPTED
Note: Includes hint that this is an internal bug
4. Runtime Emptiness Check (rstats_is_empty)
Purpose: Guard any code path checking if statistics are empty
When: Called by rstats_add_value() to detect lazy initialization
Error Code: ERRCODE_DATA_CORRUPTED
Trade-offs and Limitations
Cannot Distinguish Empty vs. Init(0)
The current design cannot distinguish between:
- Truly empty statistics: rstats_set_empty()
- Statistics initialized with single zero value: rstats_init_internal(&stats, 0.0)
Both result in: count=1 for init(0), but count=0 for empty.
This is acceptable because: 1. The count field correctly reflects the difference (0 vs. 1) 2. Both are valid states with clear semantics 3. The distinction is not operationally important 4. User can check count if they need to distinguish
Memory Layout
typedef struct RStats {
int64 count; /* 8 bytes - Primary empty indicator */
double mean; /* 8 bytes - Must be 0.0 when count=0 */
double m2; /* 8 bytes - Must be 0.0 when count=0 */
double min; /* 8 bytes - Must be 0.0 when count=0 */
double max; /* 8 bytes - Must be 0.0 when count=0 */
} RStats;
/* Total: 40 bytes, fixed-size, no varlena header */
Performance Considerations
The sentinel validation has minimal performance impact:
- Validation Cost: 4 floating-point equality comparisons
- Compiler Optimization: Often optimized to SIMD comparison
- Zero Overhead in Happy Path: If count > 0, skip validation entirely
- Only on Boundaries: Validation only at I/O and emptiness checks, not during computation
Compatibility and Migration
Future Changes
If sentinel approach needs modification in the future:
- Update this documentation with rationale
- Increment type version (create
rstats--0.1--0.2.sqlmigration script) - Add compatibility layer if needed
- Update all four validation points consistently
RStats Operators and Functions
This section documents the available operators and their semantics.
Type Casts
Numeric to RStats
Implicit casts from numeric types initialize RStats with a single value:
-- From double precision
SELECT 42.5::rstats;
-- Result: (count:1,mean:42.5,min:42.5,max:42.5,stddev:0)
-- From integer
SELECT 100::rstats;
-- Result: (count:1,mean:100,min:100,max:100,stddev:0)
-- From numeric
SELECT 3.14159::numeric::rstats;
-- Result: (count:1,mean:3.14159,min:3.14159,max:3.14159,stddev:0)
Semantics: Creates RStats initialized with a single data point.
Binary Serialization Casts
Assignment casts for binary serialization (backup/restore):
-- RStats to bytea (serialization)
SELECT rstats()::bytea;
-- Result: \x0000000000000000...
-- Bytea to RStats (deserialization)
SELECT '\x0000000000000000...'::bytea::rstats;
-- Result: (count:0,mean:0,min:0,max:0,stddev:0)
-- Round-trip preserves state
SELECT (10::rstats + 20 + 30)::bytea::rstats;
-- Result: (count:3,mean:20,min:10,max:30,stddev:10)
Semantics: Enables binary backup/restore and external storage.
Addition Operator (+)
Accumulates a new value into running statistics:
-- Add double precision
SELECT 10.0::rstats + 20.0 + 30.0;
-- Result: (count:3,mean:20,min:10,max:30,stddev:10)
-- Add integer
SELECT 5::rstats + 10 + 15;
-- Result: (count:3,mean:10,min:5,max:15,stddev:5)
-- Chain multiple additions
SELECT ((10.0::rstats + 20.0) + 30.0) + 40.0;
-- Result: (count:4,mean:25,min:10,max:40,stddev:12.91)
Semantics:
- Updates mean, stddev (via m2), min, max using Welford’s algorithm
- Order-dependent: accumulates values sequentially
- Deterministic: same sequence → same result
- Handles NULL gracefully: stats + NULL returns NULL
Equality Operator (=)
Tests for exact identity (bit-identical state):
-- Exact match: same sequence
SELECT (10::rstats + 20 + 30) = (10::rstats + 20 + 30);
-- Result: true
-- Different sequence, same statistics
SELECT (10::rstats + 20) = (15::rstats + 15);
-- Result: false (mean is same, but internal state differs)
-- Empty state comparison
SELECT rstats() = rstats();
-- Result: true
Semantics: - Identity, not numerical equivalence: Tests if two RStats accumulated the exact same sequence of values - Deterministic algorithm: Welford’s algorithm guarantees same inputs → bit-identical state - Exact float comparison: No epsilon tolerance (by design) - Use cases: Caching, deduplication, identity checks in hash tables
For numerical similarity, compare extracted statistics:
sql
-- Check if means are approximately equal
SELECT ABS((stats1 -> 'mean') - (stats2 -> 'mean')) < 0.001;
Distance Operator (<->)
Calculates the Mahalanobis distance between two statistical distributions:
-- Compare two distributions
SELECT (10::rstats + 20 + 30) <-> (15::rstats + 25 + 35);
-- Result: small value (similar distributions)
SELECT (10::rstats + 20 + 30) <-> (100::rstats + 200 + 300);
-- Result: larger value (dissimilar distributions)
-- Find queries with statistics most different from a reference
SELECT queryid, stats <-> reference_stats AS distance
FROM pg_track_optimizer, (SELECT 0::rstats + 1 + 2 AS reference_stats) ref
ORDER BY distance DESC;
Semantics:
- Returns double precision representing statistical distance
- Lower values indicate more similar distributions
- Commutative: a <-> b equals b <-> a
- Useful for clustering queries by statistical similarity or detecting outliers
Field Accessor Operator (->)
Extracts statistical properties as double precision:
SELECT
stats -> 'count' AS count,
stats -> 'mean' AS mean,
stats -> 'stddev' AS stddev,
stats -> 'min' AS min,
stats -> 'max' AS max
FROM (SELECT (10::rstats + 20 + 30) AS stats) t;
Available Fields:
- count: Number of accumulated values (int64 → float8)
- mean: Arithmetic mean
- stddev: Sample standard deviation (n-1 denominator)
- min: Minimum value observed
- max: Maximum value observed
Semantics:
- Returns double precision (cast as needed)
- Stddev uses sample formula: sqrt(m2 / (count - 1))
- Returns 0 stddev for count ≤ 1
- Works with expression indexes: CREATE INDEX ON table ((stats -> 'mean'))
Constructor Functions
Empty Constructor
SELECT rstats();
-- Result: (count:0,mean:0,min:0,max:0,stddev:0)
Semantics: Creates canonical empty state.
Polymorphic Constructor
-- Accepts any numeric-convertible type
SELECT rstats(42::int2);
SELECT rstats(3.14::float4);
SELECT rstats(100::bigint);
Semantics: Initializes with a single value (delegates to type-specific casts).
Aggregate Function
rstats_agg(double precision)
Aggregates multiple values into a single RStats object:
-- Aggregate values from a table column
SELECT rstats_agg(value) FROM measurements;
-- Result: (count:N,mean:...,min:...,max:...,stddev:...)
-- Aggregate with grouping
SELECT category, rstats_agg(price) AS price_stats
FROM products
GROUP BY category;
-- Combine with field accessor
SELECT
category,
rstats_agg(price) -> 'mean' AS avg_price,
rstats_agg(price) -> 'stddev' AS price_stddev
FROM products
GROUP BY category;
Semantics: - Collects all non-NULL values in the group into running statistics - Returns canonical empty state if no values are aggregated - Uses Welford’s algorithm internally for numerical stability - Order of aggregation may affect floating-point rounding (typically negligible)
References
- Welford’s Algorithm: https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford’s_online_algorithm
- PostgreSQL Type System: https://www.postgresql.org/docs/current/xtypes.html
- Error Codes: https://www.postgresql.org/docs/current/errcodes-appendix.html
Last Updated: 2026-01-10 Author: Andrei Lepikhov Reviewers: Claude Code Review