PLAN: External Test Suites for pg_trickle

Status: Proposed
Date: 2026-02-25
Branch: main
Scope: Adopt public SQL test suites and benchmarks to validate correctness and performance of stream tables beyond the project’s existing test infrastructure.


Table of Contents

  1. Motivation
  2. Suite 1: TPC-H-Derived — Analytical Workload
  3. Suite 2: sqllogictest — Broad Correctness
  4. Suite 3: Join Order Benchmark (JOB) — Real-World Data
  5. Suite 4: Feldera / DBSP Benchmarks — Theory Peer Comparison
  6. Implementation Roadmap
  7. Infrastructure Requirements

Motivation

pg_trickle’s existing test suite (878 unit tests, 22 E2E test suites) validates internal correctness. What it does not cover:

  • Industry-standard workloads — are stream tables effective on the queries real analysts write?
  • Scale — does differential refresh maintain its advantage at 1 GB, 10 GB, 100 GB?
  • Breadth — are there SQL patterns we haven’t thought to test that produce incorrect deltas?
  • Peer comparison — how does pg_trickle’s approach compare to standalone IVM engines like Feldera on the same workload?

Public test suites address all four gaps.


Suite 1: TPC-H-Derived — Analytical Workload

TPC-H Fair Use: Our test suite is derived from the TPC-H Benchmark specification but does not constitute a TPC-H Benchmark result. We use a custom SQL data generator (not dbgen), modified queries, and a non-standard RF3. “TPC-H” is a trademark of the Transaction Processing Performance Council (tpc.org).

Why a TPC-H-Derived Workload

The TPC-H specification defines an industry-standard decision-support schema and query set. Its 22 queries cover joins (up to 8 tables), aggregates, subqueries (scalar, EXISTS, IN), HAVING, CASE WHEN, DISTINCT — all operators pg_trickle supports in DIFFERENTIAL mode. Its two refresh functions (RF1: bulk INSERT, RF2: bulk DELETE) directly exercise the CDC → delta refresh pipeline.

Specification: https://www.tpc.org/tpch/
Our data generator: Custom pure-SQL (generate_series), not dbgen
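
As a sketch of the pure-SQL approach (illustrative only; the real generator must follow the spec's value ranges and cardinalities, and the column subset here is abbreviated):

```sql
-- Sketch: populate a lineitem-like table from generate_series alone.
-- SF-1 lineitem has ~6M rows; ~1.5M orders × 4 lines approximates that
-- shape. Distributions are simplified placeholders, not spec-exact.
INSERT INTO lineitem (l_orderkey, l_linenumber, l_quantity,
                      l_extendedprice, l_discount, l_shipdate)
SELECT o.n,
       l.n,
       1 + floor(random() * 50)::int,             -- quantity 1–50
       round((random() * 100000)::numeric, 2),    -- extended price
       round((random() * 0.10)::numeric, 2),      -- discount 0–10%
       DATE '1992-01-01' + (random() * 2500)::int -- ship date
FROM generate_series(1, 1500000) AS o(n),
     generate_series(1, 4) AS l(n);
```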

Query Compatibility Analysis

All 22 TPC-H queries are compatible with pg_trickle:

| Category | Queries | Modification |
|---|---|---|
| Works as-is | Q6, Q14, Q17, Q19 | None |
| Remove ORDER BY only | Q1, Q4, Q5, Q7, Q8, Q9, Q11, Q12, Q13, Q15, Q16, Q20, Q22 | Cosmetic — ORDER BY is silently ignored anyway |
| TopK (ORDER BY + LIMIT) | Q2, Q3, Q10, Q18, Q21 | ✅ Supported — ORDER BY + LIMIT is accepted as the TopK pattern |
  • 0 queries are blocked by unsupported SQL features.
  • Q15 additionally requires inlining its CREATE VIEW as a CTE or subquery, since pg_trickle requires a single SELECT statement (see the sketch after this list).
  • No TPC-H query uses GROUPING SETS, NATURAL JOIN, or recursive CTEs.
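
A minimal sketch of Q15 with its revenue view inlined as a CTE (standard Q15 text with the trailing ORDER BY dropped per the plan; the date is the spec's example substitution parameter):

```sql
-- Q15 rewritten as a single SELECT: the CREATE VIEW becomes a CTE.
WITH revenue AS (
    SELECT l_suppkey AS supplier_no,
           SUM(l_extendedprice * (1 - l_discount)) AS total_revenue
    FROM lineitem
    WHERE l_shipdate >= DATE '1996-01-01'
      AND l_shipdate <  DATE '1996-01-01' + INTERVAL '3 months'
    GROUP BY l_suppkey
)
SELECT s_suppkey, s_name, s_address, s_phone, total_revenue
FROM supplier
JOIN revenue ON s_suppkey = supplier_no
WHERE total_revenue = (SELECT MAX(total_revenue) FROM revenue);
```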

Detailed Query Matrix

| # | Name | Key SQL Features | Modification |
|---|---|---|---|
| Q1 | Pricing Summary | GROUP BY, SUM/AVG/COUNT, WHERE (date) | None |
| Q2 | Minimum Cost Supplier | Correlated scalar subquery (MIN), 8-table join | TopK |
| Q3 | Shipping Priority | 3-table join, GROUP BY, SUM | TopK |
| Q4 | Order Priority Checking | EXISTS subquery, GROUP BY, COUNT | None |
| Q5 | Local Supplier Volume | 6-table join, GROUP BY, SUM | None |
| Q6 | Forecasting Revenue | Single-table SUM, WHERE filters | None |
| Q7 | Volume Shipping | 6-table join, CASE WHEN, SUM | None |
| Q8 | National Market Share | 8-table join, CASE WHEN, subquery in FROM | None |
| Q9 | Product Type Profit | 6-table join, expressions, LIKE | None |
| Q10 | Returned Item Reporting | 4-table join, GROUP BY, SUM | TopK |
| Q11 | Important Stock ID | HAVING with scalar subquery, 3-table join | None |
| Q12 | Shipping Modes | SUM(CASE WHEN), IN, BETWEEN | None |
| Q13 | Customer Distribution | LEFT OUTER JOIN, nested GROUP BY, subquery in FROM | None |
| Q14 | Promotion Effect | Conditional SUM ratio | None |
| Q15 | Top Supplier | View → inline as CTE; MAX subquery | Inline view as CTE |
| Q16 | Parts/Supplier | COUNT(DISTINCT), NOT IN subquery, NOT LIKE | None |
| Q17 | Small-Quantity Revenue | Scalar subquery (AVG), 2-table join | None |
| Q18 | Large Volume Customer | IN subquery with HAVING, 3-table join | TopK |
| Q19 | Discounted Revenue | Complex OR/AND WHERE, SUM | None |
| Q20 | Potential Promotion | Nested IN subqueries (2 levels) | None |
| Q21 | Suppliers Waiting | EXISTS + NOT EXISTS, multi-join | TopK |
| Q22 | Global Sales Opportunity | NOT EXISTS, scalar subquery, SUBSTRING | None |

SQL Feature Coverage by TPC-H

| SQL Feature | TPC-H Queries Using It | pg_trickle Support |
|---|---|---|
| INNER JOIN | Q2, Q3, Q5, Q7–Q12, Q15, Q17–Q21 | ✅ Full |
| LEFT OUTER JOIN | Q13 | ✅ Full |
| Multi-table join (3+) | Q2, Q3, Q5, Q7–Q11, Q18, Q20, Q21 | ✅ Full |
| GROUP BY | Q1, Q3–Q5, Q7–Q13, Q16, Q18, Q22 | ✅ Full |
| SUM | Q1, Q3, Q5–Q10, Q12, Q14, Q17–Q19 | ✅ Algebraic |
| AVG | Q1, Q17, Q22 | ✅ Algebraic |
| COUNT / COUNT(DISTINCT) | Q1, Q4, Q13, Q16, Q21 | ✅ Algebraic / ref-counted |
| MIN / MAX | Q2, Q15 | ✅ Semi-algebraic |
| HAVING | Q11, Q16, Q18 | ✅ Full |
| CASE WHEN | Q7, Q8, Q12, Q14, Q22 | ✅ Full |
| EXISTS / NOT EXISTS | Q4, Q21, Q22 | ✅ Semi-join / anti-join |
| IN / NOT IN subquery | Q12, Q16, Q18, Q20 | ✅ Semi-join / anti-join |
| Scalar subquery | Q2, Q11, Q15, Q17, Q22 | ✅ Full |
| Subquery in FROM | Q8, Q13, Q15, Q22 | ✅ Full |
| BETWEEN | Q1, Q3–Q6, Q12, Q15, Q20 | ✅ Full |
| LIKE / NOT LIKE | Q9, Q13, Q16 | ✅ Full |
| LIMIT (TopK) | Q2, Q3, Q10, Q18, Q21 | ✅ Supported (ORDER BY + LIMIT) |

Refresh Functions

| Function | Description | CDC Path |
|---|---|---|
| RF1 | Bulk INSERT into orders + lineitem | Triggers capture → differential refresh |
| RF2 | DELETE from orders + lineitem by key | Triggers capture → differential refresh |

Both RF1 and RF2 exercise pg_trickle’s core pipeline. No special handling needed — INSERTs and DELETEs are the native change types that CDC captures.
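
A minimal sketch of one RF1+RF2 cycle; the staging-table names (rf1_orders_staging, rf2_keys, etc.) are hypothetical, and in practice the refresh rows come from the data generator:

```sql
-- RF1-style bulk insert (sketch): add a batch of new orders and their
-- line items inside one transaction so CDC sees a single change set.
BEGIN;
INSERT INTO orders   SELECT * FROM rf1_orders_staging;
INSERT INTO lineitem SELECT * FROM rf1_lineitem_staging;
COMMIT;

-- RF2-style bulk delete (sketch): remove a batch of old orders by key,
-- children first to respect the lineitem → orders foreign key.
BEGIN;
DELETE FROM lineitem WHERE l_orderkey IN (SELECT o_orderkey FROM rf2_keys);
DELETE FROM orders   WHERE o_orderkey IN (SELECT o_orderkey FROM rf2_keys);
COMMIT;
```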

Proposed Test Plan

Phase T1-A: Correctness (Scale Factor 1, ~1 GB)

  1. Load TPC-H SF-1 data using the custom pure-SQL generator.
  2. Create 22 stream tables (one per query; cosmetic ORDER BY removed, TopK queries kept intact).
  3. Perform initial FULL refresh for each.
  4. Run RF1 (inserts) + RF2 (deletes) — one cycle.
  5. Refresh all stream tables in DIFFERENTIAL mode.
  6. Compare every stream table’s contents against a fresh FULL refresh (see the comparison sketch after this list).
  7. Pass criterion: zero row differences across all 22 queries.
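
Step 6's comparison can be expressed as a symmetric difference with multiplicity; tpch_q1_stream and tpch_q1_full are hypothetical names for a stream table and its freshly FULL-refreshed counterpart:

```sql
-- Zero rows returned means the DIFFERENTIAL result matches a full
-- recompute. EXCEPT ALL in both directions checks duplicate-row
-- multiplicity as well as set membership.
(TABLE tpch_q1_stream EXCEPT ALL TABLE tpch_q1_full)
UNION ALL
(TABLE tpch_q1_full EXCEPT ALL TABLE tpch_q1_stream);
```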

Phase T1-B: Performance (Scale Factors 1, 10, 100)

  1. For each scale factor:
    1. Load data, create stream tables, initial FULL refresh.
    2. Run 10 RF1+RF2 cycles with DIFFERENTIAL refresh.
    3. Run 10 RF1+RF2 cycles with FULL refresh.
    4. Record wall-clock time per refresh per query.
  2. Output: Speedup ratio table (FULL / DIFFERENTIAL) × query × scale factor.
  3. Integrate with the existing e2e_bench_tests.rs framework.

Phase T1-C: Sustained Churn

  1. SF-10. Run 100 RF1+RF2 cycles with DIFFERENTIAL refresh.
  2. After every 10th cycle, verify correctness against FULL refresh.
  3. Record cumulative drift, memory usage, and change buffer table sizes (see the monitoring sketch after this list).
  4. Pass criterion: zero cumulative drift; change buffers stay bounded.
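
For step 3, standard catalog functions can track buffer growth; the name pattern below is a placeholder, since the actual naming of pg_trickle's change buffer tables is internal:

```sql
-- Sketch: sample change-buffer sizes after each cycle. The pattern
-- 'trickle_changes_%' is a placeholder for pg_trickle's real internal
-- change buffer naming and must be substituted in.
SELECT relname,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
WHERE relname LIKE 'trickle_changes_%'
ORDER BY pg_total_relation_size(oid) DESC;
```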

Files

| File | Purpose |
|---|---|
| tests/tpch/schema.sql | TPC-H DDL (8 tables with PKs) |
| tests/tpch/queries/ | 22 .sql files (cosmetic ORDER BY stripped; TopK queries intact) |
| tests/tpch/load.sh | SQL generator + COPY loader for a given SF |
| tests/e2e_tpch_tests.rs | Correctness + benchmark test harness |

Suite 2: sqllogictest — Broad Correctness

Why sqllogictest

Originally from SQLite, sqllogictest contains millions of SQL statements with expected results. CockroachDB, DuckDB, and DataFusion all maintain forks. Its value is breadth — covering SQL patterns no developer would think to write manually. For pg_trickle, the key test is: does a stream table in DIFFERENTIAL mode produce the same result as FULL mode for every query?

URL: https://www.sqlite.org/sqllogictest/
Rust runner: https://github.com/risinglightdb/sqllogictest-rs

Approach

We do not need to run the full sqllogictest corpus verbatim. Instead, we use it as a query generator:

  1. Filter the corpus for queries that use pg_trickle-supported features (joins, aggregates, subqueries, etc.) and exclude unsupported ones (LIMIT/OFFSET outside the TopK pattern, GROUPING SETS, etc.).
  2. For each qualifying query Q over tables T₁…Tₙ:
    1. Create base tables, populate with seed data.
    2. Create stream table ST with defining query Q in DIFFERENTIAL mode.
    3. Initial FULL refresh → record contents as expected_full.
    4. Apply a small random change set to T₁…Tₙ (see the mutation sketch after this list).
    5. DIFFERENTIAL refresh → record contents as actual_diff.
    6. FULL refresh (from scratch) → record contents as expected_after.
    7. Assert: actual_diff == expected_after.
  3. Pass criterion: zero mismatches across the filtered corpus.
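
Step 2.4's "small random change set" might look like the following, where t1, id, and val stand in for whatever base table and columns the query references:

```sql
-- Sketch of a small random change set against one base table.
-- setseed makes the run reproducible across comparison runs.
SELECT setseed(0.42);

-- Delete a handful of random rows.
DELETE FROM t1
WHERE id IN (SELECT id FROM t1 ORDER BY random() LIMIT 5);

-- Update a handful more.
UPDATE t1 SET val = val + 1
WHERE id IN (SELECT id FROM t1 ORDER BY random() LIMIT 5);

-- Insert a few new rows with fresh keys.
INSERT INTO t1 (id, val)
SELECT (SELECT max(id) FROM t1) + g, (random() * 100)::int
FROM generate_series(1, 5) AS g;
```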

Scope

Focus on the test/index/random/ and test/select/ directories, which contain the highest density of relevant analytical queries. Skip:

  • Tests that are SQLite-specific (type affinity, etc.)
  • Tests requiring LIMIT/OFFSET
  • Tests with non-deterministic functions

Estimated Query Count

After filtering: ~5,000–15,000 qualifying queries. Each adds ~100 ms of test time at small scale, so a full correctness sweep runs roughly 8–25 minutes; budget 10–30 minutes with setup overhead.

Files

| File | Purpose |
|---|---|
| tests/sqllogictest/filter.py | Filter corpus → qualifying queries |
| tests/sqllogictest/runner.rs | Custom harness: create ST → mutate → compare |
| tests/e2e_slt_tests.rs | Top-level test entry point |

Suite 3: Join Order Benchmark (JOB) — Real-World Data

Why JOB

The Join Order Benchmark uses the real IMDB dataset (~3.6 GB) with 113 analytically complex queries, some joining 10+ tables. Its value is testing delta correctness under realistic data skew — foreign-key distributions that are far from uniform, unlike synthetic TPC-H data.

Paper: Leis et al., “How Good Are Query Optimizers, Really?” (PVLDB 2015)
URL: https://github.com/gregrahn/join-order-benchmark

Query Compatibility

JOB queries use INNER JOINs, WHERE with equality/range/LIKE, and minimal aggregation. Most require only ORDER BY / LIMIT removal. No GROUPING SETS, recursive CTEs, or window functions. Estimated compatibility: 110+/113 queries with trivial modifications.

Proposed Test Plan

  1. Load IMDB data (CSV import).
  2. Create stream tables for the 113 queries.
  3. Apply targeted mutations (INSERT/UPDATE/DELETE on high-skew tables like cast_info, movie_info; see the sketch after this list).
  4. DIFFERENTIAL refresh → compare against FULL refresh.
  5. Focus metric: correctness under skewed join cardinalities.
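
A sketch of step 3's targeted mutation; the column names (movie_id, nr_order) follow the public JOB/IMDB schema, and the predicate picks whichever movie has the most cast rows, i.e. the heaviest join skew:

```sql
-- Sketch: mutate the hottest key in cast_info so the delta hits the
-- most skewed join partner.
WITH hot AS (
    SELECT movie_id
    FROM cast_info
    GROUP BY movie_id
    ORDER BY count(*) DESC
    LIMIT 1
)
DELETE FROM cast_info
WHERE movie_id = (SELECT movie_id FROM hot)
  AND nr_order > 100;   -- trim deep cast lists on the hottest movie
```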

Files

| File | Purpose |
|---|---|
| tests/job/schema.sql | IMDB DDL (21 tables with PKs) |
| tests/job/queries/ | 113 .sql files (adapted) |
| tests/e2e_job_tests.rs | Correctness test harness |

Suite 4: Feldera / DBSP Benchmarks — Theory Peer Comparison

Why Feldera

Since pg_trickle is grounded in DBSP theory, comparing against Feldera (the reference DBSP implementation) on the same workloads provides a meaningful performance baseline. Feldera maintains benchmarks for Nexmark (streaming auction events) and TPC-H variants.

URL: https://github.com/feldera/feldera/tree/main/benchmark

Approach

This is not an apples-to-apples comparison (Feldera is a standalone streaming engine; pg_trickle runs inside PostgreSQL). The goal is to understand the overhead of the PostgreSQL execution model vs. a dedicated dataflow runtime, and to identify queries where pg_trickle’s delta SQL generation is suboptimal.

Nexmark Benchmark

Nexmark models an online auction system with 3 event types (Person, Auction, Bid) and 22 queries. Several Nexmark queries use window functions and time-based grouping. Relevant subset for pg_trickle:

| Query | Features | Compatible? |
|---|---|---|
| Q0 | Passthrough (no-op) | |
| Q1 | Projection + arithmetic | |
| Q2 | Filter | |
| Q3 | Join + filter | |
| Q4 | Join + GROUP BY + AVG + MAX | |
| Q5 | Window: COUNT per time window | ✅ (window function) |
| Q6 | Join + AVG + window | |
| Q7 | MAX in time window | |
| Q8 | Join (person-auction within window) | |
| Q9–Q22 | Various combinations | Most compatible |
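
As an example of the adaptation, Nexmark Q1 (dollar-to-euro conversion) becomes a plain projection over the bid table; the table and column names follow common Nexmark ports and are assumptions about our adapted schema:

```sql
-- Sketch of Nexmark Q1 as a stream table's defining query: convert bid
-- prices from dollars to euros using the benchmark's fixed 0.908 rate.
-- Column names (auction, bidder, price, date_time) follow common ports.
SELECT auction,
       bidder,
       0.908 * price AS price_eur,
       date_time
FROM bid;
```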

Proposed Test Plan

  1. Adapt Nexmark data generator for PostgreSQL tables.
  2. Create stream tables for compatible Nexmark queries.
  3. Measure:
    1. Throughput (events/sec processed per refresh cycle).
    2. Latency (time from event commit to its appearance in the stream table); see the probe sketch after this list.
    3. Compare against published Feldera numbers for the same hardware class.
  4. Publish results in docs/BENCHMARK.md as a separate section.
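
A generic latency probe for step 3.2 might look like this; the marker values and table names are placeholders, and the refresh invocation itself is pg_trickle-specific and left as a comment:

```sql
-- Sketch: measure commit-to-visibility latency for one marker event.
INSERT INTO bid (auction, bidder, price, date_time)
VALUES (99999, 1, 100, clock_timestamp());

-- <trigger a DIFFERENTIAL refresh of the bid stream table here;
--  the refresh API is pg_trickle-specific and not shown>

SELECT clock_timestamp() - date_time AS latency
FROM bid
WHERE auction = 99999;
```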

Files

| File | Purpose |
|---|---|
| tests/nexmark/schema.sql | Nexmark DDL (3 tables) |
| tests/nexmark/generator.sql | Data generation via generate_series |
| tests/nexmark/queries/ | Adapted Nexmark queries |
| tests/e2e_nexmark_tests.rs | Benchmark harness |

Implementation Roadmap

| Phase | Suite | Effort | Priority | Value |
|---|---|---|---|---|
| T1 | TPC-H correctness (SF-1) | 2–3 days | P0 | Industry-standard validation; catches delta bugs in joins + aggregates |
| T2 | TPC-H performance (SF-1/10/100) | 2–3 days | P1 | Quantified speedup numbers for README and marketing |
| T3 | sqllogictest (filtered corpus) | 3–5 days | P1 | Maximum breadth; confident correctness claim |
| T4 | JOB correctness | 2 days | P2 | Real-world data skew validation |
| T5 | Feldera/Nexmark comparison | 3–5 days | P2 | Competitive positioning; identifies optimization gaps |

Recommended start: T1 → T2 → T3. These three phases provide the highest signal-to-effort ratio and produce publishable benchmark numbers.


Infrastructure Requirements

Docker

All test suites run inside the existing E2E Testcontainers infrastructure. TPC-H at SF-100 requires ~100 GB disk and ~16 GB RAM; run on CI with a large runner or locally.

Data Generation

| Suite | Tool | Data Size |
|---|---|---|
| TPC-H SF-1 | Custom SQL generator (generate_series) | ~1 GB |
| TPC-H SF-10 | Custom SQL generator | ~10 GB |
| TPC-H SF-100 | Custom SQL generator | ~100 GB |
| sqllogictest | Corpus download (SQLite repo) | ~50 MB |
| JOB / IMDB | CSV download from ftp.fu-berlin.de | ~3.6 GB |
| Nexmark | SQL generate_series | Configurable |

CI Integration

  • T1 (correctness): Add to CI as a weekly scheduled job (SF-1, ~10 min).
  • T2 (performance): Nightly job on dedicated hardware; results tracked in a benchmark history file.
  • T3 (sqllogictest): Weekly; flag regressions via exit code.
  • T4/T5: Manual or monthly; results published to docs/BENCHMARK.md.

Test Tagging

All external-suite tests should be marked #[ignore] (like the existing benchmarks) and gated behind just targets:

just test-tpch          # TPC-H correctness (SF-1)
just bench-tpch         # TPC-H performance (SF-1/10/100)
just test-slt           # sqllogictest filtered corpus
just test-job           # Join Order Benchmark
just bench-nexmark      # Nexmark comparison benchmarks