PLAN: External Test Suites for pg_trickle

Status: Proposed
Date: 2026-02-25
Branch: main
Scope: Adopt public SQL test suites and benchmarks to validate correctness and performance of stream tables beyond the project’s existing test infrastructure.


Table of Contents

  1. Motivation
  2. Suite 1: TPC-H-Derived — Analytical Workload
  3. Suite 2: sqllogictest — Broad Correctness
  4. Suite 3: Join Order Benchmark (JOB) — Real-World Data
  5. Suite 4: Feldera / DBSP Benchmarks — Theory Peer Comparison
  6. Implementation Roadmap
  7. Infrastructure Requirements

Motivation

pg_trickle’s existing test suite (878 unit tests, 22 E2E test suites) validates internal correctness. What it does not cover:

  • Industry-standard workloads — are stream tables effective on the queries real analysts write?
  • Scale — does differential refresh maintain its advantage at 1 GB, 10 GB, 100 GB?
  • Breadth — are there SQL patterns we haven’t thought to test that produce incorrect deltas?
  • Peer comparison — how does pg_trickle’s approach compare to standalone IVM engines like Feldera on the same workload?

Public test suites address all four gaps.


Suite 1: TPC-H-Derived — Analytical Workload

TPC-H Fair Use: Our test suite is derived from the TPC-H Benchmark specification but does not constitute a TPC-H Benchmark result. We use a custom SQL data generator (not dbgen), modified queries, and a non-standard RF3. “TPC-H” is a trademark of the Transaction Processing Performance Council (tpc.org).

Why a TPC-H-Derived Workload

The TPC-H specification defines an industry-standard decision-support schema and query set. Its 22 queries cover joins (up to 8 tables), aggregates, subqueries (scalar, EXISTS, IN), HAVING, CASE WHEN, DISTINCT — all operators pg_trickle supports in DIFFERENTIAL mode. Its two refresh functions (RF1: bulk INSERT, RF2: bulk DELETE) directly exercise the CDC → delta refresh pipeline.

Specification: https://www.tpc.org/tpch/
Our data generator: Custom pure-SQL (generate_series), not dbgen
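
As a sketch of the pure-SQL approach (illustrative only; the real generator must follow the spec's value ranges and cardinalities, and the column subset here is abbreviated):

```sql
-- Sketch: populate a lineitem-like table from generate_series alone.
-- SF-1 lineitem has ~6M rows; ~1.5M orders × 4 lines approximates that
-- shape. Distributions are simplified placeholders, not spec-exact.
INSERT INTO lineitem (l_orderkey, l_linenumber, l_quantity,
                      l_extendedprice, l_discount, l_shipdate)
SELECT o.n,
       l.n,
       1 + floor(random() * 50)::int,             -- quantity 1–50
       round((random() * 100000)::numeric, 2),    -- extended price
       round((random() * 0.10)::numeric, 2),      -- discount 0–10%
       DATE '1992-01-01' + (random() * 2500)::int -- ship date
FROM generate_series(1, 1500000) AS o(n),
     generate_series(1, 4) AS l(n);
```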

Query Compatibility Analysis

All 22 TPC-H queries are compatible with pg_trickle:

| Category | Queries | Modification |
|---|---|---|
| Works as-is | Q6, Q14, Q17, Q19 | None |
| Remove ORDER BY only | Q1, Q4, Q5, Q7, Q8, Q9, Q11, Q12, Q13, Q15, Q16, Q20, Q22 | Cosmetic — ORDER BY is silently ignored anyway |
| TopK (ORDER BY + LIMIT) | Q2, Q3, Q10, Q18, Q21 | ✅ Supported — ORDER BY + LIMIT is accepted as the TopK pattern |
  • 0 queries are blocked by unsupported SQL features.
  • Q15 additionally requires inlining its CREATE VIEW as a CTE or subquery, since pg_trickle requires a single SELECT statement (see the sketch after this list).
  • No TPC-H query uses GROUPING SETS, NATURAL JOIN, or recursive CTEs.
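
A minimal sketch of Q15 with its revenue view inlined as a CTE (standard Q15 text with the trailing ORDER BY dropped per the plan; the date is the spec's example substitution parameter):

```sql
-- Q15 rewritten as a single SELECT: the CREATE VIEW becomes a CTE.
WITH revenue AS (
    SELECT l_suppkey AS supplier_no,
           SUM(l_extendedprice * (1 - l_discount)) AS total_revenue
    FROM lineitem
    WHERE l_shipdate >= DATE '1996-01-01'
      AND l_shipdate <  DATE '1996-01-01' + INTERVAL '3 months'
    GROUP BY l_suppkey
)
SELECT s_suppkey, s_name, s_address, s_phone, total_revenue
FROM supplier
JOIN revenue ON s_suppkey = supplier_no
WHERE total_revenue = (SELECT MAX(total_revenue) FROM revenue);
```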

Detailed Query Matrix

| # | Name | Key SQL Features | Modification |
|---|---|---|---|
| Q1 | Pricing Summary | GROUP BY, SUM/AVG/COUNT, WHERE (date) | None |
| Q2 | Minimum Cost Supplier | Correlated scalar subquery (MIN), 8-table join | TopK |
| Q3 | Shipping Priority | 3-table join, GROUP BY, SUM | TopK |
| Q4 | Order Priority Checking | EXISTS subquery, GROUP BY, COUNT | None |
| Q5 | Local Supplier Volume | 6-table join, GROUP BY, SUM | None |
| Q6 | Forecasting Revenue | Single-table SUM, WHERE filters | None |
| Q7 | Volume Shipping | 6-table join, CASE WHEN, SUM | None |
| Q8 | National Market Share | 8-table join, CASE WHEN, subquery in FROM | None |
| Q9 | Product Type Profit | 6-table join, expressions, LIKE | None |
| Q10 | Returned Item Reporting | 4-table join, GROUP BY, SUM | TopK |
| Q11 | Important Stock ID | HAVING with scalar subquery, 3-table join | None |
| Q12 | Shipping Modes | SUM(CASE WHEN), IN, BETWEEN | None |
| Q13 | Customer Distribution | LEFT OUTER JOIN, nested GROUP BY, subquery in FROM | None |
| Q14 | Promotion Effect | Conditional SUM ratio | None |
| Q15 | Top Supplier | View → inline as CTE; MAX subquery | Inline view as CTE |
| Q16 | Parts/Supplier | COUNT(DISTINCT), NOT IN subquery, NOT LIKE | None |
| Q17 | Small-Quantity Revenue | Scalar subquery (AVG), 2-table join | None |
| Q18 | Large Volume Customer | IN subquery with HAVING, 3-table join | TopK |
| Q19 | Discounted Revenue | Complex OR/AND WHERE, SUM | None |
| Q20 | Potential Promotion | Nested IN subqueries (2 levels) | None |
| Q21 | Suppliers Waiting | EXISTS + NOT EXISTS, multi-join | TopK |
| Q22 | Global Sales Opportunity | NOT EXISTS, scalar subquery, SUBSTRING | None |

SQL Feature Coverage by TPC-H

| SQL Feature | TPC-H Queries Using It | pg_trickle Support |
|---|---|---|
| INNER JOIN | Q2, Q3, Q5, Q7–Q12, Q15, Q17–Q21 | ✅ Full |
| LEFT OUTER JOIN | Q13 | ✅ Full |
| Multi-table join (3+) | Q2, Q3, Q5, Q7–Q11, Q18, Q20, Q21 | ✅ Full |
| GROUP BY | Q1, Q3–Q5, Q7–Q13, Q16, Q18, Q22 | ✅ Full |
| SUM | Q1, Q3, Q5–Q10, Q12, Q14, Q17–Q19 | ✅ Algebraic |
| AVG | Q1, Q17, Q22 | ✅ Algebraic |
| COUNT / COUNT(DISTINCT) | Q1, Q4, Q13, Q16, Q21 | ✅ Algebraic / ref-counted |
| MIN / MAX | Q2, Q15 | ✅ Semi-algebraic |
| HAVING | Q11, Q16, Q18 | ✅ Full |
| CASE WHEN | Q7, Q8, Q12, Q14, Q22 | ✅ Full |
| EXISTS / NOT EXISTS | Q4, Q21, Q22 | ✅ Semi-join / anti-join |
| IN / NOT IN subquery | Q12, Q16, Q18, Q20 | ✅ Semi-join / anti-join |
| Scalar subquery | Q2, Q11, Q15, Q17, Q22 | ✅ Full |
| Subquery in FROM | Q8, Q13, Q15, Q22 | ✅ Full |
| BETWEEN | Q1, Q3–Q6, Q12, Q15, Q20 | ✅ Full |
| LIKE / NOT LIKE | Q9, Q13, Q16 | ✅ Full |
| LIMIT (TopK) | Q2, Q3, Q10, Q18, Q21 | ✅ Supported (ORDER BY + LIMIT) |

Refresh Functions

| Function | Description | CDC Path |
|---|---|---|
| RF1 | Bulk INSERT into orders + lineitem | Triggers capture → differential refresh |
| RF2 | DELETE from orders + lineitem by key | Triggers capture → differential refresh |

Both RF1 and RF2 exercise pg_trickle’s core pipeline. No special handling needed — INSERTs and DELETEs are the native change types that CDC captures.
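
A minimal sketch of one RF1+RF2 cycle; the staging-table names (rf1_orders_staging, rf2_keys, etc.) are hypothetical, and in practice the refresh rows come from the data generator:

```sql
-- RF1-style bulk insert (sketch): add a batch of new orders and their
-- line items inside one transaction so CDC sees a single change set.
BEGIN;
INSERT INTO orders   SELECT * FROM rf1_orders_staging;
INSERT INTO lineitem SELECT * FROM rf1_lineitem_staging;
COMMIT;

-- RF2-style bulk delete (sketch): remove a batch of old orders by key,
-- children first to respect the lineitem → orders foreign key.
BEGIN;
DELETE FROM lineitem WHERE l_orderkey IN (SELECT o_orderkey FROM rf2_keys);
DELETE FROM orders   WHERE o_orderkey IN (SELECT o_orderkey FROM rf2_keys);
COMMIT;
```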

Proposed Test Plan

Phase T1-A: Correctness (Scale Factor 1, ~1 GB)

  1. Load TPC-H SF-1 data using the custom pure-SQL generator.
  2. Create 22 stream tables (one per query; cosmetic ORDER BY removed, TopK queries kept intact).
  3. Perform initial FULL refresh for each.
  4. Run RF1 (inserts) + RF2 (deletes) — one cycle.
  5. Refresh all stream tables in DIFFERENTIAL mode.
  6. Compare every stream table’s contents against a fresh FULL refresh (see the comparison sketch after this list).
  7. Pass criterion: zero row differences across all 22 queries.
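
Step 6's comparison can be expressed as a symmetric difference with multiplicity; tpch_q1_stream and tpch_q1_full are hypothetical names for a stream table and its freshly FULL-refreshed counterpart:

```sql
-- Zero rows returned means the DIFFERENTIAL result matches a full
-- recompute. EXCEPT ALL in both directions checks duplicate-row
-- multiplicity as well as set membership.
(TABLE tpch_q1_stream EXCEPT ALL TABLE tpch_q1_full)
UNION ALL
(TABLE tpch_q1_full EXCEPT ALL TABLE tpch_q1_stream);
```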

Phase T1-B: Performance (Scale Factors 1, 10, 100)

  1. For each scale factor:
    1. Load data, create stream tables, initial FULL refresh.
    2. Run 10 RF1+RF2 cycles with DIFFERENTIAL refresh.
    3. Run 10 RF1+RF2 cycles with FULL refresh.
    4. Record wall-clock time per refresh per query.
  2. Output: Speedup ratio table (FULL / DIFFERENTIAL) × query × scale factor.
  3. Integrate with the existing e2e_bench_tests.rs framework.

Phase T1-C: Sustained Churn

  1. SF-10. Run 100 RF1+RF2 cycles with DIFFERENTIAL refresh.
  2. After every 10th cycle, verify correctness against FULL refresh.
  3. Record cumulative drift, memory usage, and change buffer table sizes (see the monitoring sketch after this list).
  4. Pass criterion: zero cumulative drift; change buffers stay bounded.
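
For step 3, standard catalog functions can track buffer growth; the name pattern below is a placeholder, since the actual naming of pg_trickle's change buffer tables is internal:

```sql
-- Sketch: sample change-buffer sizes after each cycle. The pattern
-- 'trickle_changes_%' is a placeholder for pg_trickle's real internal
-- change buffer naming and must be substituted in.
SELECT relname,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
WHERE relname LIKE 'trickle_changes_%'
ORDER BY pg_total_relation_size(oid) DESC;
```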

Files

| File | Purpose |
|---|---|
| tests/tpch/schema.sql | TPC-H DDL (8 tables with PKs) |
| tests/tpch/queries/ | 22 .sql files (cosmetic ORDER BY stripped; TopK queries intact) |
| tests/tpch/load.sh | SQL generator + COPY loader for a given SF |
| tests/e2e_tpch_tests.rs | Correctness + benchmark test harness |

Suite 2: sqllogictest — Broad Correctness

Why sqllogictest

Originally from SQLite, sqllogictest contains millions of SQL statements with expected results. CockroachDB, DuckDB, and DataFusion all maintain forks. Its value is breadth — covering SQL patterns no developer would think to write manually. For pg_trickle, the key test is: does a stream table in DIFFERENTIAL mode produce the same result as FULL mode for every query?

URL: https://www.sqlite.org/sqllogictest/
Rust runner: https://github.com/risinglightdb/sqllogictest-rs

Approach

We do not need to run the full sqllogictest corpus verbatim. Instead, we use it as a query generator:

  1. Filter the corpus for queries that use pg_trickle-supported features (joins, aggregates, subqueries, etc.) and exclude unsupported ones (LIMIT/OFFSET outside the TopK pattern, GROUPING SETS, etc.).
  2. For each qualifying query Q over tables T₁…Tₙ:
    1. Create base tables, populate with seed data.
    2. Create stream table ST with defining query Q in DIFFERENTIAL mode.
    3. Initial FULL refresh → record contents as expected_full.
    4. Apply a small random change set to T₁…Tₙ (see the mutation sketch after this list).
    5. DIFFERENTIAL refresh → record contents as actual_diff.
    6. FULL refresh (from scratch) → record contents as expected_after.
    7. Assert: actual_diff == expected_after.
  3. Pass criterion: zero mismatches across the filtered corpus.
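
Step 2.4's "small random change set" might look like the following, where t1, id, and val stand in for whatever base table and columns the query references:

```sql
-- Sketch of a small random change set against one base table.
-- setseed makes the run reproducible across comparison runs.
SELECT setseed(0.42);

-- Delete a handful of random rows.
DELETE FROM t1
WHERE id IN (SELECT id FROM t1 ORDER BY random() LIMIT 5);

-- Update a handful more.
UPDATE t1 SET val = val + 1
WHERE id IN (SELECT id FROM t1 ORDER BY random() LIMIT 5);

-- Insert a few new rows with fresh keys.
INSERT INTO t1 (id, val)
SELECT (SELECT max(id) FROM t1) + g, (random() * 100)::int
FROM generate_series(1, 5) AS g;
```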

Scope

Focus on the test/index/random/ and test/select/ directories, which contain the highest density of relevant analytical queries. Skip:

  • Tests that are SQLite-specific (type affinity, etc.)
  • Tests requiring LIMIT/OFFSET
  • Tests with non-deterministic functions

Estimated Query Count

After filtering: ~5,000–15,000 qualifying queries. Each adds ~100 ms of test time at small scale, so a full correctness sweep runs roughly 8–25 minutes; budget 10–30 minutes with setup overhead.

Files

| File | Purpose |
|---|---|
| tests/sqllogictest/filter.py | Filter corpus → qualifying queries |
| tests/sqllogictest/runner.rs | Custom harness: create ST → mutate → compare |
| tests/e2e_slt_tests.rs | Top-level test entry point |

Suite 3: Join Order Benchmark (JOB) — Real-World Data

Why JOB

The Join Order Benchmark uses the real IMDB dataset (~3.6 GB) with 113 analytically complex queries, some joining 10+ tables. Its value is testing delta correctness under realistic data skew — foreign-key distributions that are far from uniform, unlike synthetic TPC-H data.

Paper: Leis et al., “How Good Are Query Optimizers, Really?” (PVLDB 2015)
URL: https://github.com/gregrahn/join-order-benchmark

Query Compatibility

JOB queries use INNER JOINs, WHERE with equality/range/LIKE, and minimal aggregation. Most require only ORDER BY / LIMIT removal. No GROUPING SETS, recursive CTEs, or window functions. Estimated compatibility: 110+/113 queries with trivial modifications.

Proposed Test Plan

  1. Load IMDB data (CSV import).
  2. Create stream tables for the 113 queries.
  3. Apply targeted mutations (INSERT/UPDATE/DELETE on high-skew tables like cast_info, movie_info; see the sketch after this list).
  4. DIFFERENTIAL refresh → compare against FULL refresh.
  5. Focus metric: correctness under skewed join cardinalities.
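
A sketch of step 3's targeted mutation; the column names (movie_id, nr_order) follow the public JOB/IMDB schema, and the predicate picks whichever movie has the most cast rows, i.e. the heaviest join skew:

```sql
-- Sketch: mutate the hottest key in cast_info so the delta hits the
-- most skewed join partner.
WITH hot AS (
    SELECT movie_id
    FROM cast_info
    GROUP BY movie_id
    ORDER BY count(*) DESC
    LIMIT 1
)
DELETE FROM cast_info
WHERE movie_id = (SELECT movie_id FROM hot)
  AND nr_order > 100;   -- trim deep cast lists on the hottest movie
```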

Files

| File | Purpose |
|---|---|
| tests/job/schema.sql | IMDB DDL (21 tables with PKs) |
| tests/job/queries/ | 113 .sql files (adapted) |
| tests/e2e_job_tests.rs | Correctness test harness |

Suite 4: Feldera / DBSP Benchmarks — Theory Peer Comparison

Why Feldera

Since pg_trickle is grounded in DBSP theory, comparing against Feldera (the reference DBSP implementation) on the same workloads provides a meaningful performance baseline. Feldera maintains benchmarks for Nexmark (streaming auction events) and TPC-H variants.

URL: https://github.com/feldera/feldera/tree/main/benchmark

Approach

This is not an apples-to-apples comparison (Feldera is a standalone streaming engine; pg_trickle runs inside PostgreSQL). The goal is to understand the overhead of the PostgreSQL execution model vs. a dedicated dataflow runtime, and to identify queries where pg_trickle’s delta SQL generation is suboptimal.

Nexmark Benchmark

Nexmark models an online auction system with 3 event types (Person, Auction, Bid) and 22 queries. Several Nexmark queries use window functions and time-based grouping. Relevant subset for pg_trickle:

| Query | Features | Compatible? |
|---|---|---|
| Q0 | Passthrough (no-op) | |
| Q1 | Projection + arithmetic | |
| Q2 | Filter | |
| Q3 | Join + filter | |
| Q4 | Join + GROUP BY + AVG + MAX | |
| Q5 | Window: COUNT per time window | ✅ (window function) |
| Q6 | Join + AVG + window | |
| Q7 | MAX in time window | |
| Q8 | Join (person-auction within window) | |
| Q9–Q22 | Various combinations | Most compatible |
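
As an example of the adaptation, Nexmark Q1 (dollar-to-euro conversion) becomes a plain projection over the bid table; the table and column names follow common Nexmark ports and are assumptions about our adapted schema:

```sql
-- Sketch of Nexmark Q1 as a stream table's defining query: convert bid
-- prices from dollars to euros using the benchmark's fixed 0.908 rate.
-- Column names (auction, bidder, price, date_time) follow common ports.
SELECT auction,
       bidder,
       0.908 * price AS price_eur,
       date_time
FROM bid;
```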

Proposed Test Plan

  1. Adapt Nexmark data generator for PostgreSQL tables.
  2. Create stream tables for compatible Nexmark queries.
  3. Measure:
    1. Throughput (events/sec processed per refresh cycle).
    2. Latency (time from event commit to its appearance in the stream table); see the probe sketch after this list.
    3. Compare against published Feldera numbers for the same hardware class.
  4. Publish results in docs/BENCHMARK.md as a separate section.
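
A generic latency probe for step 3.2 might look like this; the marker values and table names are placeholders, and the refresh invocation itself is pg_trickle-specific and left as a comment:

```sql
-- Sketch: measure commit-to-visibility latency for one marker event.
INSERT INTO bid (auction, bidder, price, date_time)
VALUES (99999, 1, 100, clock_timestamp());

-- <trigger a DIFFERENTIAL refresh of the bid stream table here;
--  the refresh API is pg_trickle-specific and not shown>

SELECT clock_timestamp() - date_time AS latency
FROM bid
WHERE auction = 99999;
```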

Files

| File | Purpose |
|---|---|
| tests/nexmark/schema.sql | Nexmark DDL (3 tables) |
| tests/nexmark/generator.sql | Data generation via generate_series |
| tests/nexmark/queries/ | Adapted Nexmark queries |
| tests/e2e_nexmark_tests.rs | Benchmark harness |

Implementation Roadmap

| Phase | Suite | Effort | Priority | Value |
|---|---|---|---|---|
| T1 | TPC-H correctness (SF-1) | 2–3 days | P0 | Industry-standard validation; catches delta bugs in joins + aggregates |
| T2 | TPC-H performance (SF-1/10/100) | 2–3 days | P1 | Quantified speedup numbers for README and marketing |
| T3 | sqllogictest (filtered corpus) | 3–5 days | P1 | Maximum breadth; confident correctness claim |
| T4 | JOB correctness | 2 days | P2 | Real-world data skew validation |
| T5 | Feldera/Nexmark comparison | 3–5 days | P2 | Competitive positioning; identifies optimization gaps |

Recommended start: T1 → T2 → T3. These three phases provide the highest signal-to-effort ratio and produce publishable benchmark numbers.


Infrastructure Requirements

Docker

All test suites run inside the existing E2E Testcontainers infrastructure. TPC-H at SF-100 requires ~100 GB disk and ~16 GB RAM; run on CI with a large runner or locally.

Data Generation

| Suite | Tool | Data Size |
|---|---|---|
| TPC-H SF-1 | Custom SQL generator (generate_series) | ~1 GB |
| TPC-H SF-10 | Custom SQL generator | ~10 GB |
| TPC-H SF-100 | Custom SQL generator | ~100 GB |
| sqllogictest | Corpus download (SQLite repo) | ~50 MB |
| JOB / IMDB | CSV download from ftp.fu-berlin.de | ~3.6 GB |
| Nexmark | SQL generate_series | Configurable |

CI Integration

  • T1 (correctness): Add to CI as a weekly scheduled job (SF-1, ~10 min).
  • T2 (performance): Nightly job on dedicated hardware; results tracked in a benchmark history file.
  • T3 (sqllogictest): Weekly; flag regressions via exit code.
  • T4/T5: Manual or monthly; results published to docs/BENCHMARK.md.

Test Tagging

All external-suite tests should be marked #[ignore] (like the existing benchmarks) and gated behind just targets:

just test-tpch          # TPC-H correctness (SF-1)
just bench-tpch         # TPC-H performance (SF-1/10/100)
just test-slt           # sqllogictest filtered corpus
just test-job           # Join Order Benchmark
just bench-nexmark      # Nexmark comparison benchmarks