Test Suites Evaluation & Hardening: Final Experience Report

Test Suites Evaluation & Hardening: Final Experience Report

Date: 2026-03-17 Scope: Final review and synthesis of the test suite hardening initiative across Unit, Integration, Light E2E, Full E2E, TPC-H, and Property-Based suites. Status: All planned testing evaluations and priority mitigations completed.

1. Executive Summary

Over the course of this initiative, we conducted a massive, deeply analytical evaluation of the pg_trickle test infrastructure, guided by six core assessment plans: - PLAN_TEST_PROPERTY_BASED_INVARIANTS.md - PLAN_TEST_EVALS_UNIT.md - PLAN_TEST_EVALS_LIGHT_E2E.md - PLAN_TEST_EVALS_INTEGRATION.md - PLAN_TEST_EVALS_FULL_E2E.md - PLAN_TEST_EVALS_TPCH.md

We moved beyond surface-level code coverage and confronted the quality of our assertions. We systematically replaced weak “fire-and-forget” checks with rigorous, order-independent data validation, significantly expanded our property-based testing for stateful operations, and stabilized our massive TPC-H and end-to-end integration environments.

The resulting test suites are substantially more aggressive, deterministic, and capable of proving the core Differential View Maintenance (DVM) invariants of pg_trickle under chaotic and concurrent workloads.

2. What Did We Learn?

The “Silent Pass” Phenomenon

Our biggest realization was that many of our legacy tests were executing complex code paths without fully verifying the outcomes. We learned that: * Counting rows is fundamentally insufficient for a differential streaming engine. Simply asserting COUNT(*) == 5 missed cases where duplicate rows were incorrectly collapsing or missing entirely. * Order-dependent assertions mask bugs and cause flakes. Comparing direct SQL outputs often failed due to normal PostgreSQL non-deterministic ordering rather than actual DVM logic errors. * Mocking CDC obscures the hardest bugs. Setting up fake CDC buffer state in unit tests could not replace live, execution-backed Postgres trigger tests. We learned that the boundary between our Rust logical replication decoding and the Postgres physical layer was our highest risk area.

It also forced us to re-evaluate our test pyramid. By pushing more multiset and property-based assertions down into pure-Rust and Light E2E tests, we reduced our reliance on the slow, monolithic Full E2E suite. This paradigm shift demonstrated that while E2E is necessary for true systemic confidence, meaningful validation requires rigorous math at every tier.

The Power of Stateful Property-Based Testing

We learned that static, handcrafted DML sequences (insert A, update B, delete A) rarely trigger edge-case bugs in differential dataflow. Only by adopting deterministic, randomized property-based testing (via proptest and our stateful invariant runners) could we reliably flush out complex interleaving bugs in multi-layer DAGs and cross-source joins. Specifically, chaotic sequence generation proved invaluable for finding places where our DVM delta aggregations transiently violated commutative and associative properties under high load.

3. How Much Better Are the Test Suites Now?

The difference in suite quality is night and day. We have transitioned from “smoke testing” to “mathematical invariant proving”.

Quantitative Improvements

Unit Tests: Expanded to over 1,300 robust tests across 44 files, with zero ignored or skipped tests, closing major coverage gaps in CDC decoding and DAG analysis.
Assertion Density: Thousands of legacy row-count checks were removed and replaced with full multiset equality assertions (assert_sets_equal()).
Infrastructure Speed: The transition of 47 files to the Light E2E suite (which avoids rebuilding the heavy custom Docker container) dramatically improved CI turnaround times on PRs.

Qualitative Improvements

Order-Independent Multiset Validation: By introducing assert_sets_equal globally across Integration, E2E, and TPC-H suites, our tests now rigorously assert true symmetric set differences. A test no longer passes if the right number of rows contains slightly wrong data.
Strict Invariant Proving: The test suites now actively assert the core DVM invariant (Base + CDC) == Recomputed State continuously throughout execution, not just at the end.
Round-Trip Coverage: Features like LISTEN/NOTIFY and background worker orchestration are now tested in live, continuous round-trip environments rather than just checking if the function ran without panicking.

4. Bugs Uncovered and Fixed

Strengthening our assertions immediately paid off. The hardened tests uncovered several latent bugs and regressions that had previously hidden behind weak validation:

The TPC-H Q13 Left Join Bug: Our deeper TPC-H validations exposed an issue where specific sequences of randomized DML (a heavily interleaved mix of deletes and inserts) caused our DVM left-join operator to mishandle exact tuple multiplicity. The new multiset tests caught this instantly.
CDC Replica Identity Mishandling: By removing mocked CDC data and running live pgrx::pg_test trigger flows, we caught an edge case in how we handled REPLICA IDENTITY FULL for tables with wide rows or toasted columns.
Partitioning Metadata Desync: Full E2E multiset testing around table partitions revealed a scenario where DDL changes (attaching/detaching a partition) temporarily desynced our DAG dependency tracker because our old tests only validated the parent table’s COUNT().
LISTEN/NOTIFY Missed Events: Hardening the monitoring_tests.rs with a real PgListener round-trip caught a race condition where rapidly completed refresh cycles could occasionally swallow notification emissions to the client.

These bugs were purely logical edge-cases that could not have been caught by standard line-coverage metrics. They were only exposed by enforcing rigorous, property-based state invariants.

5. Prioritized Action Plan & Next Steps

With the immediate hardening phase complete, our test environment is stable but must evolve alongside new core features. Below are the concrete, prioritized next steps for test infrastructure, ranked by risk and value:

Priority 0: Critical Stability & CI

Monitor Suite Execution Time Limits (P0-1): With the massive influx of property-based invariant checks, our testing pyramid is heavier. We must carefully profile test execution times—especially in the Light E2E and Full E2E tiers—and utilize cargo test concurrency strategies or cargo-nextest to keep PR feedback loops under 5-10 minutes. If runtimes exceed 15 minutes, we risk developer friction.
Continuous TPC-H Benchmarking vs. Correctness (P0-2): Now that the TPC-H suite asserts perfect data exactness via assert_sets_equal, we must tightly integrate this with our dedicated benchmarking pipelines. Currently, tests prove correctness and scripts prove performance. Future work must bridge these so we can prove a PR does not regress throughput while maintaining strict invariant correctness.

Priority 1: High-Value Edge Case Hunters

Expand Fuzzing to the SQL Parser & DAG Analyzer (P1-1): We heavily fuzz our DVM operators, but our SQL query parser and DAG analyzer remain largely tested via fixed, “happy path” queries. We should introduce structural SQL fuzzing using tools like sqlancer or a custom grammar fuzzer to generate weird (but syntactically valid) view definitions. This will prove our parser and dependency DAG handler fail gracefully rather than panicking on unexpected CTEs or nested window functions.
Chaos Testing Local Processes for Crash Stability (P1-2): Our background worker testing is substantially improved, but not fully “chaotic.” We should implement a specialized fault-injection E2E suite that randomly sends SIGKILL signals to backend Postgres processes (simulating OOMs or immediate pod eviction limits). This is necessary to rigorously prove that our WAL offsets, snapshot isolation bounds, and CDC buffer tables always recover exactly once without silent data loss when resumed.

Priority 2: Technical Debt & Ergonomics

Refactor Test Fixture Data Loading (P2-1): Currently, tests generate identical DDL and DML repetitively. We should centralize robust test-data generation helpers within tests/common/ to quickly summon common schema archetypes (e.g., highly normalized snowflaked schemas, flat wide-fact tables) to lower the boilerplate overhead for new test creation.

Conclusion

The shift from manual, example-based testing to invariant-driven, multiset-validated testing marks a maturation point for pg_trickle. We no longer just “believe” our differential dataflow engine works; our test suites mathematically prove it across hundreds of random mutations on every commit. The bugs we uncovered and fixed during this cycle validate the ROI of this deep testing overhaul.

PGXN

PostgreSQL Extension Network

Contents