Plan: Last Differential Mode Gaps

Date: 2026-03-15 Status: Complete — G1 + G3 fully implemented (including De Morgan + multi-pass); G2 was done in a prior release


1. Problem Statement

pg_trickle’s DIFFERENTIAL refresh mode covers a very wide range of SQL constructs. However, three categories of queries still silently fall back to FULL when AUTO mode is selected (or produce hard errors when explicit DIFFERENTIAL is requested):

# Gap Estimated Frequency Effort
G1 User-defined aggregates (UDAs) Medium (analytics, PostGIS, pgvector) Medium
G2 Window functions nested in expressions Medium (ranking/pagination queries) Medium
G3 Sublinks in deeply nested OR Low (complex filter trees) Small–Medium

All three gaps share a common trait: the DVM engine already has the infrastructure to handle them (group-rescan for aggregates, partition recomputation for windows, UNION rewrite for OR+sublinks), but the parser rejects the patterns before that infrastructure can be applied.

Out of Scope

The following FULL-only patterns are fundamentally incompatible with differential semantics and are not addressed here:

  • Materialized views / foreign tables as sources — no CDC triggers.
  • VOLATILE functions (random(), clock_timestamp()) — non-deterministic.
  • TABLESAMPLE — non-deterministic sampling.
  • FOR UPDATE / FOR SHARE — row-level locking has no differential meaning.
  • LIMIT without ORDER BY — non-deterministic row selection.
  • LATERAL with RIGHT/FULL JOIN — rejected by PostgreSQL itself (lateral reference dependency conflicts with RIGHT/FULL semantics).

2. Current State

How AUTO Fallback Works

validate_and_parse_query() in src/api.rs (L540) runs five sequential checkpoints. If any checkpoint fails and mode is AUTO, it downgrades to FULL:

  1. reject_unsupported_constructs() — TABLESAMPLE, FOR UPDATE, nested OR sublinks
  2. reject_materialized_views() — matviews, foreign tables
  3. validate_immediate_mode_support() — IMMEDIATE-specific checks
  4. parse_defining_query_full() + check_ivm_support_with_registry() — parse into OpTree; any failure → FULL
  5. Volatility check — VOLATILE → hard error; STABLE → warning

Gaps G1, G2, and G3 all fail at checkpoint 4 — the OpTree parser rejects them, and AUTO gracefully falls back to FULL.

Key Code Locations

Component File Entry Point
AUTO fallback orchestrator src/api.rs:540 validate_and_parse_query()
Aggregate classification src/dvm/parser.rs:13558 is_known_aggregate()
UDA detection src/dvm/parser.rs:13534 is_user_defined_aggregate()
Aggregate extraction src/dvm/parser.rs:13772 extract_aggregates() rejection
Window-in-expr detection src/dvm/parser.rs:12787 node_contains_window_func()
Window-in-expr rejection src/dvm/parser.rs:13166 extract_window_exprs()
Window diff operator src/dvm/operators/window.rs diff_window()
Aggregate diff operator src/dvm/operators/aggregate.rs diff_aggregate()
OR+sublink rewrite src/dvm/parser.rs:6242 rewrite_sublinks_in_or()
AND+OR+sublink rewrite src/dvm/parser.rs:6395 rewrite_and_with_or_sublinks()
OR+sublink rejection (stage 2) src/dvm/parser.rs:8527 extract_where_sublinks()
AggFunc enum src/dvm/parser.rs:164 AggFunc variants
Group-rescan flag src/dvm/parser.rs:342 AggFunc::is_group_rescan()

3. Gap Details & Implementation Plans


G1: User-Defined Aggregates (UDAs)

Problem

Any aggregate function not in the hard-coded is_known_aggregate() list (~45 built-in functions) is rejected from DIFFERENTIAL mode. This affects:

  • PostGIS: ST_Union(), ST_Collect(), ST_Extent(), ST_MakeLine()
  • pgvector: avg() on vector types (custom aggregate)
  • pg_trgm / hunspell: custom text aggregates
  • User-created aggregates: CREATE AGGREGATE my_percentile(...)

The current rejection is at src/dvm/parser.rs L13772:

} else if is_user_defined_aggregate(&name_lower) {
    return Err(PgTrickleError::UnsupportedOperator(format!(
        "User-defined aggregate function {func_name}() is not supported \
         in DIFFERENTIAL mode. ..."
    )));

Insight

The DVM’s group-rescan strategy already handles many complex aggregates (STRING_AGG, ARRAY_AGG, all statistical functions, etc.) by re-aggregating entire groups from the base data when any row in the group changes. This exact same strategy can safely support any UDA — it makes no assumptions about the aggregate’s internal semantics.

Design

Approach: Treat unknown aggregates as group-rescan aggregates. Rather than rejecting them, classify them as a new AggFunc::UserDefined { name, schema } variant that always uses the group-rescan strategy.

Step 1: Extend AggFunc Enum

File: src/dvm/parser.rs

pub enum AggFunc {
    // ... existing variants ...
    /// User-defined or unrecognized aggregate — always group-rescan.
    UserDefined {
        name: String,
        schema: Option<String>,
    },
}

Update is_group_rescan() to return true for UserDefined.

Update to_sql_name() (or equivalent) to emit schema.name() or just name() depending on whether schema qualification is present.

Step 2: Modify Aggregate Extraction

File: src/dvm/parser.rs, function extract_aggregates() (~L13772)

Replace the UDA rejection with:

} else if is_user_defined_aggregate(&name_lower) {
    // Treat as group-rescan: re-aggregate the entire group on any change.
    AggFunc::UserDefined {
        name: name_lower.clone(),
        schema: extract_schema_from_funcname(funcname),
    }
}
Step 3: Update Aggregate Diff Operator

File: src/dvm/operators/aggregate.rs

The group-rescan code path in diff_aggregate() already handles group-rescan aggregates generically — it emits:

SELECT group_cols, AGG_FUNC(args) FROM source WHERE group_key IN (changed_groups)

The UserDefined variant needs to:

  1. Emit the function call with correct schema qualification.
  2. Preserve any special syntax (ORDER BY within aggregate, FILTER clause, DISTINCT keyword).
  3. Handle the case where the UDA uses ordered-set syntax (WITHIN GROUP (ORDER BY ...)).

Review render_agg_call() (or equivalent) to ensure it can emit arbitrary function names rather than only hard-coded ones.

Step 4: Handle Edge Cases
  • Schema-qualified UDAs: myschema.my_agg(x) — preserve schema in AggFunc::UserDefined.
  • Overloaded names: A UDA with the same name as a built-in (unlikely but possible if schema-qualified) — prefer the built-in classification if unqualified, use UserDefined if schema-qualified to a non-pg_catalog schema.
  • Aggregate with FILTER: my_agg(x) FILTER (WHERE cond) — the existing FILTER handling in the aggregate operator should work unchanged since group-rescan re-evaluates the full expression.
  • Aggregate with ORDER BY: string_agg(x, ',' ORDER BY y) — ensure the ORDER BY clause is preserved in the rendered SQL. The existing group-rescan path already handles this for STRING_AGG; verify it generalizes.
  • Ordered-set aggregates: my_percentile(0.5) WITHIN GROUP (ORDER BY x) — ensure WITHIN GROUP clause is preserved. The existing PERCENTILE_CONT handling is a reference implementation.
Step 5: Catalog Validation

Add a lightweight check at stream table creation time that verifies the UDA exists in pg_proc with prokind = 'a'. This is already done by is_user_defined_aggregate() — reuse it. If the UDA is dropped later, the refresh will fail at execution time with a clear PostgreSQL error (function does not exist), which is acceptable.

Step 6: Tests
Test Description
Unit: test_uda_classified_as_group_rescan Verify AggFunc::UserDefined returns is_group_rescan() == true
Unit: test_uda_sql_rendering Verify schema-qualified and unqualified UDA SQL emission
E2E: test_uda_simple_differential CREATE AGGREGATE → use in stream table → verify DIFFERENTIAL mode
E2E: test_uda_with_filter_clause UDA with FILTER (WHERE …)
E2E: test_uda_with_order_by UDA with ORDER BY inside aggregate
E2E: test_uda_ordered_set UDA using WITHIN GROUP syntax
E2E: test_uda_schema_qualified Schema-qualified UDA myschema.my_agg()
E2E: test_uda_insert_delete_update Full CDC cycle with UDA in GROUP BY query
E2E: test_uda_multiple_in_same_query Multiple UDAs in same SELECT
E2E: test_uda_combined_with_builtin UDA + COUNT + SUM in same query
Effort Estimate
  • Parser changes: ~60 lines
  • Aggregate operator changes: ~30 lines (mostly rendering)
  • Tests: ~300 lines
  • Risk: Low — group-rescan is a proven strategy; no new delta math required

G2: Window Functions Nested in Expressions

Problem

Window functions that appear inside expressions are rejected:

-- All rejected:
SELECT RANK() OVER (...) + 1 AS adjusted_rank FROM t
SELECT CASE WHEN ROW_NUMBER() OVER (...) <= 10 THEN 'top' ELSE 'rest' END FROM t
SELECT COALESCE(LAG(value) OVER (...), 0) AS prev_value FROM t
SELECT CAST(ROW_NUMBER() OVER (...) AS text) FROM t

The rejection is at src/dvm/parser.rs L13166:

if unsafe { node_contains_window_func(rt.val) } {
    return Err(PgTrickleError::UnsupportedOperator(
        "Window functions nested inside expressions ..."
    ));
}

The DVM’s window operator (src/dvm/operators/window.rs) already uses partition-based recomputation: when any row in a partition changes, the entire partition’s window results are recomputed. This means the operator doesn’t need to understand the expression wrapping the window function — it just needs to recompute the full SELECT expression for affected partitions.

Design

Approach: Auto-rewrite the defining query to extract nested window functions into an inner subquery, then apply the outer expression in a wrapper.

Strategy: Two-layer CTE rewrite at parse time.

Step 1: Detect Window-in-Expression Patterns

File: src/dvm/parser.rs

The existing node_contains_window_func() already detects these patterns. Add a companion function extract_window_from_expression() that:

  1. Walks the expression tree.
  2. For each window function found, replaces it with a synthetic column reference (__pgt_win_N).
  3. Returns a list of (synthetic_name, window_function_sql) pairs plus the rewritten outer expression.

Example:

-- Input target:
RANK() OVER (PARTITION BY dept ORDER BY salary DESC) + 1 AS adjusted_rank

-- Extracted:
--   inner column: RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS __pgt_win_1
--   outer expression: __pgt_win_1 + 1 AS adjusted_rank
Step 2: Rewrite Query with Inner Subquery

File: src/dvm/parser.rs, new function rewrite_nested_window_functions()

Called from the query rewrite pipeline (run_query_rewrite_pipeline() in src/api.rs), after existing rewrites but before DVM parsing.

Transform:

-- Input:
SELECT id,
       RANK() OVER (PARTITION BY dept ORDER BY salary DESC) + 1 AS adjusted_rank,
       COALESCE(LAG(salary) OVER (PARTITION BY dept ORDER BY hire_date), 0) AS prev_sal,
       name
FROM employees
WHERE active = true

-- Output:
SELECT "__pgt_winbase"."id",
       "__pgt_winbase"."__pgt_win_1" + 1 AS "adjusted_rank",
       COALESCE("__pgt_winbase"."__pgt_win_2", 0) AS "prev_sal",
       "__pgt_winbase"."name"
FROM (
    SELECT id, name,
           RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS "__pgt_win_1",
           LAG(salary) OVER (PARTITION BY dept ORDER BY hire_date) AS "__pgt_win_2"
    FROM employees
    WHERE active = true
) AS "__pgt_winbase"

The inner subquery becomes a standard window function query (already supported), and the outer SELECT is a simple projection (also supported).

Step 3: Handle Edge Cases
  • Multiple window functions in one expression: RANK() OVER w1 - DENSE_RANK() OVER w2 → extract both as separate __pgt_win_N columns. The inner subquery can contain multiple window functions.

  • Window function in WHERE clause: PostgreSQL itself rejects this (ERROR: window functions are not allowed in WHERE), so no handling needed.

  • Window function in GROUP BY / HAVING: Also rejected by PostgreSQL.

  • Window function in ORDER BY: Already handled by existing window support.

  • Nested expressions containing window + aggregate: SUM(x) + ROW_NUMBER() OVER (...) — PostgreSQL allows this in SELECT. The rewrite must place the aggregate in the inner subquery alongside the window function, or more precisely, the inner subquery should preserve the original query’s GROUP BY and aggregates while adding the window columns.

    However, PostgreSQL does not allow window functions and aggregates to coexist unless the window function operates on the aggregate result: SUM(x) + ROW_NUMBER() OVER (ORDER BY SUM(x)) is valid. In this case, the inner subquery should include the aggregates and the window function, and the outer SELECT applies the wrapping expression.

  • DISTINCT + nested window: The inner subquery preserves all rows (no DISTINCT); DISTINCT moves to the outer query if originally present.

  • ORDER BY + LIMIT (TopK): If the original query has ORDER BY + LIMIT, these stay on the outer query. The inner subquery produces all rows with window columns.

  • Aliased window definitions (WINDOW w AS (PARTITION BY ...)): The inner subquery must carry the WINDOW clause.

Step 4: Parser Integration

The rewrite should happen in run_query_rewrite_pipeline() (in src/api.rs), inserted after rewrite_rows_from() and before parse_defining_query_full():

1. rewrite_scalar_subquery_in_where()
2. rewrite_correlated_scalar_in_select()
3. rewrite_sublinks_in_or()
4. rewrite_rows_from()
5. rewrite_grouping_sets()
6. rewrite_distinct_on()
7. rewrite_nested_window_functions()    ← NEW
8. parse_defining_query_full()

Remove the rejection in extract_window_exprs() — after the rewrite, window functions will always appear at the top level of SELECT targets.

Step 5: Tests
Test Description
Unit: test_extract_window_from_expression Pure logic: expression tree → synthetic columns + outer expr
Unit: test_rewrite_nested_window_arithmetic RANK() + 1 → inner/outer split
Unit: test_rewrite_nested_window_case CASE WHEN ROW_NUMBER() > 5 THEN ...
Unit: test_rewrite_nested_window_coalesce COALESCE(LAG(x), 0)
Unit: test_rewrite_nested_window_cast CAST(ROW_NUMBER() AS text)
Unit: test_rewrite_nested_window_multiple Two window funcs in one expression
Unit: test_rewrite_preserves_plain_windows Non-nested windows pass through unchanged
E2E: test_nested_window_rank_plus_one Full cycle: create, insert, refresh, verify
E2E: test_nested_window_coalesce_lag COALESCE(LAG(…), default)
E2E: test_nested_window_case_row_number CASE on ROW_NUMBER for top-N labeling
E2E: test_nested_window_multiple_exprs Multiple nested windows in same query
E2E: test_nested_window_with_group_by Nested window alongside aggregate
E2E: test_nested_window_insert_delete CDC cycle with partition changes
Effort Estimate
  • Expression extractor: ~120 lines
  • Query rewriter: ~200 lines
  • Parser integration: ~20 lines (add call + remove rejection)
  • Tests: ~400 lines
  • Risk: Medium — expression tree walking and SQL reconstruction is complex; the inner/outer split must preserve all clauses correctly. Edge cases around aggregate + window interaction need careful handling.

G3: Sublinks in Deeply Nested OR

Problem

The two-stage rewrite pipeline handles most OR + sublink patterns:

Pattern Status
WHERE EXISTS(...) OR x > 5 ✓ Stage 1 → UNION
WHERE a = 1 AND (EXISTS(...) OR x > 5) ✓ Stage 1 → AND-prefix + UNION
WHERE a AND (b AND (c OR EXISTS(...))) ✓ AND-flattening rescues this
WHERE (a OR EXISTS(...)) AND (b OR NOT EXISTS(...)) ✗ Only first OR rewritten
WHERE NOT (a AND NOT EXISTS(...)) ✗ De Morgan not applied
WHERE a OR (b AND EXISTS(...)) ✗ Mixed AND/OR under top-level OR

The failing patterns reach extract_where_sublinks() (L8527) with residual OR+sublink nodes and are rejected:

if inner_bool.boolop == pg_sys::BoolExprType::OR_EXPR
    && unsafe { node_tree_contains_sublink(arg_ptr) }
{
    return Err(PgTrickleError::UnsupportedOperator(
        "Subquery expressions (EXISTS, IN) inside OR conditions ..."
    ));
}

Design

Approach: Extend the OR+sublink rewrite pipeline to handle additional nesting patterns through recursive normalization and multi-pass rewriting.

Step 1: Recursive OR+Sublink Normalization

File: src/dvm/parser.rs

Add a new function normalize_or_sublinks() that applies transformations recursively until a fixed point (no more changes):

Transform A — Multiple OR+sublink conjuncts:

Currently rewrite_and_with_or_sublinks() only processes the first OR+sublink conjunct it finds. Extend it to process all of them by iterating:

WHERE (a OR EXISTS(s1)) AND (b OR NOT EXISTS(s2))

Pass 1: Rewrite first OR → UNION branches, each still has second OR+sublink
Pass 2: Rewrite second OR inside each UNION branch
         (each branch is now a simple SELECT, so top-level OR rewrite applies)

Result:
  (SELECT ... WHERE a AND b) UNION
  (SELECT ... WHERE a AND NOT EXISTS(s2)) UNION
  (SELECT ... WHERE EXISTS(s1) AND b) UNION
  (SELECT ... WHERE EXISTS(s1) AND NOT EXISTS(s2))

Implementation: After the first rewrite pass produces a UNION, recursively apply rewrite_sublinks_in_or() to each UNION branch. Cap recursion depth at 3 to prevent pathological explosion.

Transform B — Mixed AND/OR under top-level OR:

WHERE a OR (b AND EXISTS(s1))

-- This is already a top-level OR. The issue is that one arm is an AND
-- containing a sublink. The current UNION rewrite handles this correctly
-- because each OR arm becomes a separate UNION branch:
  (SELECT ... WHERE a) UNION (SELECT ... WHERE b AND EXISTS(s1))

-- The second branch has EXISTS under AND, which extract_where_sublinks()
-- handles natively. This should already work.

Verify and add a test — this may already be supported but untested.

Transform C — NOT wrapping OR+sublink (De Morgan):

WHERE NOT (a AND NOT EXISTS(s1))

-- Apply De Morgan's law:
WHERE NOT a OR EXISTS(s1)

-- Now it's a top-level OR with a sublink → UNION rewrite applies.

Add a De Morgan normalization pass that pushes NOT inward through AND/OR before the UNION rewrite runs. Implementation:

fn push_not_inward(node: *mut pg_sys::Node) -> Result<*mut pg_sys::Node, PgTrickleError>

Rules: - NOT (a AND b)(NOT a) OR (NOT b) - NOT (a OR b)(NOT a) AND (NOT b) - NOT NOT aa - NOT EXISTS(s)NOT EXISTS(s) (leave as-is; already handled)

Only apply when the expression tree contains sublinks inside OR after NOT push-down would expose them to the UNION rewrite.

Step 2: Multi-Pass Pipeline

File: src/api.rs, function run_query_rewrite_pipeline()

Modify the OR+sublink rewrite to run iteratively:

// Rewrite sublinks in OR (may need multiple passes for nested patterns)
let mut prev = query.clone();
for _ in 0..3 {
    query = rewrite_sublinks_in_or(&query)?;
    if query == prev {
        break;  // Fixed point reached
    }
    prev = query.clone();
}

The fixed-point loop ensures that UNION branches created in pass 1 get their own OR+sublink patterns rewritten in pass 2.

Step 3: Guard Against Combinatorial Explosion

When an AND has N OR+sublink conjuncts, the UNION rewrite produces up to $2N$ branches. Add a configurable limit:

const MAX_UNION_BRANCHES: usize = 16;  // 4 OR+sublink conjuncts

if branch_count > MAX_UNION_BRANCHES {
    return Err(PgTrickleError::UnsupportedOperator(format!(
        "Query would produce {} UNION branches after OR+sublink rewriting \
         (limit: {}). Simplify the WHERE clause or use FULL refresh mode.",
        branch_count, MAX_UNION_BRANCHES,
    )));
}

This protects against queries like: sql WHERE (a OR EXISTS(s1)) AND (b OR EXISTS(s2)) AND (c OR EXISTS(s3)) AND (d OR EXISTS(s4)) AND (e OR EXISTS(s5)) -- Would produce 2^5 = 32 UNION branches

Step 4: Tests
Test Description
Unit: test_normalize_multiple_or_sublink_conjuncts Two OR+sublink arms in AND
Unit: test_normalize_not_demorgan NOT (a AND NOT EXISTS(…))
Unit: test_normalize_mixed_and_or_under_or a OR (b AND EXISTS(...))
Unit: test_normalize_triple_or_sublink Three OR+sublink arms (8 branches)
Unit: test_normalize_explosion_guard 5+ OR+sublink arms → error
Unit: test_normalize_fixed_point Verify loop terminates
E2E: test_nested_or_two_exists Two OR arms with EXISTS, differential
E2E: test_nested_or_demorgan NOT(AND(NOT EXISTS)) pattern
E2E: test_nested_or_mixed a OR (b AND EXISTS(...))
E2E: test_nested_or_cdc_cycle Insert/delete with nested OR pattern
Effort Estimate
  • Multi-pass loop: ~15 lines
  • Multiple OR+sublink conjunct handling: ~50 lines
  • De Morgan normalization: ~80 lines
  • Combinatorial guard: ~15 lines
  • Tests: ~250 lines
  • Risk: Low–Medium — each transform is well-defined; the main risk is SQL round-trip fidelity through deparse → rewrite → reparse cycles.

4. Implementation Order

Recommended order based on user impact and dependency:

Phase Gap Rationale
Phase 1 G1: User-defined aggregates Highest real-world impact (PostGIS, analytics). Lowest risk — reuses proven group-rescan.
Phase 2 G2: Window-in-expression Common in ranking/pagination queries. Medium risk — requires careful expression extraction.
Phase 3 G3: Nested OR+sublinks Lowest frequency. Some transforms (multi-pass) are simple; De Morgan is more involved.

Each phase is independently shippable and testable.


5. Verification Criteria

After all three gaps are closed, the following should hold:

  1. No silent FULL fallback for supported patterns: Every pattern listed in the “Status → ✓” column in each gap section should produce DIFFERENTIAL mode when AUTO is requested.

  2. pgtrickle.describe_stream_table() reports refresh_mode = 'DIFFERENTIAL' for queries that previously fell back to FULL.

  3. Zero regressions: All existing E2E tests pass unchanged.

  4. Performance: Group-rescan UDAs should have comparable performance to existing group-rescan aggregates (STRING_AGG, ARRAY_AGG). Window rewrite adds one subquery layer but no additional scans.

  5. Error messages: When a UDA doesn’t exist at creation time, or when OR+sublink expansion exceeds the branch limit, clear error messages guide the user.


6. Open Questions

  1. UDA FILTER/ORDER BY preservation: The existing group-rescan SQL rendering for built-in aggregates preserves FILTER and ORDER BY clauses. Need to verify the rendering code path is generic enough for arbitrary function names, or whether it hard-codes built-in names.

  2. Window rewrite + DISTINCT: If the original query has SELECT DISTINCT, should DISTINCT move to the outer query only, or should the inner subquery also deduplicate? (Answer: outer only — the inner subquery must preserve all rows for correct window computation.)

  3. De Morgan depth limit: How deep should NOT push-down recurse? Suggest depth limit of 3, matching the UNION rewrite pass limit.

  4. UDA in HAVING clause: HAVING my_agg(x) > 10 — the group-rescan strategy recomputes the full GROUP BY expression. Verify that HAVING conditions are included in the recomputation query.

  5. UDA with DISTINCT inside: my_agg(DISTINCT x) — verify the DISTINCT keyword is preserved in the rendered group-rescan SQL.