GAP_SQL_PHASE_7 — Open Questions

GAP_SQL_PHASE_7 — Open Questions

Date: 2026-02-25 Source: GAP_SQL_PHASE_7.md Prioritized Implementation Roadmap

Tier 0 — Critical Correctness

Q1. F1 (G4.1) — DELETE+INSERT Strategy Fix

The DELETE+INSERT merge strategy evaluates the delta query twice (once for DELETE, once for INSERT). The DELETE mutates the stream table before the INSERT phase reads it, causing stale reads for aggregate/DISTINCT queries.

Alternative	Description	Rationale
(a) Reject at selection time	Block `delete_insert` strategy for queries where `needs_pgt_count() == true`	Simplest fix. Prevents the bug by disallowing the dangerous combination. Zero risk of regression.
(b) Single-evaluation CTE	`WITH delta AS MATERIALIZED (...)` then DELETE and INSERT both read from the same materialized result	More robust — preserves the strategy as an option. Both phases see identical data. Moderate complexity.
© Remove strategy entirely	Drop the `delete_insert` GUC option; always use MERGE	Eliminates the entire bug class. Simplest long-term. But removes a fallback for edge cases where MERGE has issues (e.g., PG versions with MERGE bugs).

Preferred: (a) Reject at selection time. It’s the smallest, safest change. The strategy is already gated behind an explicit GUC that defaults to auto (MERGE), so very few users are affected. If delete_insert is later needed for non-aggregate queries, it remains available.

Q2. F5 (G1.1) — JOIN Key Change with Simultaneous Right-Side Delete

When a row’s join key is updated in the same refresh cycle as the old join partner is deleted, the delta query reads current_right (post-change state) and fails to produce the correct DELETE for the stale join result.

Alternative	Description	Rationale
(A) CTE snapshot	Read the right side at the frontier LSN using `pg_snapshot_xmin()`	Highest correctness. But requires SERIALIZABLE or snapshot export — complex and may reduce concurrency.
(B) Dual-phase delta	Compute DELETE delta using old state, INSERT delta using new state	Architecturally invasive. Doubles query cost. Would require deep changes to all join diff operators.
© Compensating anti-join	After MERGE, detect orphaned rows whose join partner no longer exists and emit corrective DELETEs	Moderate complexity. Adds a post-MERGE cleanup pass. Correct but increases refresh time slightly.
(D) Document + FULL fallback	Document the edge case. Rely on the existing adaptive threshold to trigger FULL refresh when large batches of key-modifying UPDATEs occur	Pragmatic. No code change. Relies on existing fallback mechanism. Risk: small batches of key changes may not trigger the threshold.

Preferred: (D) Document + FULL fallback for now. The scenario requires simultaneous key change + related row deletion in the same refresh cycle, which is uncommon. The adaptive FULL fallback already handles large mutation batches. Option C is the right follow-up if customer reports surface real-world impact.

Q3. F2/F3 (G2.1, G2.2) — WAL Decoder Requirements

The WAL decoder has two P1 issues: keyless tables get pk_hash = 0 (instead of content hash), and UPDATE events have old_* columns always NULL.

Alternative	Description	Rationale
(a) Content hashing in WAL decoder	Implement all-column content hashing matching trigger behavior	Full compatibility with trigger-based CDC. Higher implementation effort (~8–12h for old_* columns).
(b) Require PRIMARY KEY for WAL mode	Reject keyless tables when WAL CDC is active	Simpler. Avoids the hash mismatch problem entirely. Limits WAL mode to well-designed schemas.
© Require REPLICA IDENTITY FULL	For old_* column support, require `ALTER TABLE ... REPLICA IDENTITY FULL` on source tables	PostgreSQL’s native solution. Ensures old tuple data is available in pgoutput. Storage/performance overhead on the source table.

Preferred: (b) Require PRIMARY KEY for WAL mode for pk_hash, combined with © Require REPLICA IDENTITY FULL for old_* columns. WAL mode is an opt-in performance optimization — requiring PK + REPLICA IDENTITY FULL is a reasonable trade-off. Document clearly in the WAL mode prerequisites.

Tier 1 — Verification

Q4. F8 (G1.2) — Window Partition Key Change Verification

When a row’s PARTITION BY column is updated, the row moves between partitions. The scan emits DELETE(old) + INSERT(new), which should trigger recomputation of both the old and new partitions.

Alternative	Description	Rationale
(a) Test-first, then decide	Write targeted E2E test. If it passes → downgrade to P4. If it fails → implement fix.	Efficient — avoids spending 4–6h on a fix that may not be needed. The partition-based recomputation may already handle this correctly.
(b) Fix proactively	Implement explicit dual-partition recomputation without waiting for test results	Guarantees correctness but may be unnecessary work if it already works.

Preferred: (a) Test-first, then decide. The existing partition recomputation logic may already handle this correctly through the DELETE+INSERT split. A 2-hour E2E test resolves the question definitively.

Q5. F9 (G1.3) — Recursive CTE Monotonicity Fallback

Semi-naive evaluation is incorrect for non-monotone recursive terms (those containing EXCEPT, NOT EXISTS, or aggregation). The system must detect non-monotone terms and choose an appropriate strategy.

Alternative	Description	Rationale
(a) Always recomputation for non-monotone	Detect non-monotone operators in the recursive term and force full recomputation	Safe and simple. Recomputation is always correct. May be slower for large recursive results but guarantees correctness.
(b) Use DRed for some non-monotone cases	Attempt DRed (delete-and-rederive) for stratifiable non-monotone terms	DRed can handle some non-monotone cases correctly (e.g., stratified negation). More efficient when applicable but complex to implement the stratification check.

Preferred: (a) Always recomputation for non-monotone. Non-monotone recursive CTEs are rare in practice. The correctness guarantee outweighs the performance cost. DRed optimization can be added later if profiling shows recursive CTEs are a bottleneck.

Q6. F11 (G7.1) — Keyless Table Duplicate Row Handling

Tables without a PRIMARY KEY use content hashing for row identity. Two identical rows produce the same hash, causing incorrect delta behavior.

Alternative	Description	Rationale
(a) Document only	Add prominent documentation about the limitation	Minimal effort. Users with keyless tables and duplicate rows are warned.
(b) Document + WARNING at creation	Emit a `WARNING` when `create_stream_table()` detects a source table with no PK/UNIQUE constraint	Proactive user notification. Low effort (1 extra SPI check). Prevents surprise failures.
© Document + WARNING + ctid tiebreaker	Use `ctid` as a secondary hash input to distinguish identical rows	Most correct, but `ctid` changes on VACUUM and UPDATE, so the tiebreaker itself is unstable. Creates a different class of bugs.

Preferred: (b) Document + WARNING at creation. A WARNING at stream table creation time is low-cost and catches the problem early. Users can then add a PK or UNIQUE constraint. Avoid ctid — its instability creates worse problems than it solves.

Q7. F12 (G8.1) — PgBouncer Compatibility

PgBouncer in transaction-mode pooling doesn’t support session-level advisory locks, prepared statements, or LISTEN/NOTIFY. The scheduler uses advisory locks for concurrency control.

Alternative	Description	Rationale
(a) Transaction-scoped advisory locks	Replace `pg_advisory_lock()` with `pg_advisory_xact_lock()`	PgBouncer-safe. Lock is held only for the transaction duration. Simple drop-in replacement. Risk: lock is released between transactions within a refresh cycle.
(b) Row-level locking on catalog	Use `SELECT ... FOR UPDATE SKIP LOCKED` on `pgt_stream_tables` rows	PgBouncer-safe. No advisory locks needed. Natural fit — each ST “claims” its own catalog row. Concurrent refreshes skip already-locked STs.
© Both + documentation	Implement (b) as primary, keep (a) as fallback, document PgBouncer guidance	Most robust. Two independent concurrency mechanisms. More code to maintain.

Preferred: (b) Row-level locking on catalog. FOR UPDATE SKIP LOCKED is the cleanest solution — it’s PgBouncer-safe, doesn’t require advisory lock IDs, and naturally maps to the per-ST refresh model. It also enables future parallel refresh (multiple workers each lock different ST rows).

Tier 2 — Robustness

Q8. F13 (G4.2) — LIMIT in Subquery Without ORDER BY

LIMIT in a FROM subquery or lateral subquery without ORDER BY produces non-deterministic results — full and differential refresh may pick different rows.

Alternative	Description	Rationale
(a) WARNING	Allow but emit a WARNING about non-determinism	Permissive. `LIMIT` + `ORDER BY` in laterals is a common, valid pattern. Only the missing `ORDER BY` case is problematic.
(b) ERROR	Reject `LIMIT` in subqueries without `ORDER BY` entirely	Strictest. Prevents non-determinism. May block valid use cases where the user doesn’t care about which row is picked.
© ERROR for DIFF, allow for FULL-only	Reject in DIFFERENTIAL mode (where non-determinism causes drift), allow in FULL-only mode	Most nuanced. Correctly identifies that the issue is differential divergence, not the query itself. More complex logic.

Preferred: (a) WARNING. Most PostgreSQL users expect LIMIT in laterals to work. A WARNING surfaces the risk without blocking legitimate patterns. Users who need determinism will add ORDER BY; users doing LATERAL (... LIMIT 1) for “any row” semantics can proceed with awareness.

Q9. F14 (G5.2) — CUBE Combinatorial Explosion Limit

CUBE(a, b, c, ..., n) generates $2ⁿ$ grouping sets, each becoming a UNION ALL branch. Large CUBEs can exhaust memory.

Alternative	Description	Rationale
(a) Hardcoded limit (64)	Reject if $2ⁿ > 64$	Simple. 64 branches covers CUBE on up to 6 columns, which handles virtually all real-world cases.
(b) Hardcoded limit (128)	Reject if $2ⁿ > 128$	More permissive. CUBE on 7 columns. Still safe from OOM.
© GUC-configurable limit	`pg_trickle.max_grouping_set_branches` with default 64	Most flexible — power users can raise it. Adds a GUC.

Preferred: (a) Hardcoded limit of 64. CUBE on 6+ columns is exceedingly rare and likely a mistake. A hardcoded limit avoids GUC proliferation. The error message should suggest explicit GROUPING SETS(...) as an alternative.

Tier 3 — Test Coverage

Q10. F17–F26 — Test File Organization

10 test tasks cover untested operators, auto-rewrites, GUC variations, and multi-cycle refresh patterns.

Alternative	Description	Rationale
(a) Single new file	All new tests in `e2e_aggregate_coverage_tests.rs` (aggregates) + additions to existing files for other operators	Simple organization. Aggregates are the bulk of the work (21 variants). Other tests fit in existing files (`e2e_create_tests.rs`, `e2e_refresh_tests.rs`).
(b) Per-operator new files	New files: `e2e_full_join_tests.rs`, `e2e_set_operation_tests.rs`, `e2e_scalar_subquery_tests.rs`, etc.	Cleaner separation. Each file is focused. Easier to run individual operator suites. Adds 5-6 new test files.
© Hybrid	One new file for aggregate batch (`e2e_aggregate_coverage_tests.rs`), add other tests to existing files where they fit, new files only for completely unrepresented operators	Balanced. Keeps the test directory manageable while maintaining logical grouping.

Preferred: © Hybrid. The 21 aggregate tests warrant their own file. FULL JOIN tests fit in e2e_create_tests.rs or a new e2e_join_tests.rs. INTERSECT/EXCEPT fit in a new e2e_set_operation_tests.rs. GUC and multi-cycle tests fit in e2e_refresh_tests.rs. This avoids both a monolithic test file and excessive file proliferation.

Cross-Cutting / Scope

Q11. WAL Mode for 1.0

The WAL decoder has three P1 issues (F2, F3, F4). The trigger-based CDC is production-ready.

Alternative	Description	Rationale
(a) Trigger-only for 1.0, WAL experimental	Ship 1.0 with trigger CDC only. Mark WAL mode as experimental/unsupported. Fix F4 (parsing, 2-3h) as hygiene. Defer F2+F3 (~14h).	Fastest to 1.0. WAL mode is opt-in behind a GUC already. No user expects WAL mode in a first release.
(b) WAL production-ready for 1.0	F2+F3+F4 (~14–21h) are 1.0 blockers. WAL mode fully supported at launch.	Higher effort. Gives users a migration path to lower-overhead CDC from day one.
© Remove WAL code for 1.0	Strip WAL decoder from the release binary. Reintroduce when ready.	Cleanest release. No dead code. But loses the existing implementation work and makes re-integration harder.

Preferred: (a) Trigger-only for 1.0, WAL experimental. The trigger-based CDC is solid and sufficient for 1.0. WAL mode can be promoted in a 1.1 release after F2+F3 are implemented and battle-tested. Fix F4 (parsing) regardless — it’s a 2-3h hygiene fix that prevents misclassification even in experimental mode.

Q12. 1.0 Release Scope

How much of the roadmap must be completed before 1.0?

Alternative	Tier Coverage	Effort	What It Includes
(a) Tiers 0+1 only	P0/P1 fixes	~39–57h	All correctness issues closed. Test gaps remain — some operators are unverified under real PG execution.
(b) Tiers 0+1+3	P0/P1 + test coverage	~70–95h	Correctness + comprehensive E2E validation. High confidence in release quality.
© Tiers 0+1+2+3	Through robustness	~75–104h	Adds P2 fixes (LIMIT warning, CUBE limit, RANGE_AGG, replica detection). Marginal effort over (b).
(d) Tiers 0–4	Through operational	~100–140h	Full production hardening. Monitoring, retry logic, PgBouncer docs, upgrade paths.

Preferred: © Tiers 0+1+2+3. The P2 items (F13, F14, F15, F16) add only ~7–9h over the test coverage scope but prevent crashes and confusing errors. Tier 4 (operational hardening) can follow in a 1.0.x patch cycle. This gives a release that’s correct, well-tested, and robust — ~75–104h across ~7 sessions.

Q13. Session Execution Order

The recommended plan runs 9 sessions, front-loading non-WAL fixes (Sessions 1–6) and deferring WAL decoder work to Session 7.

Alternative	Description	Rationale
(a) As proposed	Sessions 1–6 focus on trigger-based correctness and tests. Session 7 tackles WAL. Sessions 8–9 are operational hardening.	Maximizes value for the default (trigger) path first. WAL fixes only matter when WAL mode is enabled.
(b) WAL earlier (Session 3–4)	Move F2+F3+F4 to Session 3, shift test coverage to Sessions 5–6	Gets WAL mode production-ready sooner. Useful if WAL is a 1.0 requirement.
© Test coverage earlier	Move F17–F26 (test coverage) to Sessions 3–4, before DDL tracking fixes	Tests may surface unknown bugs in existing operators, informing whether DDL/WAL fixes need reprioritization.

Preferred: (a) As proposed, with one modification: move a small batch of high-value tests (F18: FULL JOIN, F19: INTERSECT/EXCEPT) into Session 2 alongside the DDL tracking work. These operators have 400+ lines of untested delta SQL and testing them early could surface P1 bugs before the codebase grows further.

PGXN

PostgreSQL Extension Network

Contents