PLAN: SQL Gaps — Phase 4

PLAN: SQL Gaps — Phase 4

Status: Complete
Date: 2026-02-22 Branch: main Scope: Full evaluation of SQL gap status, GAP_SQL_OVERVIEW.md accuracy audit, and prioritized remaining work. Current state: 826 unit tests, 21 E2E test files, 25 AggFunc variants, 21 OpTree variants, 20 diff operators

Evaluation of Previous Plans

PLAN_SQL_GAPS_1.md — ✅ COMPLETE (6/6 tasks)

Task	Description	Status
1	Window-in-subexpression recursive detection (14 node types)	✅
2	GAP_SQL_OVERVIEW.md discrepancy cleanup	✅
3	Stale comment fix on `check_from_item_unsupported()`	✅
4	CROSS JOIN test coverage (3 unit + 4 E2E)	✅
5	FOR UPDATE/FOR SHARE rejection (4 E2E)	✅
6	Documentation updates (README, SQL_REFERENCE)	✅

No remaining work.

PLAN_SQL_GAPS_2.md — ✅ COMPLETE (6/6 tasks)

Task	Description	Status
1	GROUPING SETS silent-skip P0 bug fix	✅
2	GROUP BY `if let Ok` → `?` error propagation hardening	✅
3	TABLESAMPLE explicit rejection	✅
4	GAP_SQL_OVERVIEW.md Gap 6.2 correction	✅
5	Aggregate rejection E2E test coverage	✅
6	Documentation updates	✅

No remaining work.

PLAN_SQL_GAPS_3.md — ✅ COMPLETE (10/10 tasks)

Task	Description	Status
1	BIT_AND/OR/XOR AggFunc variants	✅
2	JSON_OBJECT_AGG / JSONB_OBJECT_AGG	✅
3	Documentation for Session 1	✅
4	Statistical aggregates (STDDEV_POP/SAMP, VAR_POP/SAMP)	✅
5	Documentation for Session 2	✅
6	EXISTS subquery — OpTree + parser (SemiJoin, AntiJoin)	✅
7	EXISTS subquery — delta operators (semi_join.rs, anti_join.rs)	✅
8	IN (SELECT …) subquery parsing	✅
9	Scalar subquery (ScalarSubquery OpTree + operator)	✅
10	Documentation for Sessions 3–5	✅

Sessions 6–8 (Tier 3 structural enhancements) were listed as deferred. These are the remaining work.

GAP_SQL_OVERVIEW.md — Accuracy Audit

🔴 Critical Inaccuracies Found

The report has not been updated to reflect PLAN_SQL_GAPS_3 completion. Multiple sections are stale:

Section	Report Claims	Actual State	Severity
Last Updated (L6)	“PLAN_SQL_GAPS_2 Tasks 1-3 complete”	PLAN_SQL_GAPS_3 Tasks 1-10 also complete	🔴 Stale
Executive summary (L31-35)	“Remaining gaps include: correlated subquery expressions (EXISTS, IN, scalar subqueries)”	EXISTS, IN, NOT EXISTS, NOT IN, scalar subquery are ALL implemented	🔴 Wrong
Test count (L37)	“783 unit tests passing”	809 unit tests passing	🟡 Outdated
Gap 4 intro (L403-405)	“No subquery expression types produce valid SQL”	EXISTS, IN, scalar subquery all produce valid delta SQL	🔴 Wrong
Gap 4.1 (L407)	“✅ FIXED (REJECTED)”	✅ IMPLEMENTED — ScalarSubquery OpTree + operator	🔴 Wrong
Gap 4.2 (L419)	“✅ FIXED (REJECTED)”	✅ IMPLEMENTED — SemiJoin / AntiJoin OpTree + operators	🔴 Wrong
Gap 4.3 (L429)	“✅ FIXED (REJECTED)”	✅ IMPLEMENTED — SemiJoin / AntiJoin + equality condition	🔴 Wrong
Gap 4.4 (L439)	“✅ FIXED (REJECTED)”	ANY_SUBLINK → IMPLEMENTED; ALL_SUBLINK → still rejected	🟡 Partially wrong
Phase B section	“B1–B3 subqueries remain… 4-6 sessions estimated”	B1–B3 are complete	🔴 Wrong
Recommended Next Step	“Continue with Phase B1 (EXISTS subqueries)”	B1-B3 done. Next should be Tier 3	🔴 Wrong
“What Works Well” table	No subquery WHERE rows	Missing SemiJoin/AntiJoin/ScalarSubquery entries	🟡 Incomplete

✅ Accurate Sections

Section	Status
Gap 1 (Expressions 1.1–1.11)	✅ All accurate
Gap 2 (Joins 2.1–2.5)	✅ All accurate
Gap 3 (Aggregation 3.1–3.4)	✅ Accurate (Phase 1+2+3 done, remaining listed correctly)
Gap 5 (LATERAL)	✅ Accurate
Gap 6 (Clause features 6.1–6.8)	✅ All accurate
Gap 7 (Window 7.1–7.4)	✅ All accurate
Gap 8 (Data types 8.1–8.6)	✅ All accurate
Gap 9 (Documentation 9.1–9.3)	✅ Accurate
Priority 1–4 sections	✅ Accurate
Phase A, Phase C	✅ Accurate

Current Codebase State (Ground Truth)

Supported Features (working in DIFFERENTIAL mode)

Category	Feature	OpTree / Strategy
Core	Scan, Project, Filter	Scan, Project, Filter
Joins	INNER, LEFT, RIGHT, FULL, CROSS, nested 3+	InnerJoin, LeftJoin, FullJoin
Aggregation	22 AggFunc variants (COUNT/SUM/AVG/MIN/MAX + 17 group-rescan)	Aggregate
Dedup	DISTINCT	Distinct
Set ops	UNION ALL, UNION, INTERSECT [ALL], EXCEPT [ALL]	UnionAll, Intersect, Except
Subqueries	FROM subquery	Subquery
Subqueries	EXISTS / NOT EXISTS in WHERE	SemiJoin / AntiJoin
Subqueries	IN / NOT IN (subquery) in WHERE	SemiJoin / AntiJoin
Subqueries	Scalar subquery in SELECT list	ScalarSubquery
CTEs	Non-recursive WITH (single + multi-ref)	CteScan
CTEs	WITH RECURSIVE (FULL mode, recomputation in DIFF)	RecursiveCte
Window	Window functions (PARTITION BY, ORDER BY, frames, named)	Window
LATERAL	SRFs (unnest, jsonb_array_elements, etc.)	LateralFunction
LATERAL	Correlated subquery in FROM	LateralSubquery
Expressions	30+ expression types (CASE, COALESCE, IN list, BETWEEN, etc.)	Expr variants

Explicitly Rejected Constructs (clear error messages)

Construct	Error Behavior	Applies To
ORDER BY + LIMIT (TopK)	✅ Supported — scoped recomputation via MERGE	All modes
LIMIT without ORDER BY	Rejected — undefined ordering	All modes
OFFSET	Rejected — stream tables are full result sets	All modes
DISTINCT ON	Rejected → use DISTINCT or ROW_NUMBER()	All modes
NATURAL JOIN	Rejected → use explicit JOIN … ON	All modes
GROUPING SETS / CUBE / ROLLUP	Rejected → use separate STs + UNION ALL, or FULL mode	All modes
FOR UPDATE / FOR SHARE	Rejected — no row-level locking	All modes
TABLESAMPLE	Rejected — use WHERE random()	All modes
ALL (subquery)	Rejected → use NOT EXISTS	All modes
SubLinks inside OR	Rejected → use UNION or separate STs	DIFF mode
Scalar subquery in WHERE	Rejected → use JOIN or CTE	DIFF mode
Mixed UNION / UNION ALL	Rejected → use all UNION or all UNION ALL	All modes
ROWS FROM() with multiple functions	Rejected → use single SRF	All modes
Different PARTITION BY across window funcs	Rejected → must share same PARTITION BY	DIFF mode
Window functions nested in expressions	Rejected → move to separate column	All modes
LATERAL with RIGHT/FULL JOIN	Rejected → only INNER/LEFT supported	All modes
Recursive CTE in DIFFERENTIAL	Rejected → use FULL mode	DIFF mode
20 recognized-but-unsupported aggregates	Rejected → use FULL mode	DIFF mode

Aggregate Support Summary

Implemented (25)	Recognized but Rejected (17)
COUNT, COUNT(*), SUM, AVG, MIN, MAX	XMLAGG
BOOL_AND/EVERY, BOOL_OR	CORR, COVAR_POP, COVAR_SAMP
STRING_AGG, ARRAY_AGG	REGR_AVGX/AVGY/COUNT/INTERCEPT/R2/SLOPE/SXX/SXY/SYY
JSON_AGG, JSONB_AGG	RANK, DENSE_RANK, PERCENT_RANK, CUME_DIST (as aggregates)
BIT_AND, BIT_OR, BIT_XOR
JSON_OBJECT_AGG, JSONB_OBJECT_AGG
STDDEV_POP/STDDEV, STDDEV_SAMP
VAR_POP, VAR_SAMP/VARIANCE
MODE (ordered-set)
PERCENTILE_CONT (ordered-set)
PERCENTILE_DISC (ordered-set)

Remaining Gaps — Prioritized

All remaining items are P2 or lower — nothing is silently broken. Every unsupported construct is either working correctly or rejected with a clear, actionable error message.

Tier 1: Quick Wins (1–2 sessions, follow established patterns)

These use the proven group-rescan pattern — each is a copy-paste of an existing aggregate with minimal new logic.

ID	Construct	Strategy	Effort	Impact
A1	`MODE()` ordered-set aggregate	Group-rescan	✅ COMPLETE	Medium — used in analytics
A2	`PERCENTILE_CONT()`, `PERCENTILE_DISC()`	Group-rescan	✅ COMPLETE	Medium — used in reporting
A3	Regression aggregates: `CORR`, `COVAR_POP/SAMP`, `REGR_*` (11 functions)	Group-rescan	4–6 hours	Low — niche statistical use

A1 and A2 are complete: 3 new AggFunc variants (Mode, PercentileCont, PercentileDisc), WITHIN GROUP (ORDER BY) parsing, 17 new unit tests. 826 unit tests passing.

Note: 17 rejected aggregates remain (regression + hypothetical-set + XMLAGG). All can be implemented with group-rescan if demand arises.

Tier 2: Structural Features (2–4 sessions, significant new logic)

These require new OpTree variants or major parser changes — not just following existing patterns.

ID	Construct	Complexity	Effort	Impact
S1	GROUPING SETS / CUBE / ROLLUP (full impl)	High — each grouping set = separate aggregation	10–15 hours	Medium
S2	DISTINCT ON (full impl)	Medium — rewrite to ROW_NUMBER() OVER (…) = 1	6–8 hours	Medium
S3	Mixed UNION / UNION ALL	Medium — per-arm dedup flags	4–6 hours	Low
S4	Multiple PARTITION BY in one query	High — multiple recomputation passes	8–10 hours	Low
S5	ROWS FROM() with multiple functions	Low — zip SRF outputs	3–4 hours	Very low

Tier 3: Edge Cases (optional, low priority)

ID	Construct	Notes	Effort
E1	Scalar subquery in WHERE	Currently rejected; complex — would need value-change tracking per row	6–8 hours
E2	SubLinks inside OR conditions	Requires OR-to-UNION rewrite; tricky for delta correctness	8–10 hours
E3	ALL (subquery)	Dual of ANY — anti-join with universal quantification	4–6 hours
E4	NATURAL JOIN (full impl)	Needs catalog access to resolve column lists at parse time	6–8 hours
E5	Hypothetical-set aggregates (RANK, DENSE_RANK, PERCENT_RANK, CUME_DIST as aggregates)	Rare — almost always used as window functions, not aggregates	4–6 hours
E6	XMLAGG	Very niche	1–2 hours

GAP_SQL_OVERVIEW.md — Required Updates ✅ ALL COMPLETE

All corrections have been applied to GAP_SQL_OVERVIEW.md.

Task R1: Fix Gap 4 (Subquery) status ✅ COMPLETE

Applied: 1. Gap 4 intro: Updated from “No subquery expression types produce valid SQL” to implementation status 2. Gap 4.1: Changed “✅ FIXED (REJECTED)” → “✅ IMPLEMENTED” with ScalarSubquery details 3. Gap 4.2: Changed “✅ FIXED (REJECTED)” → “✅ IMPLEMENTED” with SemiJoin/AntiJoin details 4. Gap 4.3: Changed “✅ FIXED (REJECTED)” → “✅ IMPLEMENTED” with rewrite details 5. Gap 4.4: Changed to “✅ PARTIALLY IMPLEMENTED” — ANY implemented, ALL still rejected 6. “Fix Options” section replaced with Implementation Summary table

Task R2: Fix Executive Summary and metadata ✅ COMPLETE

Applied: 1. Last Updated: Now references PLAN_SQL_GAPS_3 Tasks 1-10 completion 2. Executive summary: Added subquery WHERE support to feature list 3. Removed subqueries from “Remaining gaps” 4. Test count updated to 809

Task R3: Fix Phase B and Recommended Next Step ✅ COMPLETE

Applied: 1. Phase B: B1–B3 marked as ✅ IMPLEMENTED with details 2. “Estimated effort” and “sessions remaining” removed — Phase B fully complete 3. Recommended Next Step: Points to GAP_SQL_PHASE_4.md for remaining work

Task R4: Update “What Works Well” table ✅ COMPLETE

Recommendations

What to Do Next

Option A: Fix GAP_SQL_OVERVIEW.md accuracy (Tasks R1-R4) — Recommended first - Effort: 1 hour - Rationale: The report is the canonical reference. Having it say “REJECTED” for features that are fully implemented is misleading. This should be fixed before any new feature work.

Option B: Ordered-set aggregates (A1-A2) - Effort: 5-7 hours (1 session) - Rationale: MODE, PERCENTILE_CONT, PERCENTILE_DISC are the most commonly requested remaining aggregates. Group-rescan pattern is proven. Users requesting percentile reporting cannot currently use DIFFERENTIAL mode.

Option C: DISTINCT ON full implementation (S2) - Effort: 6-8 hours (1 session) - Rationale: The rewrite-to-window-function approach is clean and well-understood. DISTINCT ON is a common PostgreSQL idiom for “first row per group” queries. Currently rejected.

Option D: Regression aggregates (A3) - Effort: 4-6 hours (1 session) - Rationale: 11 functions (CORR, COVAR*, REGR*) follow the same group-rescan pattern. Low individual demand but covers a large number of gap items at once.

Recommended Execution Order

Session 1: Tasks R1-R4 (REPORT fix) + A1-A2 (MODE + PERCENTILE)
Session 2: S2 (DISTINCT ON) or A3 (regression aggregates)
Session 3: S3 (Mixed UNION/UNION ALL)
Session 4+: S1 (GROUPING SETS) — only if specifically requested

What NOT to Prioritize

These items have low user demand and/or high complexity relative to their value:

Item	Why Defer
GROUPING SETS full impl (S1)	10-15 hours, needs new OpTree variant, rarely requested. Rejection with UNION ALL alternative is adequate.
Multiple PARTITION BY (S4)	8-10 hours, complex multi-pass recomputation. Edge case.
ROWS FROM multi-function (S5)	Very niche — almost never used.
NATURAL JOIN full impl (E4)	Needs catalog access at parse time. Rejection is appropriate — explicit JOINs are better practice.
SubLinks inside OR (E2)	OR-to-UNION rewrite is architecturally complex and error-prone.
Hypothetical-set aggregates (E5)	Almost always used as window functions, not aggregates. Rejection is fine.

Success Criteria

After this plan’s recommended work (Sessions 1-2):

[x] GAP_SQL_OVERVIEW.md is 100% accurate — no stale claims ✅ DONE (Option A, Tasks R1-R4)
[x] 25 AggFunc variants (up from 22): added MODE, PERCENTILE_CONT, PERCENTILE_DISC ✅ DONE
[ ] OR: DISTINCT ON works via window-function rewrite
[x] 826 unit tests (target was 825+) ✅ DONE
[ ] Documentation fully consistent across README, SQL_REFERENCE, DVM_OPERATORS, REPORT_SQL_GAPS

Historical Progress Summary

Plan	Sessions	Items Resolved	Test Growth
PLAN_SQL_GAPS_1	1	Window detection, CROSS JOIN tests, FOR UPDATE rejection, report cleanup	745 → 757
PLAN_SQL_GAPS_2	1	GROUPING SETS P0 fix, GROUP BY hardening, TABLESAMPLE rejection	745 → 750
PLAN_SQL_GAPS_3	~5	5 new aggregates + 3 subquery operators (SemiJoin, AntiJoin, ScalarSubquery)	750 → 809
PLAN_SQL_GAPS_4	1	Report accuracy fix (R1-R4) + 3 ordered-set aggregates (A1-A2)	809 → 826
Total	~8	47+ of 52 original gaps resolved	~700 → 826

Remaining: 8 gap items remain as intentional rejections (GROUPING SETS, DISTINCT ON, Mixed UNION, Multiple PARTITION BY, ROWS FROM, NATURAL JOIN) + 20 aggregate functions recognized but rejected. All have clear error messages with actionable rewrite suggestions. Zero P0 or P1 issues remain.

PGXN

PostgreSQL Extension Network

Contents