TPC-H Performance Comparison: ORCA vs PostgreSQL Native Planner

TPC-H Performance Comparison: ORCA vs PostgreSQL Native Planner

Test data from test/results/tpch_orca.out (pg_orca) vs test/expected/tpch_pg.out (stock PG planner), both run on the same TPC-H SF=1 dataset with parallelism disabled.

Execution Time

Query	PG Exec (ms)	ORCA Exec (ms)	Ratio (ORCA/PG)	Winner
Q1	10,394	10,781	1.04x	≈
Q2	517	868	1.68x	PG
Q3	2,953	3,120	1.06x	≈
Q4	1,184	4,330	3.66x	PG
Q5	1,220	2,785	2.28x	PG
Q6	2,238	2,209	0.99x	≈
Q7	2,014	2,946	1.46x	PG
Q8	906	940	1.04x	≈
Q9	8,650	4,355	0.50x	ORCA 2x
Q10	2,933	2,920	1.00x	≈
Q11	548	672	1.23x ✓ (fixed)	≈
Q12	3,616	3,421	0.95x	≈
Q13	1,696	1,766	1.04x	≈
Q14	2,235	2,180	0.98x	≈
Q15	2,232	2,219	0.99x	≈
Q16	1,137	626	0.55x	ORCA 1.8x
Q17	1,843	2,168	1.18x ✓ (fixed)	≈
Q18	8,228	9,211	1.12x	≈
Q19	151	158	1.04x	≈
Q20	761	3,454	4.54x	PG
Q21	3,829	9,670	2.53x	PG
Q22	649	1,064	1.64x	PG

Total (22 queries): PG ~59,935 ms vs ORCA ~71,863 ms — ORCA is 1.20x slower overall.

Key Findings

ORCA wins: - Q9: 2x faster — better join order for the 8-way join with correlated subquery - Q16: 1.8x faster — better plan for the NOT IN anti-join with part/supplier

Remaining regressions: - Q20 (4.5x): subquery decorrelation gone wrong — see analysis below - Q4 (3.7x): EXISTS subquery unnesting — see analysis below - Q5 (2.3x): multi-table join chain, join order suboptimal - Q21 (2.5x): complex multi-join with semi/anti-join - Q2 (1.7x): join order / index usage

Planning overhead: - ORCA planning is 50x–1150x slower than PG due to exhaustive search. On short-running queries this overhead is a significant fraction of total wall time.

Regression Analysis

Q11 — Fixed ✓

Commit: 46b96a6 cost: penalize index scan on non-leading composite key column

Root cause: partsupp_pkey is (ps_partkey, ps_suppkey). ORCA generated a Nested Loop where the inner side was Index Scan using partsupp_pkey with condition ps_suppkey = supplier.s_suppkey — using the non-leading column. Because the B-tree is ordered by ps_partkey first, this required scanning almost the entire index (~2,196 pages) per lookup. ORCA’s cost model treated it identically to a leading-column scan, underestimating the cost by ~50,000x.

Fix: In CostIndexScan() (libgpdbcost/src/CCostModelGPDB.cpp), when the predicate does not reference the first key column of a composite index, scale dIndexScanTupRandomFactor by table_pages / rows_per_rebind. This reflects that O(table_pages) index entries must be read rather than a small range, making the seq-scan + hash-join alternative cheaper.

Result: Q11 execution time 11,281 ms → 672 ms (16.8x improvement). The 2014 TODO comment in CostIndexScan that requested exactly this logic was removed.

Also resolved as a side effect: Q17 (previously commented out as unsupported) now runs correctly at 2,168 ms (1.18x vs PG).

Q20 — Open

Symptom: 3,454 ms vs PG 761 ms (4.5x slower). work_mem increase to 256 MB eliminates disk spill but gives no improvement (3,618 ms) — disk spill was not the bottleneck.

Root cause: Q20 contains a correlated scalar subquery in a HAVING-style filter:

ps_availqty > (SELECT 0.5 * sum(l_quantity) FROM lineitem
               WHERE l_partkey = ps_partkey AND l_suppkey = ps_suppkey
               AND l_shipdate BETWEEN '1994-01-01' AND '1995-01-01')

ORCA eagerly decorrelates this into a full HashAggregate(l_partkey, l_suppkey) over all of lineitem before joining with the outer query. PG keeps it as a correlated SubPlan and evaluates it lazily only for the 8,508 qualifying (ps_partkey, ps_suppkey) pairs.

	PG (lazy SubPlan)	ORCA (eager decorrelation)
lineitem rows processed	8,508 × ~1.1 rows via Bitmap Index Scan	909,455 rows (full year scan)
Aggregate groups	8,508 executions	543,210 groups (64x more)
Disk spill	none	Batches=5, 11,872 KB (at 64 MB work_mem)
ORCA cardinality estimate for Hash Join	—	`rows=1` (actual 5,833 — wildly wrong)

The decorrelated plan processes 64x more lineitem data. ORCA’s cardinality estimate for the join result (rows=1, actual 5,833) masked the true cost and led to selecting this path.

Fix direction: The subquery unnesting transform should compare the expected size of the decorrelated intermediate result vs. the cost of correlated execution (outer rows × subplan cost). When outer_qualifying_rows << decorrelated_result_rows, keeping the SubPlan wins.

Q4 — Open

Symptom: 4,330 ms vs PG 1,184 ms (3.66x slower).

Root cause: Q4 filters orders by date range and uses EXISTS to require at least one late-delivery lineitem:

EXISTS (SELECT * FROM lineitem
        WHERE l_orderkey = o_orderkey AND l_commitdate < l_receiptdate)

ORCA unnests EXISTS into a full deduplication + hash join:

SeqScan lineitem (l_commitdate < l_receiptdate) → 3,793,296 rows
  HashAggregate(l_orderkey) → 1,375,365 distinct keys   (88 MB hash table)
    Hash Join with orders (57,218 rows)

PG uses a Nested Loop Semi Join with lineitem_pkey(l_orderkey, l_linenumber):

SeqScan orders → 57,218 rows
  Index Scan lineitem_pkey (l_orderkey = o_orderkey, filter l_commitdate < l_receiptdate)
    57,218 lookups × ~0.92 rows (semi-join stops at first match)

	PG (NL Semi Join + index)	ORCA (full dedup + hash join)
lineitem rows accessed	~114K (57K × ~2 via index)	3,793,296 (full scan)
Deduplication	none (semi-join semantics)	HashAggregate, 1.375M groups, 88 MB
Index used	`lineitem_pkey` leading col `l_orderkey` ✓	none
ORCA row estimate	—	HashAggregate `rows=437,556` (actual 1,375,365, 3.1x off)

ORCA scans 33x more lineitem data. The cardinality underestimate (3.1x) hides the true hash table size and causes ORCA to prefer this plan.

Pattern shared with Q20 and Q21: Q4, Q20, and Q21 all involve ORCA eagerly unnesting a subquery into a full-scan + aggregate/dedup or Hash Semi/Anti Join before applying outer filters, while PG’s correlated or semi-join execution exploits available indexes and applies filters early. The specific cause for Q4/Q21 is structural: ORCA lacks EopPhysicalLeftSemiIndexNLJoin — see Q21 analysis for detail.

Fix direction: The EXISTS unnesting transform (CXformExistSubq2Join or equivalent) should estimate the cost of keeping the correlated semi-join path. When the outer driving table is small and an index covers the inner join key as a leading column, the nested-loop semi-join is likely cheaper than full-scan + dedup.

Q21 — Open

Symptom: 9,670 ms vs PG 3,829 ms (2.53x slower).

Query structure: Q21 finds suppliers who had waiting shipments. It joins supplier, lineitem l1 (late-delivery filter), orders, and uses: - NOT EXISTS on l3 (same order, different supplier, also late) — Anti Join - EXISTS on l2 (same order, different supplier) — Semi Join

ORCA plan:

Hash Join l1+supplier+nation ⋈ orders (729K-row hash, no spill)
  Hash Semi Join l1 ⋈ l2 (EXISTS, l_orderkey)
    Hash Anti Join l1 ⋈ l3 (NOT EXISTS, l_orderkey)
      Hash Join l1 ⋈ supplier ⋈ nation  →  156,739 rows
        Seq Scan l1 (l_receiptdate > l_commitdate)  →  3,793,296 rows
    Hash l3: Seq Scan lineitem (3,793,296 rows) — Batches=2, disk spill 6,957 pages
  Hash l2: Seq Scan lineitem (6,001,215 rows) — Batches=4, disk spill 15,376 pages

PG plan:

Nested Loop → orders (Index Scan orders_pkey, 8,357 lookups)
  Nested Loop Semi Join → l2 (Index Scan lineitem_pkey, 13,859 lookups × ~0.6 rows)
    Nested Loop Anti Join → l3 (Index Scan lineitem_pkey, 156,739 lookups × ~0.9 rows)
      Hash Join l1 ⋈ supplier ⋈ nation  →  156,739 rows
        Seq Scan l1 (l_receiptdate > l_commitdate)  →  3,793,296 rows

Data volume comparison:

	PG (NL + index)	ORCA (Hash + full scan)
l3 rows read	~142K (156,739 × 0.91 via index)	3,793,296 (full scan + disk spill)
l2 rows read	~8,300 (13,859 × 0.60 via index)	6,001,215 (full scan + disk spill)
orders rows	8,357 (index lookups)	729,413 (full hash table)
Disk spill	none	l3: 6,957 pages; l2: 15,376 pages

ORCA reads 26x more l3 data and 721x more l2 data than PG.

Root cause: missing Semi/Anti-Semi Index NL Join operators

ORCA’s physical operator set includes:

Operator	Exists?
`EopPhysicalInnerIndexNLJoin`	yes
`EopPhysicalLeftOuterIndexNLJoin`	yes
`EopPhysicalLeftSemiIndexNLJoin`	no
`EopPhysicalLeftAntiSemiIndexNLJoin`	no

CXformImplementIndexApply (the transform that instantiates physical Index NL Join operators) only produces CPhysicalInnerIndexNLJoin and CPhysicalLeftOuterIndexNLJoin. It has no branches for semi or anti-semi variants.

Because these operators don’t exist, ORCA can never use an index to probe the inner side of an EXISTS / NOT EXISTS subquery. The only available physical alternatives are CPhysicalLeftSemiHashJoin and CPhysicalLeftAntiSemiHashJoin, both of which require materializing the entire inner relation into a hash table — hence the full scan of lineitem (6M rows for l2, 3.8M for l3) and the resulting disk spill.

Pattern shared with Q4: Q4’s EXISTS regression has the identical structural cause: the correlated EXISTS subquery is unnested into a full scan + dedup rather than an indexed semi-join, for the same reason — no EopPhysicalLeftSemiIndexNLJoin exists to exploit lineitem_pkey.

Fix direction: Implement CPhysicalLeftSemiIndexNLJoin and CPhysicalLeftAntiSemiIndexNLJoin physical operators (mirroring the inner/outer variants), and extend CXformImplementIndexApply to instantiate them. This is a significant effort requiring new physical operator classes, cost model entries, and DXL translator support, but would address Q4, Q21, and any other query using EXISTS/NOT EXISTS with an indexed inner key.

PGXN

PostgreSQL Extension Network

Contents