Contents
TPC-H Performance Comparison: ORCA vs PostgreSQL Native Planner
Test data from test/results/tpch_orca.out (pg_orca) vs test/expected/tpch_pg.out (stock PG
planner), both run on the same TPC-H SF=1 dataset with parallelism disabled.
Execution Time
| Query | PG Exec (ms) | ORCA Exec (ms) | Ratio (ORCA/PG) | Winner |
|---|---|---|---|---|
| Q1 | 10,394 | 10,781 | 1.04x | ≈ |
| Q2 | 517 | 868 | 1.68x | PG |
| Q3 | 2,953 | 3,120 | 1.06x | ≈ |
| Q4 | 1,184 | 4,330 | 3.66x | PG |
| Q5 | 1,220 | 2,785 | 2.28x | PG |
| Q6 | 2,238 | 2,209 | 0.99x | ≈ |
| Q7 | 2,014 | 2,946 | 1.46x | PG |
| Q8 | 906 | 940 | 1.04x | ≈ |
| Q9 | 8,650 | 4,355 | 0.50x | ORCA 2x |
| Q10 | 2,933 | 2,920 | 1.00x | ≈ |
| Q11 | 548 | 672 | 1.23x ✓ (fixed) | ≈ |
| Q12 | 3,616 | 3,421 | 0.95x | ≈ |
| Q13 | 1,696 | 1,766 | 1.04x | ≈ |
| Q14 | 2,235 | 2,180 | 0.98x | ≈ |
| Q15 | 2,232 | 2,219 | 0.99x | ≈ |
| Q16 | 1,137 | 626 | 0.55x | ORCA 1.8x |
| Q17 | 1,843 | 2,168 | 1.18x ✓ (fixed) | ≈ |
| Q18 | 8,228 | 9,211 | 1.12x | ≈ |
| Q19 | 151 | 158 | 1.04x | ≈ |
| Q20 | 761 | 3,454 | 4.54x | PG |
| Q21 | 3,829 | 9,670 | 2.53x | PG |
| Q22 | 649 | 1,064 | 1.64x | PG |
Total (22 queries): PG ~59,935 ms vs ORCA ~71,863 ms — ORCA is 1.20x slower overall.
Key Findings
ORCA wins:
- Q9: 2x faster — better join order for the 8-way join with correlated subquery
- Q16: 1.8x faster — better plan for the NOT IN anti-join with part/supplier
Remaining regressions: - Q20 (4.5x): subquery decorrelation gone wrong — see analysis below - Q4 (3.7x): EXISTS subquery unnesting — see analysis below - Q5 (2.3x): multi-table join chain, join order suboptimal - Q21 (2.5x): complex multi-join with semi/anti-join - Q2 (1.7x): join order / index usage
Planning overhead: - ORCA planning is 50x–1150x slower than PG due to exhaustive search. On short-running queries this overhead is a significant fraction of total wall time.
Regression Analysis
Q11 — Fixed ✓
Commit: 46b96a6 cost: penalize index scan on non-leading composite key column
Root cause: partsupp_pkey is (ps_partkey, ps_suppkey). ORCA generated a Nested Loop
where the inner side was Index Scan using partsupp_pkey with condition
ps_suppkey = supplier.s_suppkey — using the non-leading column. Because the B-tree is
ordered by ps_partkey first, this required scanning almost the entire index (~2,196 pages)
per lookup. ORCA’s cost model treated it identically to a leading-column scan, underestimating
the cost by ~50,000x.
Fix: In CostIndexScan() (libgpdbcost/src/CCostModelGPDB.cpp), when the predicate does
not reference the first key column of a composite index, scale dIndexScanTupRandomFactor by
table_pages / rows_per_rebind. This reflects that O(table_pages) index entries must be read
rather than a small range, making the seq-scan + hash-join alternative cheaper.
Result: Q11 execution time 11,281 ms → 672 ms (16.8x improvement). The 2014 TODO
comment in CostIndexScan that requested exactly this logic was removed.
Also resolved as a side effect: Q17 (previously commented out as unsupported) now runs correctly at 2,168 ms (1.18x vs PG).
Q20 — Open
Symptom: 3,454 ms vs PG 761 ms (4.5x slower). work_mem increase to 256 MB eliminates
disk spill but gives no improvement (3,618 ms) — disk spill was not the bottleneck.
Root cause: Q20 contains a correlated scalar subquery in a HAVING-style filter:
ps_availqty > (SELECT 0.5 * sum(l_quantity) FROM lineitem
WHERE l_partkey = ps_partkey AND l_suppkey = ps_suppkey
AND l_shipdate BETWEEN '1994-01-01' AND '1995-01-01')
ORCA eagerly decorrelates this into a full HashAggregate(l_partkey, l_suppkey) over all
of lineitem before joining with the outer query. PG keeps it as a correlated SubPlan and
evaluates it lazily only for the 8,508 qualifying (ps_partkey, ps_suppkey) pairs.
| PG (lazy SubPlan) | ORCA (eager decorrelation) | |
|---|---|---|
| lineitem rows processed | 8,508 × ~1.1 rows via Bitmap Index Scan | 909,455 rows (full year scan) |
| Aggregate groups | 8,508 executions | 543,210 groups (64x more) |
| Disk spill | none | Batches=5, 11,872 KB (at 64 MB work_mem) |
| ORCA cardinality estimate for Hash Join | — | rows=1 (actual 5,833 — wildly wrong) |
The decorrelated plan processes 64x more lineitem data. ORCA’s cardinality estimate for the
join result (rows=1, actual 5,833) masked the true cost and led to selecting this path.
Fix direction: The subquery unnesting transform should compare the expected size of the
decorrelated intermediate result vs. the cost of correlated execution (outer rows × subplan
cost). When outer_qualifying_rows << decorrelated_result_rows, keeping the SubPlan wins.
Q4 — Open
Symptom: 4,330 ms vs PG 1,184 ms (3.66x slower).
Root cause: Q4 filters orders by date range and uses EXISTS to require at least one
late-delivery lineitem:
EXISTS (SELECT * FROM lineitem
WHERE l_orderkey = o_orderkey AND l_commitdate < l_receiptdate)
ORCA unnests EXISTS into a full deduplication + hash join:
SeqScan lineitem (l_commitdate < l_receiptdate) → 3,793,296 rows
HashAggregate(l_orderkey) → 1,375,365 distinct keys (88 MB hash table)
Hash Join with orders (57,218 rows)
PG uses a Nested Loop Semi Join with lineitem_pkey(l_orderkey, l_linenumber):
SeqScan orders → 57,218 rows
Index Scan lineitem_pkey (l_orderkey = o_orderkey, filter l_commitdate < l_receiptdate)
57,218 lookups × ~0.92 rows (semi-join stops at first match)
| PG (NL Semi Join + index) | ORCA (full dedup + hash join) | |
|---|---|---|
| lineitem rows accessed | ~114K (57K × ~2 via index) | 3,793,296 (full scan) |
| Deduplication | none (semi-join semantics) | HashAggregate, 1.375M groups, 88 MB |
| Index used | lineitem_pkey leading col l_orderkey ✓ |
none |
| ORCA row estimate | — | HashAggregate rows=437,556 (actual 1,375,365, 3.1x off) |
ORCA scans 33x more lineitem data. The cardinality underestimate (3.1x) hides the true hash table size and causes ORCA to prefer this plan.
Pattern shared with Q20 and Q21: Q4, Q20, and Q21 all involve ORCA eagerly unnesting a
subquery into a full-scan + aggregate/dedup or Hash Semi/Anti Join before applying outer
filters, while PG’s correlated or semi-join execution exploits available indexes and applies
filters early. The specific cause for Q4/Q21 is structural: ORCA lacks
EopPhysicalLeftSemiIndexNLJoin — see Q21 analysis for detail.
Fix direction: The EXISTS unnesting transform (CXformExistSubq2Join or equivalent) should
estimate the cost of keeping the correlated semi-join path. When the outer driving table is
small and an index covers the inner join key as a leading column, the nested-loop semi-join
is likely cheaper than full-scan + dedup.
Q21 — Open
Symptom: 9,670 ms vs PG 3,829 ms (2.53x slower).
Query structure: Q21 finds suppliers who had waiting shipments. It joins supplier,
lineitem l1 (late-delivery filter), orders, and uses:
- NOT EXISTS on l3 (same order, different supplier, also late) — Anti Join
- EXISTS on l2 (same order, different supplier) — Semi Join
ORCA plan:
Hash Join l1+supplier+nation ⋈ orders (729K-row hash, no spill)
Hash Semi Join l1 ⋈ l2 (EXISTS, l_orderkey)
Hash Anti Join l1 ⋈ l3 (NOT EXISTS, l_orderkey)
Hash Join l1 ⋈ supplier ⋈ nation → 156,739 rows
Seq Scan l1 (l_receiptdate > l_commitdate) → 3,793,296 rows
Hash l3: Seq Scan lineitem (3,793,296 rows) — Batches=2, disk spill 6,957 pages
Hash l2: Seq Scan lineitem (6,001,215 rows) — Batches=4, disk spill 15,376 pages
PG plan:
Nested Loop → orders (Index Scan orders_pkey, 8,357 lookups)
Nested Loop Semi Join → l2 (Index Scan lineitem_pkey, 13,859 lookups × ~0.6 rows)
Nested Loop Anti Join → l3 (Index Scan lineitem_pkey, 156,739 lookups × ~0.9 rows)
Hash Join l1 ⋈ supplier ⋈ nation → 156,739 rows
Seq Scan l1 (l_receiptdate > l_commitdate) → 3,793,296 rows
Data volume comparison:
| PG (NL + index) | ORCA (Hash + full scan) | |
|---|---|---|
| l3 rows read | ~142K (156,739 × 0.91 via index) | 3,793,296 (full scan + disk spill) |
| l2 rows read | ~8,300 (13,859 × 0.60 via index) | 6,001,215 (full scan + disk spill) |
| orders rows | 8,357 (index lookups) | 729,413 (full hash table) |
| Disk spill | none | l3: 6,957 pages; l2: 15,376 pages |
ORCA reads 26x more l3 data and 721x more l2 data than PG.
Root cause: missing Semi/Anti-Semi Index NL Join operators
ORCA’s physical operator set includes:
| Operator | Exists? |
|---|---|
EopPhysicalInnerIndexNLJoin |
yes |
EopPhysicalLeftOuterIndexNLJoin |
yes |
EopPhysicalLeftSemiIndexNLJoin |
no |
EopPhysicalLeftAntiSemiIndexNLJoin |
no |
CXformImplementIndexApply (the transform that instantiates physical Index NL Join operators)
only produces CPhysicalInnerIndexNLJoin and CPhysicalLeftOuterIndexNLJoin. It has no
branches for semi or anti-semi variants.
Because these operators don’t exist, ORCA can never use an index to probe the inner side
of an EXISTS / NOT EXISTS subquery. The only available physical alternatives are
CPhysicalLeftSemiHashJoin and CPhysicalLeftAntiSemiHashJoin, both of which require
materializing the entire inner relation into a hash table — hence the full scan of lineitem
(6M rows for l2, 3.8M for l3) and the resulting disk spill.
Pattern shared with Q4: Q4’s EXISTS regression has the identical structural cause: the
correlated EXISTS subquery is unnested into a full scan + dedup rather than an indexed
semi-join, for the same reason — no EopPhysicalLeftSemiIndexNLJoin exists to exploit
lineitem_pkey.
Fix direction: Implement CPhysicalLeftSemiIndexNLJoin and
CPhysicalLeftAntiSemiIndexNLJoin physical operators (mirroring the inner/outer variants),
and extend CXformImplementIndexApply to instantiate them. This is a significant effort
requiring new physical operator classes, cost model entries, and DXL translator support, but
would address Q4, Q21, and any other query using EXISTS/NOT EXISTS with an indexed inner key.