Contents
pg_orca TODO
Status snapshot (as of 2026-05-19, commit 29fc891)
| Workload | ORCA vs PG (par=0) | ORCA vs PG (par=2) | Coverage edge | Source |
|---|---|---|---|---|
| TPC-H sf=10 | 1.12× geomean (ORCA total 636.7 s vs PG 657.9 s) | 0.49× geomean (ORCA 750.4 s vs PG 328.2 s) | — | test/bench/results/tpch_sf10_three_way_summary.md |
| TPC-H sf=5 | 1.07× geomean | — | — | par=0 baseline |
| TPC-DS sf=5 | sum: ORCA 1458 s vs PG 1492 s (common 90) | — | 9 / 99 queries ORCA completes that PG times out (≥120 s) | test/bench/results/tpcds_tpcds_sf5_compare.csv |
| TPC-DS sf=1 | ORCA 627 s vs PG 513 s (common 94) — ORCA 1.22× slower in total | — | 5 / 99 queries ORCA completes that PG times out (≥120 s) | test/bench/results/run_tpcds_sf1_20260509_1112.log |
Strengths — keep leveraging
Correlated subquery decorrelation — ORCA’s signature capability.
- ORCA rewrites
Applyalgebraically into a regularJoinvia theCXformApply2Joinfamily (libgpopt/src/xforms/CXformInnerApply2InnerJoin.cpp,CXformLeftSemiApply2LeftSemiJoin.cpp,CXformLeftOuterApply2LeftOuterJoin.cpp, and ~10 sibling xforms). The result is a single optimizable plan; PG often falls back to aSubPlan/InitPlanexecuted once per outer row. - PG handles
EXISTS/NOT EXISTSadequately; everything else (= (SELECT ...),IN (SELECT ...), nested correlation, subquery with aggregate) is where ORCA opens large gaps. Wins (sample, par=0):
Query Pattern Speedup TPC-H Q17 l_quantity < 0.2 * AVG(...) WHERE p_partkey = ...20.7× (par=0) / 20.4× (par=2) TPC-DS Q41 correlated IN + EXISTS 156× at sf=5 TPC-DS Q21 correlated min/max with HAVING 7.3× TPC-DS Q17 correlated aggregate 7.3× - Drives the TPC-DS coverage edge: 7 of 9 PG-timeout queries at sf=5 (Q1, Q6, Q11, Q14, Q30, Q74, Q81 — Q4 and Q95 mix in window/roll-up) are correlated-subquery dominated.
- Orthogonal to parallelism — Q17 still wins 20× at par=2. Structural plan-shape decision, not hardware throughput.
- ORCA rewrites
Cost-based exhaustive join-order enumeration (DPv2).
libgpopt/src/xforms/CJoinOrderDPv2.cpp+ the commutativity / associativity xforms enumerate the full join space under a cost model that uses ORCA’s cardinality estimates (not PG’s). Result: shape selection (left-deep vs bushy vs star) is cardinality-driven, not heuristic.- PG comparison:
geqo_threshold = 12switches to genetic algorithm above 12 relations; below it uses greedy DP. Neither path uses bushy plans aggressively, and both are sensitive to PG’s row estimates. Wins (sample, par=0):
Query Pattern Speedup TPC-H Q4 2-way + correlated EXISTS, anti-semi shape 2.55× TPC-H Q10 4-way + filter + ORDER BY LIMIT 1.71× TPC-H Q21 4-way + NOT EXISTS, anti-semi tree 1.26× TPC-DS Q25 6-way roll-up join 9.0× TPC-DS Q29 6-way roll-up join 7.0× - Controlled by
optimizer_join_order_threshold = 10(ingpopt/utils/COptTasks.cpp:402— currently not exposed as apg_orca.*GUC; raising it widens the search at the cost of more planning time). - Orthogonal to parallelism — join-tree shape decisions survive intact at par=2.
Statistics propagation through GROUP BY / DISTINCT in CTE / subquery is correct.
CGroupByStatsProcessor::CalcGroupByStatspreserves the full input histogram (incl. NDV) on grouping columns; CTE consumer inherits stats via colid mapping.- On TPC-DS Q31 this gives 3.28× vs PG at sf=5 (PG: 37.4 s vs ORCA: 11.4 s) — same root cause as upstream Richard Guo’s 2026-04 patch (
stadistinctpropagation through GROUP BY in CTEs). - Once that patch lands in PG, this particular gap will narrow. Hunt for the next family of stats-propagation wins before then (multi-column NDV, cross-column correlation, correlated-subquery decorrelation).
TPC-DS coverage edge — ORCA solves queries PG cannot complete.
- At
statement_timeout=120 s, sf=5: ORCA finishes 9 / 99 queries that PG times out on (Q1, Q4, Q6, Q11, Q14, Q30, Q74, Q81, Q95). Same setting at sf=1: 5 / 99 (Q1, Q4, Q6, Q11, Q74). These are not “ORCA marginally faster” — they are complex multi-join / CTE / window queries where PG’s planner picks a plan so bad it never returns. Full 99-query total once PG timeouts are counted (floor: 120 s each):
Suite ORCA total (all 99) PG total (lower bound) ORCA edge TPC-DS sf=5 1,836 s (1,458 s common + 378 s on the 9 PG-timeouts) ≥ 2,572 s (1,492 s + 9 × 120 s) ≥ 1.40× TPC-DS sf=1 627 s ≥ 1,113 s (513 s + 5 × 120 s) ≥ 1.77× 120 sis the timeout floor; PG’s true runtime on those queries is open-ended (could be many minutes — see Richard Guo’s 25.6 min on Q31 in upstream master). The real ORCA edge is strictly larger than the numbers above.- Within the head-to-head subset at sf=5, ORCA’s distribution is bimodal: many small losses (0.4–0.9×) but a few massive wins. Top wins (sf=5): Q41 156×, Q25 9.0×, Q21 7.3×, Q17 7.3×, Q29 7.0×, Q31 5.8×, Q39 5.8×, Q55 4.9×. Sum-of-times still nets in ORCA’s favor on the common 90 (1458 s vs 1492 s).
- These wins cluster around: multi-CTE self-joins (Q31/Q39), correlated subqueries (Q17/Q41), and roll-up aggregations (Q21/Q29). Same family as TPC-H Q17 — structural plan-shape wins.
- Worst losses (sf=5): Q61 (0.001×, ORCA 1270 ms vs PG 0.6 ms — pure planning overhead on a trivial query, see weakness #1), Q82/85/37/91/92 (0.06–0.18×). These are short queries where ORCA’s planning cost dominates, or specific plan-shape regressions worth case-by-case investigation.
- At
Single-thread baseline is the right benchmark for optimizer-quality work. par=2 numbers reflect hardware throughput, not optimizer changes.
Weaknesses — actively hurting us
1. Planning time ⚠️ kills short queries
Impact: ORCA’s CBO is consistently heavier than PG planner.
| Workload | ORCA planning | PG planning |
|---|---|---|
| Simple point lookup (1-row return) | 13.7 ms | 0.25 ms |
| Simple aggregate over 1 table | 3.8 ms | 0.14 ms |
| TPC-H Q4 (Inner Join + EXISTS + GroupBy) | 169 ms | 5.5 ms |
| TPC-DS Q31 (2 CTEs, 6-way self-join) | 192 ms | 3.6 ms |
For OLTP / short-running queries (< 50 ms exec), the planning overhead dominates total latency. A point lookup under ORCA takes ~14 ms wall-clock where PG takes < 1 ms — even if the executed plan is identical.
Why this matters: - Disqualifies pg_orca from any latency-sensitive workload (web request paths, transactional code). - TPC-H/TPC-DS benchmarks are exec-bound so the overhead is amortized — production mixed workloads are not.
Possible mitigation directions (no commitment yet):
- [ ] Query plan cache / shape-keyed memoization — cache the optimized DXL plan keyed by a normalized query shape + statistics generation number. Most OLTP traffic is small numbers of distinct query shapes; reusing planning would cut amortized cost dramatically. Requires invalidation on stats / DDL changes (hook into PG’s invalidate_*_callbacks).
- [ ] Heuristic fast-path bypass — let pg_orca short-circuit to PG planner for queries below a complexity threshold (single table, no joins, no aggregates, equality predicates on indexed columns). Add GUC pg_orca.fast_path_threshold (e.g. join_count). Cheapest win.
- [ ] Profile ORCA’s search loop. The 169 ms / 192 ms numbers suggest the cost is in Memo group exploration + transformation rules. Identify the dominant xforms with a flamegraph; some may be disable-able for simple shapes via existing optimizer.* GUCs.
- [ ] optimizer_join_order_threshold / optimizer_search_limit — already exist in ORCA but not exposed as PG-level GUCs in pg_orca. Wire them through and measure.
2. Existing items
- [ ] Partitioned tables — translation layer for
PartitionedTablescans / dynamic partition pruning (DXL hasPartitionSelectorbut stubs incompat/cdb/).
References
- TPC-H three-way analysis:
test/bench/results/tpch_sf10_three_way_summary.md - par=2 gap + engineering path:
test/bench/results/tpch_sf10_par2_summary.md - par=0 baseline:
test/bench/results/tpch_sf10_summary.md - TPC-DS Q31 stats propagation win: documented inline in 2026-05-19 conversation transcript