Contents
- TPC-H Q17 Root Cause Analysis
TPC-H Q17 Root Cause Analysis
Status: Resolved on this branch. ORCA selects PG-style correlated NL+IndexScan at SF=1/10 and beats PG by 22× / 78× respectively. The “unnest + global HashAggregate” plan that previously dominated has been suppressed by reverting two cost-model commits that were originally intended to fix Q20 but in practice broke Q17.
Query
SELECT sum(l_extendedprice) / 7.0 AS avg_yearly
FROM lineitem, part
WHERE p_partkey = l_partkey
AND p_brand = 'Brand#23'
AND p_container = 'MED BOX'
AND l_quantity < (SELECT 0.2 * avg(l_quantity)
FROM lineitem
WHERE l_partkey = p_partkey);
The inner scalar subquery is correlated on p_partkey — for each
qualifying outer part it computes the per-part average lineitem
quantity.
Cardinalities
| SF=1 | SF=10 | |
|---|---|---|
part total |
200,000 | 2,000,000 |
part after brand + container |
204 | 2,044 |
lineitem total |
6,001,215 | 59,986,052 |
lineitem.l_partkey distinct |
200,000 | 2,000,000 |
part is filtered to 0.1% by the brand + container conjunction.
With ~30 lineitems per partkey, the correlated plan touches ~6,000
lineitem rows at SF=1 and ~60,000 at SF=10 — vs ~6M / ~60M for any
plan that scans lineitem in full.
Two Plan Shapes
“Correlated NL + IndexScan” (PG, ORCA on this branch)
Aggregate
└── NL outer × inner
│
├── NL Left Join (part × per-part avg) 2,044 outer rows
│ ├── Seq Scan part (filter brand+container)
│ └── HashAggregate per-call, ~30 rows in / 1 row out
│ └── IndexScan lineitem_partkey_idx ~30 rows / probe
│
└── IndexScan lineitem_partkey_idx ~3 rows / probe (after avg-filter)
Filter: l_quantity < (avg from outer)
Total lineitem touches: 2,044 × 30 = ~60,000. Wall clock at SF=10:
~1.3 s.
“Eager unnest + global HashAgg” (origin/main before this work)
Aggregate
└── Hash Join 6M outer × 200k inner
│ Filter: l_quantity < 0.2*avg
├── Seq Scan lineitem 60M rows, full scan
└── Hash on:
Hash Right Join 200k pre-aggregated rows
├── HashAggregate (60M → 2M groups) spill 21 batches, 1.5 GB temp
│ └── Seq Scan lineitem 60M rows, second full scan
└── Hash on Seq Scan part 2,044 rows after filter
Total lineitem touches: 60M + 60M = 120M. Wall clock at SF=10:
~100 s (mostly the spilled HashAgg).
Performance Across Scale Factors
| ORCA (origin/main) | ORCA (this branch) | PG | this branch speedup | |
|---|---|---|---|---|
| SF=1 | 7,000 ms | 76 ms | 1,800 ms | 22.4× ✅ |
| SF=10 | 100,100 ms | 1,280 ms | 25,000 ms | 19.5× ✅ |
| SF=20 | 214,500 ms | (not measured) | 57,000 ms | (extrapolated ~22×) |
PG’s plan is identical across SF (correlated SubPlan + Bitmap Index Scan);
its time grows because heap fetch surface area grows with SF. ORCA’s
plan on this branch is also identical across SF and benefits from the
same scaling: K_outer × per_probe is a constant, but per-probe heap
work grows slowly in linear fashion.
What Changed On This Branch
Two commits originally introduced to fix Q20 had the side effect of making Q17’s correlated-apply path appear far more expensive than its unnest alternative. Both have been reverted here:
| Reverted Commit | Description |
|---|---|
0fb8d27 |
fix: propagate NumRebinds through CGroupByStatsProcessor::CalcGroupByStats |
79e9bd3 |
fix: correctly price correlated IndexNLJoin inner child cost for Q20 |
Why they hurt Q17:
0fb8d27madeCLogicalGbAggover a correlatedCPhysicalIndexScancarryNumRebinds = outer_rowsinstead of1. This causesCostIndexNLJointo multiply the inner-subtree cost by ~5,000× when the apply has ~5,000 outer rows. For Q17 the inner-subtree cost represents the per-call HashAgg over IndexScan — already an honest per-rebind cost — so the extra multiplication double-counts and inflates the correlated path far above the unnest path.79e9bd3independently adds(num_rows_outer - 1) × inner_costinCostIndexNLJoinwhenpdRebinds[1] == GPOPT_DEFAULT_REBINDS. After0fb8d27,pdRebinds[1] != 1already (it’souter_rows), so this guard never fires for Q17. But for the unnest path (different join group), it adds nothing. Net effect: amplifies the correlated-path penalty without any matching adjustment for the unnest path.
Empirical results (Q20 SF=1):
| Q20 SF=1 | wall clock |
|---|---|
| origin/main (with both fixes) | 3,591 ms |
| this branch (both reverted) | 787 ms ✅ |
| this branch + A-group OLS calibration | 787 ms |
The “Q20 fix” commits actually regressed Q20 SF=1 by 4.5×. The
real fix for Q20 SF=1 is the OLS calibration of NL inner IndexScan
costs (cherry-picked here from the tpch branch as commits
0d5956a, 9294709, 250f513).
OLS-Calibrated Cost Model (Cherry-Picked from tpch)
The three cherry-picks together replace the MPP-era cost-model constants with values derived from warm-cache PG 18 OLS measurements on TPC-H lineitem index probes:
| Param | MPP default | This branch (NL inner) |
|---|---|---|
dEffectiveRandomFactor |
0.05 | 4.259e-3 (×0.085) |
dEffectiveScanTupCostUnit |
3.66e-6 | 1.725e-6 (×0.471) |
Derived from: - α = 4.914e-3 ms / probe (B-tree traversal, warm cache) - β = 6.628e-6 ms / (probe·row·byte) - C_unit = 1.154 ms / cost-unit
Activated only when NumRebinds > 1 AND the predicate covers the leading index column. The non-leading-column path keeps the original penalty so plans probing a non-prefix column remain correctly discouraged.
Net Bench Effect (SF=1, 22 queries × 3 runs)
| origin/main | this branch | Δ | |
|---|---|---|---|
| Total ORCA | 78,718 ms | 63,494 ms | −19% |
| Total PG | 61,265 ms | 60,608 ms | −1% |
| ORCA / PG | 1.28× | 1.05× | −18% |
Per-query highlights (origin/main → this branch):
| Q | origin/main | this branch | Δ | speedup change |
|---|---|---|---|---|
| 4 | 4,501 ms | 1,421 ms | −68% | 0.26× → 0.83× |
| 9 | 4,491 ms | 5,808 ms | +29% ⚠ | 2.02× → 1.57× |
| 10 | 3,021 ms | 1,711 ms | −43% | 1.00× → 1.77× |
| 14 | 2,289 ms | 2,677 ms | +17% ⚠ | 1.01× → 0.86× |
| 17 | 7,155 ms | 76 ms | −99% | 0.24× → 22.44× |
| 20 | 3,591 ms | 787 ms | −78% | 0.23× → 1.07× |
| Other (16 of 22) | < ±5% | within noise |
Two queries regressed (Q9 by ~1.3 s, Q14 by ~0.4 s) but both still beat or match PG (1.57× and 0.86×). The 14 s saved on Q17/Q20/Q4/Q10 dwarfs the 1.7 s lost on Q9/Q14.
Why The “Q20 Fix” Was Worse Than No Fix
Both reverted commits were authored after a Q20 timeout regression
appeared in the SF=5 internal bench. They successfully made the
correlated apply path more expensive than the decorrelated hash
join path for Q20 at SF=5 (commit 79e9bd3 reports “30 min → 7 s”).
At the same time, however, they made the same correlated-apply path
several orders of magnitude more expensive for Q17 at every SF
and for Q20 at SF=1, because ORCA’s cost model has no way to
distinguish Q17’s correlated apply with cheap per-probe IndexScan
inner from Q20’s correlated apply with expensive full-scan inner.
The OLS calibration cherry-picked here addresses the underlying issue more cleanly: it makes the per-probe IndexScan cost itself accurate, so the cost difference between Q17’s correlated path (cheap probes) and Q20’s correlated path (expensive scans) emerges naturally without needing the broad NumRebinds-based amplification.
Empirically the OLS calibration alone fixes both Q17 and Q20 at SF=1, and is sufficient for Q17 at SF=10 too. Q20 at SF=10 is roughly break-even between the two configurations (both ~9 s).
Reproduction
# At repo root, on this branch:
ls test/bench/tpch_bench.sh # bench harness
PG_CONFIG=$PG_CONFIG bash test/bench/tpch_bench.sh tpch 3
PG_CONFIG=$PG_CONFIG bash test/bench/tpch_bench.sh tpch_sf10 3
To inspect the correlated NL plan:
LOAD 'pg_orca';
SET pg_orca.enable_orca = on;
SET max_parallel_workers_per_gather = 0;
EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
SELECT sum(l_extendedprice) / 7.0 AS avg_yearly
FROM lineitem, part
WHERE p_partkey = l_partkey
AND p_brand = 'Brand#23'
AND p_container = 'MED BOX'
AND l_quantity < (SELECT 0.2 * avg(l_quantity)
FROM lineitem
WHERE l_partkey = p_partkey);
Expected plan top: Aggregate → Nested Loop → ... → Index Scan using
lineitem_partkey_idx (twice, once for the per-part avg subquery and
once for the outer join).
Open Questions / Future Work
Q9 / Q14 minor regressions: both still beat PG; root cause is the OLS-calibrated NL cost making one IndexNL look cheap enough to replace a HashJoin that was actually faster. May be addressable by a small additional discount to HashJoin build cost when build side is small (< 1M rows).
Q20 SF=10: ~9 s on either configuration. The unnest path scales linearly with lineitem; correlated path scales linearly with outer rows × per-probe lineitem read. Both are ~10× slower than SF=1, no pathological scaling. Improving SF=10 Q20 would require multi-column selectivity for the
partsupp.ps_partkey IN (part filter)semi-join.Whether to retain
risk_thresholdmechanism: the currentoptimizer_index_join_allowed_risk_threshold = 3is unchanged. With the OLS calibration in place, the risk multiplier is no longer load-bearing for the queries we tested. A future change could either raise the default to 4 (small SF=1 net positive, seeQ9 / Q10trade-off above) or remove the multiplier entirely if it can be shown to never fire under OLS-calibrated costs.