TPC-H Q17 Root Cause Analysis

Status: Resolved on this branch. ORCA selects PG-style correlated NL+IndexScan at SF=1/10 and beats PG by 22× / 78× respectively. The “unnest + global HashAggregate” plan that previously dominated has been suppressed by reverting two cost-model commits that were originally intended to fix Q20 but in practice broke Q17.

Query

SELECT sum(l_extendedprice) / 7.0 AS avg_yearly
FROM   lineitem, part
WHERE  p_partkey  = l_partkey
  AND  p_brand    = 'Brand#23'
  AND  p_container = 'MED BOX'
  AND  l_quantity < (SELECT 0.2 * avg(l_quantity)
                     FROM   lineitem
                     WHERE  l_partkey = p_partkey);

The inner scalar subquery is correlated on p_partkey — for each qualifying outer part it computes the per-part average lineitem quantity.

Cardinalities

SF=1 SF=10
part total 200,000 2,000,000
part after brand + container 204 2,044
lineitem total 6,001,215 59,986,052
lineitem.l_partkey distinct 200,000 2,000,000

part is filtered to 0.1% by the brand + container conjunction. With ~30 lineitems per partkey, the correlated plan touches ~6,000 lineitem rows at SF=1 and ~60,000 at SF=10 — vs ~6M / ~60M for any plan that scans lineitem in full.

Two Plan Shapes

“Correlated NL + IndexScan” (PG, ORCA on this branch)

Aggregate
└── NL                                            outer × inner
    │
    ├── NL Left Join (part × per-part avg)        2,044 outer rows
    │   ├── Seq Scan part   (filter brand+container)
    │   └── HashAggregate                         per-call, ~30 rows in / 1 row out
    │       └── IndexScan lineitem_partkey_idx    ~30 rows / probe
    │
    └── IndexScan lineitem_partkey_idx            ~3 rows / probe (after avg-filter)
        Filter: l_quantity < (avg from outer)

Total lineitem touches: 2,044 × 30 = ~60,000. Wall clock at SF=10: ~1.3 s.

“Eager unnest + global HashAgg” (origin/main before this work)

Aggregate
└── Hash Join                                     6M outer × 200k inner
    │  Filter: l_quantity < 0.2*avg
    ├── Seq Scan lineitem                         60M rows, full scan
    └── Hash on:
        Hash Right Join                            200k pre-aggregated rows
        ├── HashAggregate  (60M → 2M groups)       spill 21 batches, 1.5 GB temp
        │   └── Seq Scan lineitem                  60M rows, second full scan
        └── Hash on Seq Scan part                  2,044 rows after filter

Total lineitem touches: 60M + 60M = 120M. Wall clock at SF=10: ~100 s (mostly the spilled HashAgg).

Performance Across Scale Factors

ORCA (origin/main) ORCA (this branch) PG this branch speedup
SF=1 7,000 ms 76 ms 1,800 ms 22.4×
SF=10 100,100 ms 1,280 ms 25,000 ms 19.5×
SF=20 214,500 ms (not measured) 57,000 ms (extrapolated ~22×)

PG’s plan is identical across SF (correlated SubPlan + Bitmap Index Scan); its time grows because heap fetch surface area grows with SF. ORCA’s plan on this branch is also identical across SF and benefits from the same scaling: K_outer × per_probe is a constant, but per-probe heap work grows slowly in linear fashion.

What Changed On This Branch

Two commits originally introduced to fix Q20 had the side effect of making Q17’s correlated-apply path appear far more expensive than its unnest alternative. Both have been reverted here:

Reverted Commit Description
0fb8d27 fix: propagate NumRebinds through CGroupByStatsProcessor::CalcGroupByStats
79e9bd3 fix: correctly price correlated IndexNLJoin inner child cost for Q20

Why they hurt Q17:

  • 0fb8d27 made CLogicalGbAgg over a correlated CPhysicalIndexScan carry NumRebinds = outer_rows instead of 1. This causes CostIndexNLJoin to multiply the inner-subtree cost by ~5,000× when the apply has ~5,000 outer rows. For Q17 the inner-subtree cost represents the per-call HashAgg over IndexScan — already an honest per-rebind cost — so the extra multiplication double-counts and inflates the correlated path far above the unnest path.
  • 79e9bd3 independently adds (num_rows_outer - 1) × inner_cost in CostIndexNLJoin when pdRebinds[1] == GPOPT_DEFAULT_REBINDS. After 0fb8d27, pdRebinds[1] != 1 already (it’s outer_rows), so this guard never fires for Q17. But for the unnest path (different join group), it adds nothing. Net effect: amplifies the correlated-path penalty without any matching adjustment for the unnest path.

Empirical results (Q20 SF=1):

Q20 SF=1 wall clock
origin/main (with both fixes) 3,591 ms
this branch (both reverted) 787 ms
this branch + A-group OLS calibration 787 ms

The “Q20 fix” commits actually regressed Q20 SF=1 by 4.5×. The real fix for Q20 SF=1 is the OLS calibration of NL inner IndexScan costs (cherry-picked here from the tpch branch as commits 0d5956a, 9294709, 250f513).

OLS-Calibrated Cost Model (Cherry-Picked from tpch)

The three cherry-picks together replace the MPP-era cost-model constants with values derived from warm-cache PG 18 OLS measurements on TPC-H lineitem index probes:

Param MPP default This branch (NL inner)
dEffectiveRandomFactor 0.05 4.259e-3 (×0.085)
dEffectiveScanTupCostUnit 3.66e-6 1.725e-6 (×0.471)

Derived from: - α = 4.914e-3 ms / probe (B-tree traversal, warm cache) - β = 6.628e-6 ms / (probe·row·byte) - C_unit = 1.154 ms / cost-unit

Activated only when NumRebinds > 1 AND the predicate covers the leading index column. The non-leading-column path keeps the original penalty so plans probing a non-prefix column remain correctly discouraged.

Net Bench Effect (SF=1, 22 queries × 3 runs)

origin/main this branch Δ
Total ORCA 78,718 ms 63,494 ms −19%
Total PG 61,265 ms 60,608 ms −1%
ORCA / PG 1.28× 1.05× −18%

Per-query highlights (origin/main → this branch):

Q origin/main this branch Δ speedup change
4 4,501 ms 1,421 ms −68% 0.26× → 0.83×
9 4,491 ms 5,808 ms +29% ⚠ 2.02× → 1.57×
10 3,021 ms 1,711 ms −43% 1.00× → 1.77×
14 2,289 ms 2,677 ms +17% ⚠ 1.01× → 0.86×
17 7,155 ms 76 ms −99% 0.24× → 22.44×
20 3,591 ms 787 ms −78% 0.23× → 1.07×
Other (16 of 22) < ±5% within noise

Two queries regressed (Q9 by ~1.3 s, Q14 by ~0.4 s) but both still beat or match PG (1.57× and 0.86×). The 14 s saved on Q17/Q20/Q4/Q10 dwarfs the 1.7 s lost on Q9/Q14.

Why The “Q20 Fix” Was Worse Than No Fix

Both reverted commits were authored after a Q20 timeout regression appeared in the SF=5 internal bench. They successfully made the correlated apply path more expensive than the decorrelated hash join path for Q20 at SF=5 (commit 79e9bd3 reports “30 min → 7 s”). At the same time, however, they made the same correlated-apply path several orders of magnitude more expensive for Q17 at every SF and for Q20 at SF=1, because ORCA’s cost model has no way to distinguish Q17’s correlated apply with cheap per-probe IndexScan inner from Q20’s correlated apply with expensive full-scan inner.

The OLS calibration cherry-picked here addresses the underlying issue more cleanly: it makes the per-probe IndexScan cost itself accurate, so the cost difference between Q17’s correlated path (cheap probes) and Q20’s correlated path (expensive scans) emerges naturally without needing the broad NumRebinds-based amplification.

Empirically the OLS calibration alone fixes both Q17 and Q20 at SF=1, and is sufficient for Q17 at SF=10 too. Q20 at SF=10 is roughly break-even between the two configurations (both ~9 s).

Reproduction

# At repo root, on this branch:
ls test/bench/tpch_bench.sh   # bench harness
PG_CONFIG=$PG_CONFIG bash test/bench/tpch_bench.sh tpch 3
PG_CONFIG=$PG_CONFIG bash test/bench/tpch_bench.sh tpch_sf10 3

To inspect the correlated NL plan:

LOAD 'pg_orca';
SET pg_orca.enable_orca = on;
SET max_parallel_workers_per_gather = 0;
EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
SELECT sum(l_extendedprice) / 7.0 AS avg_yearly
FROM   lineitem, part
WHERE  p_partkey  = l_partkey
  AND  p_brand    = 'Brand#23'
  AND  p_container = 'MED BOX'
  AND  l_quantity < (SELECT 0.2 * avg(l_quantity)
                     FROM   lineitem
                     WHERE  l_partkey = p_partkey);

Expected plan top: Aggregate → Nested Loop → ... → Index Scan using lineitem_partkey_idx (twice, once for the per-part avg subquery and once for the outer join).

Open Questions / Future Work

  1. Q9 / Q14 minor regressions: both still beat PG; root cause is the OLS-calibrated NL cost making one IndexNL look cheap enough to replace a HashJoin that was actually faster. May be addressable by a small additional discount to HashJoin build cost when build side is small (< 1M rows).

  2. Q20 SF=10: ~9 s on either configuration. The unnest path scales linearly with lineitem; correlated path scales linearly with outer rows × per-probe lineitem read. Both are ~10× slower than SF=1, no pathological scaling. Improving SF=10 Q20 would require multi-column selectivity for the partsupp.ps_partkey IN (part filter) semi-join.

  3. Whether to retain risk_threshold mechanism: the current optimizer_index_join_allowed_risk_threshold = 3 is unchanged. With the OLS calibration in place, the risk multiplier is no longer load-bearing for the queries we tested. A future change could either raise the default to 4 (small SF=1 net positive, see Q9 / Q10 trade-off above) or remove the multiplier entirely if it can be shown to never fire under OLS-calibrated costs.