TPC-H Q17 Root Cause Analysis

TPC-H Q17 Root Cause Analysis

Status: Resolved on this branch. ORCA selects PG-style correlated NL+IndexScan at SF=1/10 and beats PG by 22× / 78× respectively. The “unnest + global HashAggregate” plan that previously dominated has been suppressed by reverting two cost-model commits that were originally intended to fix Q20 but in practice broke Q17.

Query

SELECT sum(l_extendedprice) / 7.0 AS avg_yearly
FROM   lineitem, part
WHERE  p_partkey  = l_partkey
  AND  p_brand    = 'Brand#23'
  AND  p_container = 'MED BOX'
  AND  l_quantity < (SELECT 0.2 * avg(l_quantity)
                     FROM   lineitem
                     WHERE  l_partkey = p_partkey);

The inner scalar subquery is correlated on p_partkey — for each qualifying outer part it computes the per-part average lineitem quantity.

Cardinalities

	SF=1	SF=10
`part` total	200,000	2,000,000
`part` after `brand` + `container`	204	2,044
`lineitem` total	6,001,215	59,986,052
`lineitem.l_partkey` distinct	200,000	2,000,000

part is filtered to 0.1% by the brand + container conjunction. With ~30 lineitems per partkey, the correlated plan touches ~6,000 lineitem rows at SF=1 and ~60,000 at SF=10 — vs ~6M / ~60M for any plan that scans lineitem in full.

Two Plan Shapes

“Correlated NL + IndexScan” (PG, ORCA on this branch)

Aggregate
└── NL                                            outer × inner
    │
    ├── NL Left Join (part × per-part avg)        2,044 outer rows
    │   ├── Seq Scan part   (filter brand+container)
    │   └── HashAggregate                         per-call, ~30 rows in / 1 row out
    │       └── IndexScan lineitem_partkey_idx    ~30 rows / probe
    │
    └── IndexScan lineitem_partkey_idx            ~3 rows / probe (after avg-filter)
        Filter: l_quantity < (avg from outer)

Total lineitem touches: 2,044 × 30 = ~60,000. Wall clock at SF=10: ~1.3 s.

“Eager unnest + global HashAgg” (origin/main before this work)

Aggregate
└── Hash Join                                     6M outer × 200k inner
    │  Filter: l_quantity < 0.2*avg
    ├── Seq Scan lineitem                         60M rows, full scan
    └── Hash on:
        Hash Right Join                            200k pre-aggregated rows
        ├── HashAggregate  (60M → 2M groups)       spill 21 batches, 1.5 GB temp
        │   └── Seq Scan lineitem                  60M rows, second full scan
        └── Hash on Seq Scan part                  2,044 rows after filter

Total lineitem touches: 60M + 60M = 120M. Wall clock at SF=10: ~100 s (mostly the spilled HashAgg).

Performance Across Scale Factors

	ORCA (origin/main)	ORCA (this branch)	PG	this branch speedup
SF=1	7,000 ms	76 ms	1,800 ms	22.4× ✅
SF=10	100,100 ms	1,280 ms	25,000 ms	19.5× ✅
SF=20	214,500 ms	(not measured)	57,000 ms	(extrapolated ~22×)

PG’s plan is identical across SF (correlated SubPlan + Bitmap Index Scan); its time grows because heap fetch surface area grows with SF. ORCA’s plan on this branch is also identical across SF and benefits from the same scaling: K_outer × per_probe is a constant, but per-probe heap work grows slowly in linear fashion.

What Changed On This Branch

Two commits originally introduced to fix Q20 had the side effect of making Q17’s correlated-apply path appear far more expensive than its unnest alternative. Both have been reverted here:

Reverted Commit	Description
`0fb8d27`	fix: propagate NumRebinds through CGroupByStatsProcessor::CalcGroupByStats
`79e9bd3`	fix: correctly price correlated IndexNLJoin inner child cost for Q20

Why they hurt Q17:

0fb8d27 made CLogicalGbAgg over a correlated CPhysicalIndexScan carry NumRebinds = outer_rows instead of 1. This causes CostIndexNLJoin to multiply the inner-subtree cost by ~5,000× when the apply has ~5,000 outer rows. For Q17 the inner-subtree cost represents the per-call HashAgg over IndexScan — already an honest per-rebind cost — so the extra multiplication double-counts and inflates the correlated path far above the unnest path.
79e9bd3 independently adds (num_rows_outer - 1) × inner_cost in CostIndexNLJoin when pdRebinds[1] == GPOPT_DEFAULT_REBINDS. After 0fb8d27, pdRebinds[1] != 1 already (it’s outer_rows), so this guard never fires for Q17. But for the unnest path (different join group), it adds nothing. Net effect: amplifies the correlated-path penalty without any matching adjustment for the unnest path.

Empirical results (Q20 SF=1):

Q20 SF=1	wall clock
origin/main (with both fixes)	3,591 ms
this branch (both reverted)	787 ms ✅
this branch + A-group OLS calibration	787 ms

The “Q20 fix” commits actually regressed Q20 SF=1 by 4.5×. The real fix for Q20 SF=1 is the OLS calibration of NL inner IndexScan costs (cherry-picked here from the tpch branch as commits 0d5956a, 9294709, 250f513).

OLS-Calibrated Cost Model (Cherry-Picked from `tpch`)

The three cherry-picks together replace the MPP-era cost-model constants with values derived from warm-cache PG 18 OLS measurements on TPC-H lineitem index probes:

Param	MPP default	This branch (NL inner)
`dEffectiveRandomFactor`	0.05	4.259e-3 (×0.085)
`dEffectiveScanTupCostUnit`	3.66e-6	1.725e-6 (×0.471)

Derived from: - α = 4.914e-3 ms / probe (B-tree traversal, warm cache) - β = 6.628e-6 ms / (probe·row·byte) - C_unit = 1.154 ms / cost-unit

Activated only when NumRebinds > 1 AND the predicate covers the leading index column. The non-leading-column path keeps the original penalty so plans probing a non-prefix column remain correctly discouraged.

Net Bench Effect (SF=1, 22 queries × 3 runs)

	origin/main	this branch	Δ
Total ORCA	78,718 ms	63,494 ms	−19%
Total PG	61,265 ms	60,608 ms	−1%
ORCA / PG	1.28×	1.05×	−18%

Per-query highlights (origin/main → this branch):

Q	origin/main	this branch	Δ	speedup change
4	4,501 ms	1,421 ms	−68%	0.26× → 0.83×
9	4,491 ms	5,808 ms	+29% ⚠	2.02× → 1.57×
10	3,021 ms	1,711 ms	−43%	1.00× → 1.77×
14	2,289 ms	2,677 ms	+17% ⚠	1.01× → 0.86×
17	7,155 ms	76 ms	−99%	0.24× → 22.44×
20	3,591 ms	787 ms	−78%	0.23× → 1.07×
Other (16 of 22)	< ±5%	within noise

Two queries regressed (Q9 by ~1.3 s, Q14 by ~0.4 s) but both still beat or match PG (1.57× and 0.86×). The 14 s saved on Q17/Q20/Q4/Q10 dwarfs the 1.7 s lost on Q9/Q14.

Why The “Q20 Fix” Was Worse Than No Fix

Both reverted commits were authored after a Q20 timeout regression appeared in the SF=5 internal bench. They successfully made the correlated apply path more expensive than the decorrelated hash join path for Q20 at SF=5 (commit 79e9bd3 reports “30 min → 7 s”). At the same time, however, they made the same correlated-apply path several orders of magnitude more expensive for Q17 at every SF and for Q20 at SF=1, because ORCA’s cost model has no way to distinguish Q17’s correlated apply with cheap per-probe IndexScan inner from Q20’s correlated apply with expensive full-scan inner.

The OLS calibration cherry-picked here addresses the underlying issue more cleanly: it makes the per-probe IndexScan cost itself accurate, so the cost difference between Q17’s correlated path (cheap probes) and Q20’s correlated path (expensive scans) emerges naturally without needing the broad NumRebinds-based amplification.

Empirically the OLS calibration alone fixes both Q17 and Q20 at SF=1, and is sufficient for Q17 at SF=10 too. Q20 at SF=10 is roughly break-even between the two configurations (both ~9 s).

Reproduction

# At repo root, on this branch:
ls test/bench/tpch_bench.sh   # bench harness
PG_CONFIG=$PG_CONFIG bash test/bench/tpch_bench.sh tpch 3
PG_CONFIG=$PG_CONFIG bash test/bench/tpch_bench.sh tpch_sf10 3

To inspect the correlated NL plan:

LOAD 'pg_orca';
SET pg_orca.enable_orca = on;
SET max_parallel_workers_per_gather = 0;
EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
SELECT sum(l_extendedprice) / 7.0 AS avg_yearly
FROM   lineitem, part
WHERE  p_partkey  = l_partkey
  AND  p_brand    = 'Brand#23'
  AND  p_container = 'MED BOX'
  AND  l_quantity < (SELECT 0.2 * avg(l_quantity)
                     FROM   lineitem
                     WHERE  l_partkey = p_partkey);

Expected plan top: Aggregate → Nested Loop → ... → Index Scan using lineitem_partkey_idx (twice, once for the per-part avg subquery and once for the outer join).

Open Questions / Future Work

Q9 / Q14 minor regressions: both still beat PG; root cause is the OLS-calibrated NL cost making one IndexNL look cheap enough to replace a HashJoin that was actually faster. May be addressable by a small additional discount to HashJoin build cost when build side is small (< 1M rows).
Q20 SF=10: ~9 s on either configuration. The unnest path scales linearly with lineitem; correlated path scales linearly with outer rows × per-probe lineitem read. Both are ~10× slower than SF=1, no pathological scaling. Improving SF=10 Q20 would require multi-column selectivity for the partsupp.ps_partkey IN (part filter) semi-join.
Whether to retain risk_threshold mechanism: the current optimizer_index_join_allowed_risk_threshold = 3 is unchanged. With the OLS calibration in place, the risk multiplier is no longer load-bearing for the queries we tested. A future change could either raise the default to 4 (small SF=1 net positive, see Q9 / Q10 trade-off above) or remove the multiplier entirely if it can be shown to never fire under OLS-calibrated costs.

PGXN

PostgreSQL Extension Network

Contents

TPC-H Q17 Root Cause Analysis

Query

Cardinalities

Two Plan Shapes

“Correlated NL + IndexScan” (PG, ORCA on this branch)

“Eager unnest + global HashAgg” (origin/main before this work)

Performance Across Scale Factors

What Changed On This Branch

OLS-Calibrated Cost Model (Cherry-Picked from `tpch`)

Net Bench Effect (SF=1, 22 queries × 3 runs)

Why The “Q20 Fix” Was Worse Than No Fix

Reproduction

Open Questions / Future Work

PGXN

PostgreSQL Extension Network

Contents

TPC-H Q17 Root Cause Analysis

Query

Cardinalities

Two Plan Shapes

“Correlated NL + IndexScan” (PG, ORCA on this branch)

“Eager unnest + global HashAgg” (origin/main before this work)

Performance Across Scale Factors

What Changed On This Branch

OLS-Calibrated Cost Model (Cherry-Picked from tpch)

Net Bench Effect (SF=1, 22 queries × 3 runs)

Why The “Q20 Fix” Was Worse Than No Fix

Reproduction

Open Questions / Future Work

OLS-Calibrated Cost Model (Cherry-Picked from `tpch`)