Contents
D2: Drop Stale Rebind Caching for Correlated Index Scans
Commit: 437f7fe cost: drop stale rebind caching for correlated index scans
Problem
Inner stats for a correlated subtree (e.g. IndexGet(l2 on l_orderkey = outer.col))
used to set NumRebinds = outer_rows at stats-derive time. The outer cardinality
was baked into the group-shared inner stats. After join reorder placed the inner
under a different (smaller) outer, the cached rebinds became stale, and
CostIndexNLJoin multiplied per-probe cost by the original (large) outer row count.
Concrete instance — Q21 SF=1
SemiJoin(supplier ⋈ nation ⋈ l1, l2) had:
- Actual outer rows (group 724): 96,010
- Cached inner
NumRebinds(group 56, l1_filtered): 2,400,253 - LeftSemiIndexNLJoin cost: 31,125 — 6.5× more expensive than HashSemi (4,747)
The optimizer therefore picked Hash Semi Join + full table scan of l2 (6M rows)
instead of NL + Index probe on idx_lineitem_orderkey. PG’s planner picked the
NL+Index path and was 6.4× faster.
Fix
Three coordinated changes:
CJoinStatsProcessor::DeriveStatsWithOuterRefs— no longer callsSetRebinds(num_rows_outer). Inner stats keep rebinds at the default 1.Rows()represents per-probe output; the actual NL outer cardinality is supplied at cost time.CostIndexNLJoin— extends rebind-compensation block to all IndexNL variants (Inner / LeftOuter / LeftSemi / LeftAntiSemi). Previously onlyInnerIndexNLJoincompensated; the others assumed cached inner rebinds equaled outer rows, which is no longer true.CostIndexScan— switches NL-inner detection in the warm-cache OLS path and the heap-fetch term frompci->NumRebinds() > 1toexprhdl.HasOuterRefs(). Cached rebinds are now always 1, so the rebinds check would otherwise classify every correlated probe as a cold standalone scan and add ~100× heap-fetch cost.
Benchmark Methodology
- Hardware: macOS 14.5, ARM64, 16 GB RAM
- PostgreSQL 18 (single-node), warm cache
max_parallel_workers_per_gather = 0optimizer_index_join_allowed_risk_threshold = 3(default)- Median of 3 runs per query (warm-up discarded)
- Fresh
ANALYZEbefore each scale-factor run
Results
| SF | Baseline (no D2) | D2 (this fix) | Δ vs baseline | PG Native | ORCA/PG |
|---|---|---|---|---|---|
| SF=1 | 68,933 ms | 58,510 ms | −15.1% | 42,429 | 1.38× |
| SF=2 | 188,816 ms | 167,583 ms | −11.2% | 246,139 | 0.68× |
| SF=5 | 712,810 ms | 627,078 ms | −12.0% | 382,880 | 1.64× |
SF=2 PG total is inflated by PG’s own poor plan choices on Q5 (62 s) and Q7 (46 s); ORCA beats PG on those.
Per-Query Highlights (SF=5)
Significant improvements (>20%)
| Query | Baseline | D2 | Δ |
|---|---|---|---|
| Q19 | 5,414 ms | 653 ms | −87.9% |
| Q17 | 3,211 ms | 423 ms | −86.8% |
| Q3 | 41,519 ms | 15,546 ms | −62.6% |
| Q2 | 10,420 ms | 4,911 ms | −52.9% |
| Q22 | 8,102 ms | 5,328 ms | −34.2% |
| Q11 | 2,772 ms | 1,873 ms | −32.4% |
| Q21 | 173,058 ms | 130,490 ms | −24.6% |
| Q7 | 16,761 ms | 12,863 ms | −23.3% |
Regressions
| Query | Baseline | D2 | Δ |
|---|---|---|---|
| Q12 | 12,641 ms | 17,778 ms | +40.6% |
| Q10 | 17,748 ms | 21,209 ms | +19.5% |
| Q9 | 156,425 ms | 162,178 ms | +3.7% |
Q12 is the only consistent regression across SF=2/5 (SF=1 it improves). Worth a follow-up investigation.
Q21 across all SF
| SF | Baseline | D2 | Improvement |
|---|---|---|---|
| SF=1 | 10,225 ms | 5,472 ms | −46.5% |
| SF=2 | 33,866 ms | 19,494 ms | −42.4% |
| SF=5 | 173,058 ms | 130,490 ms | −24.6% |
Threshold Sensitivity
Tested optimizer_index_join_allowed_risk_threshold = 10 on SF=5 — total time
went from 641 s to 1,096 s (+71%). Q7 alone regressed from 14 s to 242 s. The
default value 3 is correct for the general case; users can opt into 10 at the
session level for plans that match Q21’s shape.
Files Changed
libgpdbcost/src/CCostModelGPDB.cpp—CostIndexNLJoinandCostIndexScanlibnaucrates/src/statistics/CJoinStatsProcessor.cpp—DeriveStatsWithOuterRefs
Total +20 lines, −10 lines.