D2: Drop Stale Rebind Caching for Correlated Index Scans

Commit: 437f7fe cost: drop stale rebind caching for correlated index scans

Problem

Inner stats for a correlated subtree (e.g. IndexGet(l2 on l_orderkey = outer.col)) used to set NumRebinds = outer_rows at stats-derive time. The outer cardinality was baked into the group-shared inner stats. After join reorder placed the inner under a different (smaller) outer, the cached rebinds became stale, and CostIndexNLJoin multiplied per-probe cost by the original (large) outer row count.

Concrete instance — Q21 SF=1

SemiJoin(supplier ⋈ nation ⋈ l1, l2) had:

  • Actual outer rows (group 724): 96,010
  • Cached inner NumRebinds (group 56, l1_filtered): 2,400,253
  • LeftSemiIndexNLJoin cost: 31,125 — 6.5× more expensive than HashSemi (4,747)

The optimizer therefore picked Hash Semi Join + full table scan of l2 (6M rows) instead of NL + Index probe on idx_lineitem_orderkey. PG’s planner picked the NL+Index path and was 6.4× faster.

Fix

Three coordinated changes:

  1. CJoinStatsProcessor::DeriveStatsWithOuterRefs — no longer calls SetRebinds(num_rows_outer). Inner stats keep rebinds at the default 1. Rows() represents per-probe output; the actual NL outer cardinality is supplied at cost time.

  2. CostIndexNLJoin — extends rebind-compensation block to all IndexNL variants (Inner / LeftOuter / LeftSemi / LeftAntiSemi). Previously only InnerIndexNLJoin compensated; the others assumed cached inner rebinds equaled outer rows, which is no longer true.

  3. CostIndexScan — switches NL-inner detection in the warm-cache OLS path and the heap-fetch term from pci->NumRebinds() > 1 to exprhdl.HasOuterRefs(). Cached rebinds are now always 1, so the rebinds check would otherwise classify every correlated probe as a cold standalone scan and add ~100× heap-fetch cost.

Benchmark Methodology

  • Hardware: macOS 14.5, ARM64, 16 GB RAM
  • PostgreSQL 18 (single-node), warm cache
  • max_parallel_workers_per_gather = 0
  • optimizer_index_join_allowed_risk_threshold = 3 (default)
  • Median of 3 runs per query (warm-up discarded)
  • Fresh ANALYZE before each scale-factor run

Results

SF Baseline (no D2) D2 (this fix) Δ vs baseline PG Native ORCA/PG
SF=1 68,933 ms 58,510 ms −15.1% 42,429 1.38×
SF=2 188,816 ms 167,583 ms −11.2% 246,139 0.68×
SF=5 712,810 ms 627,078 ms −12.0% 382,880 1.64×

SF=2 PG total is inflated by PG’s own poor plan choices on Q5 (62 s) and Q7 (46 s); ORCA beats PG on those.

Per-Query Highlights (SF=5)

Significant improvements (>20%)

Query Baseline D2 Δ
Q19 5,414 ms 653 ms −87.9%
Q17 3,211 ms 423 ms −86.8%
Q3 41,519 ms 15,546 ms −62.6%
Q2 10,420 ms 4,911 ms −52.9%
Q22 8,102 ms 5,328 ms −34.2%
Q11 2,772 ms 1,873 ms −32.4%
Q21 173,058 ms 130,490 ms −24.6%
Q7 16,761 ms 12,863 ms −23.3%

Regressions

Query Baseline D2 Δ
Q12 12,641 ms 17,778 ms +40.6%
Q10 17,748 ms 21,209 ms +19.5%
Q9 156,425 ms 162,178 ms +3.7%

Q12 is the only consistent regression across SF=2/5 (SF=1 it improves). Worth a follow-up investigation.

Q21 across all SF

SF Baseline D2 Improvement
SF=1 10,225 ms 5,472 ms −46.5%
SF=2 33,866 ms 19,494 ms −42.4%
SF=5 173,058 ms 130,490 ms −24.6%

Threshold Sensitivity

Tested optimizer_index_join_allowed_risk_threshold = 10 on SF=5 — total time went from 641 s to 1,096 s (+71%). Q7 alone regressed from 14 s to 242 s. The default value 3 is correct for the general case; users can opt into 10 at the session level for plans that match Q21’s shape.

Files Changed

  • libgpdbcost/src/CCostModelGPDB.cppCostIndexNLJoin and CostIndexScan
  • libnaucrates/src/statistics/CJoinStatsProcessor.cppDeriveStatsWithOuterRefs

Total +20 lines, −10 lines.