D2: Drop Stale Rebind Caching for Correlated Index Scans

D2: Drop Stale Rebind Caching for Correlated Index Scans

Commit: 437f7fe cost: drop stale rebind caching for correlated index scans

Problem

Inner stats for a correlated subtree (e.g. IndexGet(l2 on l_orderkey = outer.col)) used to set NumRebinds = outer_rows at stats-derive time. The outer cardinality was baked into the group-shared inner stats. After join reorder placed the inner under a different (smaller) outer, the cached rebinds became stale, and CostIndexNLJoin multiplied per-probe cost by the original (large) outer row count.

Concrete instance — Q21 SF=1

SemiJoin(supplier ⋈ nation ⋈ l1, l2) had:

Actual outer rows (group 724): 96,010
Cached inner NumRebinds (group 56, l1_filtered): 2,400,253
LeftSemiIndexNLJoin cost: 31,125 — 6.5× more expensive than HashSemi (4,747)

The optimizer therefore picked Hash Semi Join + full table scan of l2 (6M rows) instead of NL + Index probe on idx_lineitem_orderkey. PG’s planner picked the NL+Index path and was 6.4× faster.

Fix

Three coordinated changes:

CJoinStatsProcessor::DeriveStatsWithOuterRefs — no longer calls SetRebinds(num_rows_outer). Inner stats keep rebinds at the default 1. Rows() represents per-probe output; the actual NL outer cardinality is supplied at cost time.
CostIndexNLJoin — extends rebind-compensation block to all IndexNL variants (Inner / LeftOuter / LeftSemi / LeftAntiSemi). Previously only InnerIndexNLJoin compensated; the others assumed cached inner rebinds equaled outer rows, which is no longer true.
CostIndexScan — switches NL-inner detection in the warm-cache OLS path and the heap-fetch term from pci->NumRebinds() > 1 to exprhdl.HasOuterRefs(). Cached rebinds are now always 1, so the rebinds check would otherwise classify every correlated probe as a cold standalone scan and add ~100× heap-fetch cost.

Benchmark Methodology

Hardware: macOS 14.5, ARM64, 16 GB RAM
PostgreSQL 18 (single-node), warm cache
max_parallel_workers_per_gather = 0
optimizer_index_join_allowed_risk_threshold = 3 (default)
Median of 3 runs per query (warm-up discarded)
Fresh ANALYZE before each scale-factor run

Results

SF	Baseline (no D2)	D2 (this fix)	Δ vs baseline	PG Native	ORCA/PG
SF=1	68,933 ms	58,510 ms	−15.1%	42,429	1.38×
SF=2	188,816 ms	167,583 ms	−11.2%	246,139	0.68×
SF=5	712,810 ms	627,078 ms	−12.0%	382,880	1.64×

SF=2 PG total is inflated by PG’s own poor plan choices on Q5 (62 s) and Q7 (46 s); ORCA beats PG on those.

Per-Query Highlights (SF=5)

Significant improvements (>20%)

Query	Baseline	D2	Δ
Q19	5,414 ms	653 ms	−87.9%
Q17	3,211 ms	423 ms	−86.8%
Q3	41,519 ms	15,546 ms	−62.6%
Q2	10,420 ms	4,911 ms	−52.9%
Q22	8,102 ms	5,328 ms	−34.2%
Q11	2,772 ms	1,873 ms	−32.4%
Q21	173,058 ms	130,490 ms	−24.6%
Q7	16,761 ms	12,863 ms	−23.3%

Regressions

Query	Baseline	D2	Δ
Q12	12,641 ms	17,778 ms	+40.6%
Q10	17,748 ms	21,209 ms	+19.5%
Q9	156,425 ms	162,178 ms	+3.7%

Q12 is the only consistent regression across SF=2/5 (SF=1 it improves). Worth a follow-up investigation.

Q21 across all SF

SF	Baseline	D2	Improvement
SF=1	10,225 ms	5,472 ms	−46.5%
SF=2	33,866 ms	19,494 ms	−42.4%
SF=5	173,058 ms	130,490 ms	−24.6%

Threshold Sensitivity

Tested optimizer_index_join_allowed_risk_threshold = 10 on SF=5 — total time went from 641 s to 1,096 s (+71%). Q7 alone regressed from 14 s to 242 s. The default value 3 is correct for the general case; users can opt into 10 at the session level for plans that match Q21’s shape.

Files Changed

libgpdbcost/src/CCostModelGPDB.cpp — CostIndexNLJoin and CostIndexScan
libnaucrates/src/statistics/CJoinStatsProcessor.cpp — DeriveStatsWithOuterRefs

Total +20 lines, −10 lines.

PGXN

PostgreSQL Extension Network

Contents