Contents
Where-Provenance
Where-provenance is the finest-grained form of provenance ProvSQL tracks: for every output cell, it records the set of source table cells that the value was copied from. This is in contrast to the semiring (why-provenance and friends) world, which works at the tuple level: a semiring annotation tells you which input rows contributed to an output row, but says nothing about where each individual value came from. See ../user/where-provenance for the user-facing description.
The persistent circuit store does not change when provsql.where_provenance is enabled: it is still the same DAG of gates held in the mmap backend (see memory). What changes is that the rewriter emits two additional gate types (project and eq) and -- on the C++ side -- where-provenance queries use a different in-memory reconstruction class (WhereCircuit) whose evaluator implements column-wise semantics, instead of the GenericCircuit / BooleanCircuit classes used by the semiring evaluators.
Why a Separate Circuit Class?
A semiring takes a value v and combines values with plus / times / monus, but the result is just another value of the same carrier type. Where-provenance is fundamentally different: the result is a vector of locator sets, one entry per output column. Plus and times must be promoted to act column-wise, and two new operators (project and eq) appear that have no semiring counterpart at all. Rather than shoehorn this into the semiring interface, ProvSQL uses a dedicated WhereCircuit class whose evaluator implements column-wise semantics directly. The underlying gate DAG that WhereCircuit walks is a sub-view of the same persistent mmap store that semiring evaluation reads from -- the two views just pay attention to different gate types.
The WhereCircuit Data Model
A WhereCircuit is the in-memory reconstruction of a sub-DAG of the mmap store specialised for where-provenance evaluation. Every gate has one of the following types (declared as WhereGate in WhereCircuit.h):
- IN -- a leaf representing one source tuple. Carries the origin table name, the tuple's UUID, and the number of columns.
- TIMES -- conjunction of children's locator vectors, concatenated column-wise. Used for joins / cross products.
- PLUS -- disjunction of children's locator vectors, unioned column-wise. Used for UNION and duplicate elimination. All children must have the same number of columns.
- PROJECT -- restricts and reorders the columns of its single child. Carries an integer vector positions[i] = j meaning "output column i takes its locators from input column j" (or 0 for "no source", e.g. an expression that is not a bare Var).
- EQ -- merges the locator sets of two columns of its single child. Used for equijoins: when A.x = B.y is part of the join condition, the EQ gate makes the output column for A.x carry the locators of both A.x and B.y.
A WhereCircuit is reconstructed in memory by where_provenance.cpp from the persistent mmap store every time the SQL function where_provenance(token) is called.
Evaluation: Locators
The output of evaluation is a std::vector<std::set<Locator>>, one entry per output column. A Locator (in WhereCircuit.h) is a triple (table, tuple_uuid, column_position) -- the address of exactly one cell in one base table. The interpretation of each gate type is straightforward:
- IN (table, uuid, n) returns a vector of length n, with cell i set to {(table, uuid, i)}.
- TIMES concatenates the vectors of its children (column-wise concatenation). This is the right operation because joins place the columns of the two operands side by side.
- PLUS takes the union of locator sets, column by column. Children must agree on the number of columns.
- PROJECT with positions = [p_1, ..., p_k] returns a vector of length k whose i-th entry is the child's pi-th column (or ∅ if pi = 0).
- EQ with positions (i, j) evaluates the child to vector v, then sets vi ← vi∪vj and vj ← vj∪vi.
The output of where_provenance(token) is rendered as {[loc;loc;...],[loc;...],...} -- one bracket-pair per output column, locators inside separated by semicolons.
Building the Circuit During Query Rewriting
When provsql.where_provenance is on, make_provenance_expression builds the where-provenance fragment in three phases on top of the regular semiring expression (see query-rewriting for the surrounding context):
- Column map. build_column_map walks the range table and assigns a global integer position to every output column of every base RTE. The provsql column itself is mapped to -1; join RTEs (which redirect to base columns) are mapped to 0. The result is a 2-D array columns[rte][att] used by the next two phases.
- EQ gates from join conditions. add_eq_from_Quals_to_Expr walks the ON clauses of the FROM jointree and the top-level WHERE clause looking for equality predicates of the form Var = Var. For each one, add_eq_from_OpExpr_to_Expr looks up both columns in the column map and wraps the current provenance expression in a provenance_eq(expr, col1, col2) call -- creating an EQ gate that records which two columns were compared.
- PROJECT gate from the SELECT list. After EQ gates have been added, make_provenance_expression walks the target list one more time, building an integer array of column positions that the SELECT clause keeps and in what order. If that ordering differs from the identity ([1, 2, 3, ...] covering every column of every input), the expression is wrapped in a provenance_project(expr, positions...) call. Output columns that are expressions rather than bare Vars get position 0 -- they have no source cell.
The result is a single token that, when evaluated by where_provenance, walks the circuit, applies the gate semantics above, and returns the per-column locator sets.
Sub-Circuit Materialization
To evaluate a where-provenance token, the C function where_provenance.cpp doesn't traverse the persistent mmap store directly; instead it calls the SQL helper provsql.sub_circuit_for_where(token), which walks the circuit recursively from the root and returns one row per gate, including the metadata each gate type needs (table name and column count for IN, position pair for EQ, projection vector for PROJECT, etc.). The C function then rebuilds a WhereCircuit from those rows and calls evaluate.
The reason for this round-trip is that where-provenance gates carry structured annotations (column lists, position pairs, table OIDs) that the generic gate inspection functions used by the semiring evaluator cannot return in one shot. The SQL helper joins all the relevant catalogs and returns everything the C reconstructor needs.
Why-Provenance vs Where-Provenance
It is worth being explicit about how the two flavours of provenance differ at the level of the data structures the developer guide describes:
| Why-provenance (semirings) | Where-provenance | |
|---|---|---|
| Granularity | Per output tuple | Per output cell |
| Output type | A semiring value (Boolean, integer, formula...) | vector<set<Locator>> |
| In-memory class | GenericCircuit / BooleanCircuit | WhereCircuit |
| Gate types used | input, plus, times, monus, delta, agg, semimod, value, cmp | input, plus, times, project, eq |
| Extra gates emitted | Always (when ProvSQL is active) | Only when provsql.where_provenance is on |
| Read by | provenance_evaluate, probability_evaluate, sr_*... | where_provenance |
Where-provenance does not interact with aggregation, HAVING, EXCEPT, or set operations beyond UNION ALL -- features covered by the semiring side -- because cell-level lineage has no agreed semantics under those operators. Enabling provsql.where_provenance does not turn off the semiring side; the same persistent circuit then contains both the regular semiring gates and the extra project / eq gates, and each gate type is interpreted by the evaluator that cares about it.