Querying with Provenance

Once provenance is enabled on one or more tables, ProvSQL transparently rewrites every SQL query to propagate and combine provenance annotations. No changes to query syntax are required.

How It Works

ProvSQL installs a PostgreSQL planner hook (shared_preload_libraries is required for this reason). When a query involves a provenance-enabled table, the hook intercepts the query plan before execution and:

  1. Identifies all relations carrying a provsql column.
  2. Builds a provenance expression that combines the input tokens using the appropriate semiring operations (plus for alternative use of tuples such as in duplicate elimination, times for combined use of tuples such as in joins, monus for difference).
  3. Appends the resulting provenance token to the output as an extra column.

The final provenance token in each output row is a UUID that represents a gate in a provenance circuit – a DAG recording how that result was derived.

Supported SQL Features

The following SQL constructs are supported with full provenance tracking:

  • SELECT … FROM … WHERE (conjunctive queries, multiset semantics)
  • JOIN (inner joins, outer joins, natural joins)
  • LATERAL subqueries
  • Non-recursive CTEs (WITH clauses)
  • Subqueries in the FROM clause (including deeply nested)
  • GROUP BY
  • SELECT DISTINCT (set semantics)
  • UNION and UNION ALL
  • EXCEPT and EXCEPT ALL
  • VALUES tables (treated as having no provenance)
  • Aggregation (SUM, COUNT, MIN, MAX, AVG, COUNT(DISTINCT …), string_agg, array_agg)
  • HAVING (non-matching groups receive zero provenance 𝟘 rather than being filtered out)
  • Window functions (ROW_NUMBER, RANK, SUM OVER, LAG, LEAD, etc. — provenance stays per-row)
  • FILTER clause on aggregates
  • INSERT … SELECT (provenance propagated when target table is provenance-tracked)

Unsupported SQL Features

The following constructs are not currently supported; queries using them will either raise an error or may cause incorrect provenance tracking:

  • Subqueries outside FROM: EXISTS, NOT EXISTS, IN/NOT IN subqueries, scalar subqueries in SELECT
  • Recursive CTEs (WITH RECURSIVE)
  • INTERSECT
  • DISTINCT ON
  • GROUPING SETS, CUBE, ROLLUP
  • Operations on aggregate results requiring comparison or duplicate elimination: DISTINCT on aggregates, UNION/EXCEPT (non-ALL) with aggregates, ORDER BY or GROUP BY on aggregate results from a subquery

For negation or exclusion, use EXCEPT rather than NOT IN. For correlated subqueries, LATERAL can be used as a workaround. For comparison or duplicate elimination on aggregate results, explicitly cast the aggregate column to its base type (e.g., cnt::bigint), which extracts the value but loses the provenance information on that column.

Provenance in Nested Queries

Subqueries in the FROM clause are supported. Each sub-result carries its own provenance, which is further combined by the outer query:

SELECT t.name, provenance()
FROM (
    SELECT name FROM employees WHERE dept = 'R&D'
) t;

CREATE TABLE … AS SELECT

You can materialise a provenance-tracked query result into a new table. The new table automatically inherits provenance from its source:

CREATE TABLE derived AS
SELECT name, dept FROM employees WHERE active;

INSERT … SELECT

When both the source and target tables are provenance-tracked, INSERT … SELECT propagates provenance from the source query to the inserted rows:

CREATE TABLE archive (name VARCHAR, city VARCHAR);
SELECT add_provenance('archive');
INSERT INTO archive SELECT name, city FROM employees WHERE dept = 'R&D';

Each inserted row receives the provenance token computed by the source SELECT, rather than a fresh independent token.

If the target table does not have a provsql column, a warning is emitted indicating that source provenance is lost.

The provenance() Function

In a SELECT list, provenance() returns the provenance UUID of the current output tuple:

SELECT name, provenance() FROM mytable;

The token can be passed to semiring evaluation functions (see semirings) or to probability/Shapley functions.