Contents
Memory Management
ProvSQL persists provenance circuits across transactions and shares them between PostgreSQL backend processes. This is achieved through memory-mapped files managed by a dedicated background worker, coordinated via PostgreSQL shared memory.
Why Not Regular Tables?
Earlier versions of ProvSQL stored circuits in regular PostgreSQL tables. This worked but had severe performance limitations: the PL/pgSQL functions that insert gates used LOCK to serialize access, preventing parallelism. The current architecture uses memory-mapped files written by a single background worker, avoiding both lock contention and WAL overhead. Benchmarks in the ICDE 2026 paper sen2026provsql show that the mmap implementation scales linearly with dataset size, while an earlier shared-memory variant hit limits at moderate scale factors.
Background Worker: provsql_mmap
The mmap subsystem lives in provsql_mmap.c and MMappedCircuit.cpp.
RegisterProvSQLMMapWorker (called from _PG_init) registers a PostgreSQL background worker with the postmaster. When the server starts, the postmaster forks the worker process, which calls provsql_mmap_worker:
- Initialization: initialize_provsql_mmap opens (or creates) four memory-mapped files that back the circuit data structures. A singleton MMappedCircuit instance is created over these files.
- Main loop: provsql_mmap_main_loop reads gate-creation messages from a pipe, writes them to the mmap store, and sends acknowledgements back. It handles SIGTERM for graceful shutdown.
- Shutdown: destroy_provsql_mmap syncs and closes the memory-mapped files.
Inter-Process Communication
Normal backends (the processes that execute SQL queries) cannot write directly to the mmap files -- only the background worker does. Two anonymous pipes provide bidirectional IPC:
- Main-to-background (pipembr / pipembw): backends send gate-creation requests.
- Background-to-main (pipebmr / pipebmw): the worker sends acknowledgements (e.g., the UUID of a newly created gate).
Pipe writes use buffered macros (STARTWRITEM / ADDWRITEM / SENDWRITEM) that respect PIPE_BUF atomicity guarantees -- each message is delivered as an atomic unit even when multiple backends write concurrently.
Shared Memory: provsql_shmem
The pipe file descriptors and a lightweight lock live in a PostgreSQL shared-memory segment managed by provsql_shmem.c.
The provsqlSharedState structure contains:
- lock -- a PostgreSQL LWLock that serializes pipe writes from multiple backends. Backends acquire it in exclusive mode before writing to the main-to-background pipe, ensuring message atomicity.
- pipebmr / pipebmw -- file descriptors for the background-to-main pipe.
- pipembr / pipembw -- file descriptors for the main-to-background pipe.
Lifecycle:
- provsql_shmem_request (called from shmem_request_hook on PostgreSQL >= 15) reserves the required shared-memory size.
- provsql_shmem_startup (called from shmem_startup_hook) allocates the segment, creates the pipes, and initializes the lock.
Locking helpers (provsql_shmem_lock_exclusive, provsql_shmem_lock_shared, provsql_shmem_unlock) wrap the LWLock API.
Mmap-Backed Data Structures
MMappedCircuit (in MMappedCircuit.cpp) is the persistent circuit store. It holds four mmap-backed containers:
- A gate-type vector mapping gate IDs to their gate_type.
- A wire list storing the children of each gate.
- A UUID hash table (MMappedUUIDHashTable) mapping UUID tokens to gate IDs, enabling O(1) lookup.
- An annotation store for per-gate metadata (probabilities, aggregate info, extra labels).
MMappedVector (MMappedVector.h / MMappedVector.hpp) provides a std::vector-like interface over an mmap region, supporting push_back, random access, and iteration.
MMappedUUIDHashTable (MMappedUUIDHashTable.h) is an open-addressing hash table keyed by 16-byte UUIDs, stored in an mmap region.
These data structures grow by extending the underlying file and remapping. Because only the background worker writes, there are no concurrency issues within the mmap files themselves.
Per-Backend Circuit Cache
Every access to the persistent circuit -- creating a gate, reading a gate type, fetching the children of a gate -- goes through the anonymous pipe to the mmap worker. That pipe trip is cheap but not free, and for a query that touches thousands of gates the round-trips dominate the wall-clock cost of the SQL functions that wrap them. CircuitCache (in CircuitCache.cpp, with a C-linkage shim in circuit_cache.h) is a small in-process cache whose sole purpose is to avoid those round-trips for gates that the same backend has seen recently.
A cache entry stores a gate's UUID, its gate_type, and the list of its children. The cache is backed by a Boost multi_index_container with two indices: a sequenced index (used as FIFO eviction order) and a hashed-unique index on the UUID (for O(1) lookup). It is bounded by a fixed byte budget; when inserting a new entry would exceed the budget, CircuitCache::insert drops the oldest one. The cache is single-threaded: it lives as a file-scope singleton in CircuitCache.cpp, so each PostgreSQL backend process has its own instance and there is no sharing between backends.
The C functions in circuit_cache.h that provsql_mmap.c actually calls are:
- circuit_cache_create_gate -- insert a gate into the cache. Returns true if the gate was already cached, in which case the caller can skip the IPC write. This is the fast path for create_gate: if a query allocates the same gate twice in the same backend (easy to trigger with shared sub-circuits), the second call is a hash-table lookup, not a pipe write.
- circuit_cache_get_type -- look up a gate's type. Returns gate_invalid on a miss; the SQL wrapper then falls back to an IPC read and, on success, re-enters the gate into the cache so that subsequent lookups hit.
- circuit_cache_get_children -- same pattern for the children list, used by get_children.
The cache is write-through, not write-back: every gate-creating call still reaches the mmap worker in the end (either directly, for a cache miss, or because an earlier call in the same backend wrote it through), so the persistent store always reflects the complete circuit even though each individual lookup may resolve locally.
Reading Circuits Back
When a semiring evaluation function is called (e.g., sr_boolean), it needs to build an in-memory circuit from the persistent mmap store. The function getGenericCircuit performs a breadth-first traversal starting from the root gate's UUID, reading gates and wires from the mmap store and constructing a GenericCircuit in process-local memory. This is the primitive used by all circuit readers.
Probability evaluation (probability_evaluate), Shapley and Banzhaf value computation, and the BoolExpr semiring need a pure Boolean circuit (AND / OR / NOT) suitable for knowledge compilation and model counting. They call getBooleanCircuit, which first builds a GenericCircuit via getGenericCircuit and then evaluates it under the BoolExpr semiring to produce a BooleanCircuit.