README / PostgreSQL Extension Network

Compares re2 throughput against PostgreSQL's built-in POSIX regex engine (ARE).

Benchmark

| Category | re2 | builtin |
|---|---|---|
| match | `re2match` | `regexp_like` |
| extract | `re2extract` | `regexp_substr` |
| extract all | `re2extractall` | `regexp_matches(…, 'g')` |
| replace one | `re2replaceregexpone` | `regexp_replace` |
| replace all | `re2replaceregexpall` | `regexp_replace(…, 'g')` |
| count matches | `re2countmatches` | `regexp_count` |

Patterns span literal, character class, alternation, nested quantifier, IP / email validation, deep alternation, and a ReDoS-shaped `(e?){10}e{10}` case. Both RE2 (automaton) and PG (ARE) handle the last one without catastrophic backtracking.
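
To make the last case concrete, the sketch below builds the same pattern shape for a given `n` and runs it through Python's `re` module. Python's `re` is a backtracking engine and is used here only as an illustration; it is neither RE2 nor PostgreSQL's ARE engine, and the helper name `redos_pattern` is hypothetical.

```python
import re

def redos_pattern(n: int) -> str:
    """Build (e?){n}e{n}: n optional 'e's followed by n required 'e's."""
    return "(e?){%d}e{%d}" % (n, n)

pattern = redos_pattern(10)  # the shape used in the benchmark

# A run of exactly n 'e's matches only when every optional group matches
# the empty string. A backtracking engine tries the greedy choices first,
# so the work grows quickly with n; n = 10 is still cheap.
matched = re.fullmatch(pattern, "e" * 10) is not None

# With fewer than n 'e's the required part can never be satisfied.
rejected = re.fullmatch(pattern, "e" * 9) is None
```

Raising `n` in a backtracking engine makes the cost grow roughly exponentially, which is the behaviour the benchmark pattern is shaped to provoke.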

Data is 10000 rows of:

- email ~40 chars
- logline ~200 chars
- longtext ~2000 chars (400 words)

Index scans

re2 also speeds up re2match through two index mechanisms (see Index Support). These queries compare each against the equivalent PostgreSQL index scan over a separate 100000-row table.

| Mechanism | re2 | postgres |
|---|---|---|
| b-tree prefix range | `re2match(col, '^lit')` | `col ~ '^lit'` |
| GIN trigram | `col @~ pat` (`gin_re2_ops`) | `col ~ pat` (`gin_trgm_ops`) |
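
The b-tree prefix range mechanism rests on a general idea: an anchored literal prefix such as `^lit` can be turned into a half-open range that an ordered index can scan. The following is a minimal sketch of that idea, assuming plain code-point ordering (as with a `text_pattern_ops`-style comparison). The function name `prefix_range` is hypothetical, and this is not the extension's or PostgreSQL's actual planner code.

```python
from typing import Optional, Tuple

def prefix_range(prefix: str) -> Tuple[str, Optional[str]]:
    """Return (lower, upper) so that s.startswith(prefix)
    is equivalent to lower <= s < upper under code-point ordering.
    upper is None when no finite upper bound exists."""
    if not prefix:
        raise ValueError("prefix must be non-empty")
    chars = list(prefix)
    while chars:
        last = ord(chars[-1])
        if last < 0x10FFFF:
            # Increment the last character to get the smallest string
            # that sorts after every string starting with the prefix.
            chars[-1] = chr(last + 1)
            return prefix, "".join(chars)
        chars.pop()  # last character is already the maximum; carry left
    return prefix, None

lower, upper = prefix_range("user5")
rows = ["user49", "user5", "user512", "user6", "vuser5"]
in_range = [r for r in rows if lower <= r < upper]
```

Under these assumptions a pattern like `^user5` corresponds to a condition of the shape `col >= 'user5' AND col < 'user6'`, which is what lets an ordered index skip non-matching rows.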

Index benchmark

| Category | Pattern | rows | re2 | postgres | re2 vs postgres |
|---|---|---|---|---|---|
| btree | `^user5` | 11111 | 1.8 ms | 3.5 ms | 1.9x faster |
| btree | `^user12[0-9]` | 1110 | 0.21 ms | 0.43 ms | 2.0x faster |
| gin | `error_code=123` | 100 | 3.3 ms | 3.6 ms | 1.1x faster |
| gin | `error_code=(100\|200\|300)` | 301 | 3.5 ms | 4.9 ms | 1.4x faster |
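
The last column appears to be the postgres time divided by the re2 time, rounded to one decimal place; that reading is consistent with all four rows. A small sketch under that assumption (the function name `speedup` and the "slower" wording for ratios below 1 are hypothetical, since the table only shows "faster" cases):

```python
def speedup(re2_ms: float, postgres_ms: float) -> str:
    """Format the re2-vs-postgres ratio the way the table column reads."""
    ratio = postgres_ms / re2_ms
    if ratio >= 1:
        return f"{ratio:.1f}x faster"
    # Assumed wording for the opposite case; not taken from the table.
    return f"{1 / ratio:.1f}x slower"

# Timings copied from the table above (milliseconds).
labels = [
    speedup(1.8, 3.5),
    speedup(0.21, 0.43),
    speedup(3.3, 3.6),
    speedup(3.5, 4.9),
]
```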

The two GIN opclasses extract keys differently. `pg_trgm` builds trigrams from alphanumeric words only (never spanning `_`, `=`, …) and prunes the extracted trigrams under a fixed penalty budget tuned for natural-language text. `gin_re2_ops` keeps every byte trigram of each literal atom that RE2’s FilteredRE2 requires. On punctuated machine-text patterns (e.g. `error_code=42[0-9]` over loglines where `error_code=` appears in every row), pruning can leave `pg_trgm` with only ubiquitous trigrams, so its scan degenerates to a full-index scan, while `gin_re2_ops` stays selective and runs an order of magnitude faster. On plain-word patterns both extract similar keys and `pg_trgm`’s cheaper consistent check can win (see `error_code=123` above).
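
The difference in key extraction can be illustrated with a simplified model. The sketch below contrasts word-based trigrams (alphanumeric words, lowercased, padded with two leading spaces and one trailing space, as `pg_trgm` documents for plain text) with every byte trigram of a literal. Both function names are hypothetical, the penalty-based pruning is not modelled, and neither function is the actual extraction code of `pg_trgm` or `gin_re2_ops`.

```python
import re
from typing import Set

def word_trigrams(text: str) -> Set[str]:
    """Trigrams from alphanumeric words only; punctuation splits words."""
    grams: Set[str] = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # two leading spaces, one trailing
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams

def byte_trigrams(literal: str) -> Set[bytes]:
    """Every overlapping 3-byte window of the literal, punctuation included."""
    data = literal.encode("utf-8")
    return {data[i:i + 3] for i in range(len(data) - 2)}

words = word_trigrams("error_code=123")
raw = byte_trigrams("error_code=123")

# Word-based keys never contain '_' or '=', so they cannot tell
# "error_code=123" apart from rows that merely contain the three words.
spans_punctuation = any("_" in g or "=" in g for g in words)
```

In this model the byte trigrams include keys such as `r_c` and `e=1` that tie the words together, which is the kind of selectivity the paragraph above attributes to `gin_re2_ops`.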

Methodology

- JIT and query parallelism are disabled so that single-threaded engine throughput is compared reliably
- `gen_graph.py` takes the median time per (pattern, engine) across all iterations
- Index scans use a `text_pattern_ops` b-tree and two GIN indexes on one table; `enable_seqscan` is off there so both engines are measured on their index
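
The median aggregation described above can be sketched as follows. This is an illustration of the technique only, not the code in `gen_graph.py`; the function name `median_times` and the input shape (tuples of pattern, engine, elapsed milliseconds) are assumptions.

```python
from collections import defaultdict
from statistics import median

def median_times(samples):
    """Group (pattern, engine, elapsed_ms) samples and take the median per group."""
    grouped = defaultdict(list)
    for pattern, engine, elapsed_ms in samples:
        grouped[(pattern, engine)].append(elapsed_ms)
    return {key: median(values) for key, values in grouped.items()}

# Made-up timings for illustration only; these are not benchmark results.
samples = [
    ("^user5", "re2", 1.0),
    ("^user5", "re2", 5.0),   # one slow outlier iteration
    ("^user5", "re2", 2.0),
    ("^user5", "postgres", 3.0),
    ("^user5", "postgres", 4.0),
]
result = median_times(samples)
```

Using the median rather than the mean keeps a single slow iteration, such as the outlier above, from skewing the reported time.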

Running

Requires re2 (see README) and PostgreSQL 15+ for builtin comparisons. The index-scan section additionally needs the pg_trgm contrib extension; setup.sql creates it.

Connection uses libpq environment variables; override the psql binary with PSQL:

PGDATABASE=mydb ./run_bench.sh        # 5 iterations (default)
PGDATABASE=mydb ./run_bench.sh 10     # 10 iterations
./gen_graph.py                        # regenerate graph.png & graph_index.png
