Biscuit - High-Performance Pattern Matching Index for PostgreSQL

Biscuit - High-Performance Pattern Matching Index for PostgreSQL

Biscuit is a specialized PostgreSQL index access method (IAM) designed for blazing-fast pattern matching on LIKE and ILIKE queries, with native support for multi-column searches. It eliminates the recheck overhead of trigram indexes while delivering significant performance improvements on wildcard-heavy queries. It stands for Bitmap Indexed Searching with Comprehensive Union and Intersection Techniques.

What’s new in 2.2.1?

🐞 Bug Fixes

Fixed recursive pattern matching

Resolved incorrect behavior when evaluating nested or repeated wildcard patterns during recursive matching.
Corrected underscore (_) handling in single-column indexing

_ now correctly operates on character-based offsets (not byte offsets), in accordance with SQL LIKE / ILIKE semantics, eliminating false matches in multi-byte UTF-8 text.

🛡️ Correctness & Stability

Improved internal consistency between single-column and multi-column pattern evaluation paths.
Resolved observed edge cases that could lead to incorrect matches under complex wildcard patterns.

Installation

Requirements

Build tools: gcc, make, pg_config
Recommended: CRoaring library for enhanced performance

From Source

# Clone repository
git clone https://github.com/Crystallinecore/biscuit.git
cd biscuit

# Build and install
make
sudo make install

# Enable in PostgreSQL
psql -d your_database -c "CREATE EXTENSION biscuit;"

From PGXN

pgxn install biscuit
psql -d your_database -c "CREATE EXTENSION biscuit;"

Quick Start

Basic Usage

-- Create a Biscuit index
CREATE INDEX idx_users_name ON users USING biscuit(name);

-- Query with wildcard patterns
SELECT * FROM users WHERE name LIKE '%john%';
SELECT * FROM users WHERE name NOT LIKE 'a%b%c';
SELECT COUNT(*) FROM users WHERE name LIKE '%test%';

Multi-Column Indexes

-- Create multi-column index
CREATE INDEX idx_products_search 
ON products USING biscuit(name, description, category);

-- Multi-column query (optimized automatically)
SELECT * FROM products 
WHERE name LIKE '%widget%' 
  AND description LIKE '%blue%'
  AND category LIKE 'electronics%'
LIMIT 10;

How It Works

Core Concept: Bitmap Position Indices

Biscuit builds the following bitmaps for every string:

1. Positive Indices (Forward)

Tracks which records have character c at position p:

String: "Hello"
Bitmaps:
  H@0 → {record_ids...}
  e@1 → {record_ids...}
  l@2 → {record_ids...}
  l@3 → {record_ids...}
  o@4 → {record_ids...}

2. Negative Indices (Backward)

Tracks which records have character c at position -p from the end:

String: "Hello"
Bitmaps:
  o@-1 → {record_ids...}  (last char)
  l@-2 → {record_ids...}  (second to last)
  l@-3 → {record_ids...}
  e@-4 → {record_ids...}
  H@-5 → {record_ids...}

3. Positive Indices (Case-insensitive)

Tracks which records have character c at position p:

String: "Hello"
Bitmaps:
  h@0 → {record_ids...}
  e@1 → {record_ids...}
  l@2 → {record_ids...}
  l@3 → {record_ids...}
  o@4 → {record_ids...}

4. Negative Indices (Case-insensitive)

Tracks which records have character c at position -p from the end:

String: "Hello"
Bitmaps:
  o@-1 → {record_ids...}  (last char)
  l@-2 → {record_ids...}  (second to last)
  l@-3 → {record_ids...}
  e@-4 → {record_ids...}
  h@-5 → {record_ids...}

5. Length Bitmaps

Two types for fast length filtering: - Exact length: length[5] → all 5-character strings - Minimum length: length_ge[3] → all strings ≥ 3 characters

Pattern Matching Algorithm

Example: `LIKE 'abc%def'`

Step 1: Parse pattern into parts Parts: ["abc", "def"] Starts with %: NO Ends with %: NO

Step 2: Match first part as prefix sql -- "abc" must start at position 0 Candidates = pos[a@0] ∩ pos[b@1] ∩ pos[c@2]

Step 3: Match last part at end (negative indexing) sql -- "def" must end at string end Candidates = Candidates ∩ neg[f@-1] ∩ neg[e@-2] ∩ neg[d@-3]

Step 4: Apply length constraint sql -- String must be at least 6 chars (abc + def) Candidates = Candidates ∩ length_ge[6]

Result: Exact matches, zero false positives

Why It’s Fast

1. Pure Bitmap Operations

// Traditional approach (pg_trgm)
for each trigram in pattern:
    candidates = scan_trigram_index(trigram)
    for each candidate:
        if !heap_fetch_and_recheck(candidate):  // SLOW: Random I/O
            remove candidate

// Biscuit approach
for each character at position:
    candidates &= bitmap[char][pos]  // FAST: In-memory AND
// No recheck needed!

2. Roaring Bitmaps

Compressed bitmap representation: - Sparse data: array of integers - Dense data: bitset - Automatic conversion for optimal memory

3. Negative Indexing Optimization

-- Pattern: '%xyz'
-- Traditional: Scan all strings, check suffix
-- Biscuit: Direct lookup in neg[z@-1] ∩ neg[y@-2] ∩ neg[x@-3]

12 Performance Optimizations

1. Skip Wildcard Intersections

// Pattern: "a_c" (underscore = any char)
// OLD: Intersect all 256 chars at position 1
// NEW: Skip position 1 entirely, only check a@0 and c@2

2. Early Termination on Empty

result = bitmap[a][0];
result &= bitmap[b][1];
if (result.empty()) return empty;  // Don't process remaining chars

3. Avoid Redundant Bitmap Copies

// OLD: Copy bitmap for every operation
// NEW: Operate in-place, copy only when branching

4. Optimized Single-Part Patterns

Fast paths for common cases: - Exact: 'abc' → Check position 0-2 and length = 3 - Prefix: 'abc%' → Check position 0-2 and length ≥ 3 - Suffix: '%xyz' → Check negative positions -3 to -1 and length ≥ 3 - Substring: '%abc%' → Check all positions, OR results

5. Skip Unnecessary Length Operations

// Pure wildcard patterns
if (pattern == "%%%___%%")  // 3 underscores
    return length_ge[3];     // No character checks needed!

6. TID Sorting for Sequential Heap Access

// Sort TIDs by (block_number, offset) before returning
// Converts random I/O into sequential I/O
// Uses radix sort for >5000 TIDs, quicksort for smaller sets

7. Batch TID Insertion

// For bitmap scans, insert TIDs in chunks
for (i = 0; i < num_results; i += 10000) {
    tbm_add_tuples(tbm, &tids[i], batch_size, false);
}

8. Direct Roaring Iteration

// OLD: Convert bitmap to array, then iterate
// NEW: Direct iterator, no intermediate allocation
roaring_uint32_iterator_t *iter = roaring_create_iterator(bitmap);
while (iter->has_value) {
    process(iter->current_value);
    roaring_advance_uint32_iterator(iter);
}

9. Batch Cleanup on Threshold

// After 1000 deletes, clean tombstones from all bitmaps
if (tombstone_count >= 1000) {
    for each bitmap:
        bitmap &= ~tombstones;  // Batch operation
    tombstones.clear();
}

10. Aggregate Query Detection

// COUNT(*), EXISTS, etc. don't need sorted TIDs
if (!scan->xs_want_itup) {
    skip_sorting = true;  // Save sorting time
}

11. LIMIT-Aware TID Collection

// If LIMIT 10 in query, don't collect more than needed
if (limit_hint > 0 && collected >= limit_hint)
    break;  // Early termination

12. Multi-Column Query Optimization

Predicate Reordering

Analyzes each column’s pattern and executes in order of selectivity:

-- Query:
WHERE name LIKE '%common%'           -- Low selectivity
  AND sku LIKE 'PROD-2024-%'         -- High selectivity (prefix)
  AND description LIKE '%rare_word%' -- Medium selectivity

-- Execution order (Biscuit automatically reorders):
1. sku LIKE 'PROD-2024-%'         (PREFIX, priority=20, selectivity=0.02)
2. description LIKE '%rare_word%' (SUBSTRING, priority=35, selectivity=0.15)
3. name LIKE '%common%'           (SUBSTRING, priority=55, selectivity=0.60)

Selectivity scoring formula: score = 1.0 / (concrete_chars + 1) - (underscore_count × 0.05) + (partition_count × 0.15) - (anchor_strength / 200)

Priority tiers: 1. 0-10: Exact matches, many underscores 2. 10-20: Non-% patterns with underscores 3. 20-30: Strong anchored patterns (prefix/suffix) 4. 30-40: Weak anchored patterns 5. 40-50: Multi-partition patterns 6. 50-60: Substring patterns (lowest priority)

Benchmarking

Setup Test Data

-- Create 1M row test table
CREATE TABLE benchmark (
    id SERIAL PRIMARY KEY,
    name TEXT,
    description TEXT,
    category TEXT,
    score FLOAT
);

INSERT INTO benchmark (name, description, category, score)
SELECT 
    'Name_' || md5(random()::text),
    'Description_' || md5(random()::text),
    'Category_' || (random() * 100)::int,
    random() * 1000
FROM generate_series(1, 1000000);

-- Create indexes
CREATE INDEX idx_trgm ON benchmark 
    USING gin(name gin_trgm_ops, description gin_trgm_ops);

CREATE INDEX idx_biscuit ON benchmark 
    USING biscuit(name, description, category);

ANALYZE benchmark;

Run Benchmarks

-- Single column, simple pattern
EXPLAIN ANALYZE
SELECT * FROM benchmark WHERE name LIKE '%abc%' LIMIT 100;

-- Multi-column, complex pattern
EXPLAIN ANALYZE
SELECT * FROM benchmark 
WHERE name LIKE '%a%b' 
  AND description LIKE '%bc%cd%'
ORDER BY score DESC 
LIMIT 10;

-- Aggregate query (COUNT)
EXPLAIN ANALYZE
SELECT COUNT(*) FROM benchmark 
WHERE name LIKE 'a%l%' 
  AND category LIKE 'f%d';

-- Complex multi-part pattern
EXPLAIN ANALYZE
SELECT * FROM benchmark 
WHERE description LIKE 'u%dc%x'
LIMIT 50;

View Index Statistics

-- Show internal statistics
SELECT biscuit_index_stats('idx_biscuit'::regclass);

Output:

----------------------------------------------------
 Biscuit Index Statistics (FULLY OPTIMIZED)        +
 ==========================================        +
 Index: idx_biscuit                                +
 Active records: 1000002                           +
 Total slots: 1000002                              +
 Free slots: 0                                     +
 Tombstones: 0                                     +
 Max length: 44                                    +
 ------------------------                          +
 CRUD Statistics:                                  +
   Inserts: 0                                      +
   Updates: 0                                      +
   Deletes: 0                                      +
 ------------------------                          +
 Active Optimizations:                             +
   ✓ 1. Skip wildcard intersections                +
   ✓ 2. Early termination on empty                 +
   ✓ 3. Avoid redundant copies                     +
   ✓ 4. Optimized single-part patterns             +
   ✓ 5. Skip unnecessary length ops                +
   ✓ 6. TID sorting for sequential I/O             +
   ✓ 7. Batch TID insertion                        +
   ✓ 8. Direct bitmap iteration                    +
   ✓ 9. Parallel bitmap scan support               +
   ✓ 10. Batch cleanup on threshold                +
   ✓ 11. Skip sorting for bitmap scans (aggregates)+
   ✓ 12. LIMIT-aware TID collection                +

Use Cases

1. Full-Text Search Applications

-- E-commerce product search
CREATE INDEX idx_products ON products 
    USING biscuit(name, brand, description);

SELECT * FROM products 
WHERE name LIKE '%laptop%' 
  AND brand LIKE 'ABC%'
  AND description LIKE '%gaming%'
ORDER BY price DESC 
LIMIT 20;

2. Log Analysis

-- Search error logs
CREATE INDEX idx_logs ON logs 
    USING biscuit(message, source, level);

SELECT * FROM logs 
WHERE message LIKE '%ERROR%connection%timeout%'
  AND source LIKE 'api.%'
  AND timestamp > NOW() - INTERVAL '1 hour'
LIMIT 100;

3. Customer Support / CRM

-- Search tickets by multiple fields
CREATE INDEX idx_tickets ON tickets 
    USING biscuit(subject, description, customer_name);

SELECT * FROM tickets 
WHERE subject LIKE '%refund%'
  AND customer_name LIKE 'John%'
  AND status = 'open';

4. Code Search / Documentation

-- Search code repositories
CREATE INDEX idx_files ON code_files 
    USING biscuit(filename, content, author);

SELECT * FROM code_files 
WHERE filename LIKE '%.py'
  AND content LIKE '%def%parse%json%'
  AND author LIKE 'team-%';

5. Analytics with Aggregates

-- Fast COUNT queries (no sorting overhead)
CREATE INDEX idx_events ON events 
    USING biscuit(event_type, user_agent, referrer);

SELECT COUNT(*) FROM events 
WHERE event_type LIKE 'click%'
  AND user_agent LIKE '%Mobile%'
  AND referrer LIKE '%google%';

Configuration

Build Options

Enable CRoaring for better performance:

Index Options

Currently, Biscuit doesn’t expose tunable options. All optimizations are automatic.

Limitations and Trade-offs

What Biscuit Does NOT Support

Regular expressions - Only LIKE / ILIKE patterns with % and _
Locale-specific collations - String comparisons are byte-based
Amcanorder = false - Cannot provide ordered scans directly (but see below)

ORDER BY + LIMIT Behavior

Biscuit doesn’t support ordered index scans (amcanorder = false), BUT:

PostgreSQL’s planner handles this efficiently: sql SELECT * FROM table WHERE col LIKE '%pattern%' ORDER BY score LIMIT 10;

Execution plan: Limit -> Sort (cheap, small result set) -> Biscuit Index Scan (fast filtering)

Why this works: - Biscuit filters candidates extremely fast - Result set is small after filtering - Sorting 100-1000 rows in memory is negligible (<1ms) - Net result: Still much faster than pg_trgm with recheck overhead in many cases

Memory Usage

Biscuit stores bitmaps in memory: - Use REINDEX to rebuild if index grows too large

Write Performance

INSERT: Similar to B-tree (must update bitmaps)
UPDATE: Two operations (remove old, insert new)
DELETE: Marks as tombstone, batch cleanup at threshold

Comparison with pg_trgm

Feature	Biscuit	pg_trgm (GIN)
Wildcard patterns	✔ Native	✔ Approximate
Recheck overhead	✔ None (deterministic)	✗ Required
Regex support	✗ No	✔ Yes
Similarity search	✗ No	✔ Yes
ILIKE support	✔ Full	✔ Native

When to use Biscuit: - Wildcard-heavy LIKE / ILIKE queries (%, _) - Multi-column pattern matching - Need exact results (no false positives) - COUNT(*) / aggregate queries - High query volume, can afford memory

When to use pg_trgm: - Fuzzy/similarity search (word <-> pattern) - Regular expressions - Memory-constrained environments - Write-heavy workloads

Development

Build from Source

git clone https://github.com/Crystallinecore/biscuit.git
cd biscuit

# Development build with debug symbols
make clean
CFLAGS="-g -O0 -DDEBUG" make

# Run tests
make installcheck

# Install
sudo make install

Testing

# Unit tests
make installcheck

# Manual testing
psql -d testdb

CREATE EXTENSION biscuit;

-- Create test table
CREATE TABLE test (id SERIAL, name TEXT);
INSERT INTO test (name) VALUES ('hello'), ('world'), ('test');

-- Create index
CREATE INDEX idx_test ON test USING biscuit(name);

-- Test queries
EXPLAIN ANALYZE SELECT * FROM test WHERE name LIKE '%ell%';

Debugging

Enable PostgreSQL debug logging:

SET client_min_messages = DEBUG1;
SET log_min_messages = DEBUG1;

-- Now run queries to see Biscuit's internal logs
SELECT * FROM test WHERE name LIKE '%pattern%';

Contributing

Contributions are welcome! Please:

Fork the repository
Create a feature branch (git checkout -b feature/amazing)
Make your changes with tests
Submit a pull request

Areas for Contribution

[ ] Implement amcanorder for native sorted scans
[ ] Add statistics collection for better cost estimation
[ ] Support for more data types
[ ] Parallel index build
[ ] Index compression options

License

MIT License - See LICENSE file for details.

Author

Sivaprasad Murali - Email: sivaprasad.off@gmail.com - GitHub: @Crystallinecore

Acknowledgments

The PostgreSQL community for the extensible index access method (AM) framework
B-tree and pg_trgm indexes that shaped the design space for pattern matching in PostgreSQL
The CRoaring library for efficient compressed bitmap operations

Support

Issues: GitHub Issues
Discussions: GitHub Discussions
- Documentation: ReadTheDocs Page

Happy pattern matching! Grab a biscuit 🍪 when pg_trgm feels half-baked!

PGXN

PostgreSQL Extension Network

Contents