biscuit 2.3.1

This Release
biscuit 2.3.1
Date
Status
Stable
Latest Testing
biscuit 2.3.0 —
Other Releases
Abstract
A bitmap-based index for wildcard pattern matching
Description
Biscuit is a bitmap-based index access method for accelerating LIKE and ILIKE pattern matching in PostgreSQL. The project is actively developed and seeking broader testing across diverse workloads.
Released By
crystallinecore
License
MIT
Resources
Special Files
Tags

Extensions

biscuit 2.3.0

Documentation

CHANGELOG
Biscuit Index Extension - Changelog

README

Biscuit - High-Performance Pattern Matching Index for PostgreSQL

License: MIT PostgreSQL: 16+ Read the Docs

Biscuit is a specialized PostgreSQL index access method (IAM) designed for blazing-fast pattern matching on LIKE and ILIKE queries, with native support for multi-column searches. It eliminates the recheck overhead of trigram indexes while delivering significant performance improvements on wildcard-heavy queries. It stands for Bitmap Indexed Searching with Comprehensive Union and Intersection Techniques.


Stability Notice

This extension is currently under active development and has not yet received the level of testing and operational experience expected of production-ready software.

Users are encouraged to evaluate the extension thoroughly in development and staging environments before considering deployment in production systems. In particular, testing should include representative datasets, workloads, upgrade procedures, backup and recovery workflows, and performance validation.

Although the extension is intended to operate safely and reliably, defects or unexpected behavior may still be present. As with any new database component, appropriate backups and validation procedures should be maintained before use.

At this stage, the extension is best suited for evaluation, experimentation, and non-critical workloads. Production deployment should be undertaken only after careful testing and assessment of its suitability for the intended environment.


Version 2.3.0 — Bagel

New Features

  • Parallel index scan support: Biscuit now integrates with parallel query execution in PostgreSQL, allowing Gather plans to distribute work across workers without duplicate results.

  • Pre-lowercased cache for multi-column indexes: Added column_data_cache_lower to accelerate ILIKE queries by eliminating repeated string normalization during scans.

  • LIKE / ILIKE matching: Pattern matching now correctly handles %, _, escape sequences, and complex wildcard combinations.

  • Version updated to 2.3.0 - Bagel.

Bug Fixes

  • Fixed a crash that could occur when INSERT operations followed SELECT queries on partially loaded indexes.

  • Fixed an issue where newly inserted rows could become invisible to subsequent queries due to stale session cache entries.

  • Fixed multi-column indexes failing to update length-based bitmap structures during inserts.

  • Fixed several memory initialization issues during cache growth that could cause incorrect results or instability.

  • Fixed insert operations losing in-memory changes after relcache invalidation.

  • Prevented the planner from selecting Biscuit for unqualified scans where no index predicates are present.

  • Fixed single-column scans using incorrect query paths for LIKE and ILIKE operations.

  • Fixed an issue where indexes could remain in a cold state even after background preloading had completed.

  • Improved cache update behavior to avoid unnecessary remove-and-reinsert cycles during inserts.

Performance Improvements

  • Eliminated per-row allocations during ILIKE fallback scans by using pre-lowercased caches.

  • Simplified TID collection by consolidating scan paths into a single parallel-aware implementation.

Internal Changes

  • Reworked the parallel scan infrastructure around a shared-memory descriptor model and added support for PostgreSQL’s parallel index scan callbacks.

  • Removed unused LIMIT-tracking logic that was ineffective with the PostgreSQL access method API.


Installation

Requirements

  • Build tools: gcc, make, pg_config
  • Recommended: CRoaring library for enhanced performance

From Source

# Clone repository
git clone https://github.com/Crystallinecore/biscuit.git
cd biscuit

# Build and install
make
sudo make install

# Enable in PostgreSQL
psql -d your_database -c "CREATE EXTENSION biscuit;"

From PGXN

pgxn install biscuit
psql -d your_database -c "CREATE EXTENSION biscuit;"

Quick Start

Basic Usage

-- Create a Biscuit index
CREATE INDEX idx_users_name ON users USING biscuit(name);

-- Query with wildcard patterns
SELECT * FROM users WHERE name LIKE '%john%';
SELECT * FROM users WHERE name NOT LIKE 'a%b%c';
SELECT COUNT(*) FROM users WHERE name LIKE '%test%';

Multi-Column Indexes

-- Create multi-column index
CREATE INDEX idx_products_search 
ON products USING biscuit(name, description, category);

-- Multi-column query (optimized automatically)
SELECT * FROM products 
WHERE name LIKE '%widget%' 
  AND description LIKE '%blue%'
  AND category LIKE 'electronics%'
LIMIT 10;

How It Works

Core Concept: Bitmap Position Indices

Biscuit builds the following bitmaps for every string:

1. Positive Indices (Forward)

Tracks which records have character c at position p:

String: "Hello"
Bitmaps:
  H@0 → {record_ids...}
  e@1 → {record_ids...}
  l@2 → {record_ids...}
  l@3 → {record_ids...}
  o@4 → {record_ids...}

2. Negative Indices (Backward)

Tracks which records have character c at position -p from the end:

String: "Hello"
Bitmaps:
  o@-1 → {record_ids...}  (last char)
  l@-2 → {record_ids...}  (second to last)
  l@-3 → {record_ids...}
  e@-4 → {record_ids...}
  H@-5 → {record_ids...}

3. Positive Indices (Case-insensitive)

Tracks which records have character c at position p:

String: "Hello"
Bitmaps:
  h@0 → {record_ids...}
  e@1 → {record_ids...}
  l@2 → {record_ids...}
  l@3 → {record_ids...}
  o@4 → {record_ids...}

4. Negative Indices (Case-insensitive)

Tracks which records have character c at position -p from the end:

String: "Hello"
Bitmaps:
  o@-1 → {record_ids...}  (last char)
  l@-2 → {record_ids...}  (second to last)
  l@-3 → {record_ids...}
  e@-4 → {record_ids...}
  h@-5 → {record_ids...}

5. Length Bitmaps

Two types for fast length filtering: - Exact length: length[5] → all 5-character strings - Minimum length: length_ge[3] → all strings ≥ 3 characters


Pattern Matching Algorithm

Example: LIKE 'abc%def'

Step 1: Parse pattern into parts Parts: ["abc", "def"] Starts with %: NO Ends with %: NO

Step 2: Match first part as prefix sql -- "abc" must start at position 0 Candidates = pos[a@0] ∩ pos[b@1] ∩ pos[c@2]

Step 3: Match last part at end (negative indexing) sql -- "def" must end at string end Candidates = Candidates ∩ neg[f@-1] ∩ neg[e@-2] ∩ neg[d@-3]

Step 4: Apply length constraint sql -- String must be at least 6 chars (abc + def) Candidates = Candidates ∩ length_ge[6]

Result: Exact matches, zero false positives


Why It’s Fast

1. Pure Bitmap Operations

// Traditional approach (pg_trgm)
for each trigram in pattern:
    candidates = scan_trigram_index(trigram)
    for each candidate:
        if !heap_fetch_and_recheck(candidate):  // SLOW: Random I/O
            remove candidate

// Biscuit approach
for each character at position:
    candidates &= bitmap[char][pos]  // FAST: In-memory AND
// No recheck needed!

2. Roaring Bitmaps

Compressed bitmap representation: - Sparse data: array of integers - Dense data: bitset - Automatic conversion for optimal memory

3. Negative Indexing Optimization

-- Pattern: '%xyz'
-- Traditional: Scan all strings, check suffix
-- Biscuit: Direct lookup in neg[z@-1] ∩ neg[y@-2] ∩ neg[x@-3]

12 Performance Optimizations

1. Skip Wildcard Intersections

// Pattern: "a_c" (underscore = any char)
// OLD: Intersect all 256 chars at position 1
// NEW: Skip position 1 entirely, only check a@0 and c@2

2. Early Termination on Empty

result = bitmap[a][0];
result &= bitmap[b][1];
if (result.empty()) return empty;  // Don't process remaining chars

3. Avoid Redundant Bitmap Copies

// OLD: Copy bitmap for every operation
// NEW: Operate in-place, copy only when branching

4. Optimized Single-Part Patterns

Fast paths for common cases: - Exact: 'abc' → Check position 0-2 and length = 3 - Prefix: 'abc%' → Check position 0-2 and length ≥ 3 - Suffix: '%xyz' → Check negative positions -3 to -1 and length ≥ 3 - Substring: '%abc%' → Check all positions, OR results

5. Skip Unnecessary Length Operations

// Pure wildcard patterns
if (pattern == "%%%___%%")  // 3 underscores
    return length_ge[3];     // No character checks needed!

6. TID Sorting for Sequential Heap Access

// Sort TIDs by (block_number, offset) before returning
// Converts random I/O into sequential I/O
// Uses radix sort for >5000 TIDs, quicksort for smaller sets

7. Batch TID Insertion

// For bitmap scans, insert TIDs in chunks
for (i = 0; i < num_results; i += 10000) {
    tbm_add_tuples(tbm, &tids[i], batch_size, false);
}

8. Direct Roaring Iteration

// OLD: Convert bitmap to array, then iterate
// NEW: Direct iterator, no intermediate allocation
roaring_uint32_iterator_t *iter = roaring_create_iterator(bitmap);
while (iter->has_value) {
    process(iter->current_value);
    roaring_advance_uint32_iterator(iter);
}

9. Batch Cleanup on Threshold

// After 1000 deletes, clean tombstones from all bitmaps
if (tombstone_count >= 1000) {
    for each bitmap:
        bitmap &= ~tombstones;  // Batch operation
    tombstones.clear();
}

10. Aggregate Query Detection

// COUNT(*), EXISTS, etc. don't need sorted TIDs
if (!scan->xs_want_itup) {
    skip_sorting = true;  // Save sorting time
}

11. LIMIT-Aware TID Collection

// If LIMIT 10 in query, don't collect more than needed
if (limit_hint > 0 && collected >= limit_hint)
    break;  // Early termination

12. Multi-Column Query Optimization

Predicate Reordering

Analyzes each column’s pattern and executes in order of selectivity:

-- Query:
WHERE name LIKE '%common%'           -- Low selectivity
  AND sku LIKE 'PROD-2024-%'         -- High selectivity (prefix)
  AND description LIKE '%rare_word%' -- Medium selectivity

-- Execution order (Biscuit automatically reorders):
1. sku LIKE 'PROD-2024-%'         (PREFIX, priority=20, selectivity=0.02)
2. description LIKE '%rare_word%' (SUBSTRING, priority=35, selectivity=0.15)
3. name LIKE '%common%'           (SUBSTRING, priority=55, selectivity=0.60)

Selectivity scoring formula: score = 1.0 / (concrete_chars + 1) - (underscore_count × 0.05) + (partition_count × 0.15) - (anchor_strength / 200)

Priority tiers: 1. 0-10: Exact matches, many underscores 2. 10-20: Non-% patterns with underscores 3. 20-30: Strong anchored patterns (prefix/suffix) 4. 30-40: Weak anchored patterns 5. 40-50: Multi-partition patterns 6. 50-60: Substring patterns (lowest priority)


Benchmarking

Setup Test Data

-- Create 1M row test table
CREATE TABLE benchmark (
    id SERIAL PRIMARY KEY,
    name TEXT,
    description TEXT,
    category TEXT,
    score FLOAT
);

INSERT INTO benchmark (name, description, category, score)
SELECT 
    'Name_' || md5(random()::text),
    'Description_' || md5(random()::text),
    'Category_' || (random() * 100)::int,
    random() * 1000
FROM generate_series(1, 1000000);

-- Create indexes
CREATE INDEX idx_trgm ON benchmark 
    USING gin(name gin_trgm_ops, description gin_trgm_ops);

CREATE INDEX idx_biscuit ON benchmark 
    USING biscuit(name, description, category);

ANALYZE benchmark;

Run Benchmarks

-- Single column, simple pattern
EXPLAIN ANALYZE
SELECT * FROM benchmark WHERE name LIKE '%abc%' LIMIT 100;

-- Multi-column, complex pattern
EXPLAIN ANALYZE
SELECT * FROM benchmark 
WHERE name LIKE '%a%b' 
  AND description LIKE '%bc%cd%'
ORDER BY score DESC 
LIMIT 10;

-- Aggregate query (COUNT)
EXPLAIN ANALYZE
SELECT COUNT(*) FROM benchmark 
WHERE name LIKE 'a%l%' 
  AND category LIKE 'f%d';

-- Complex multi-part pattern
EXPLAIN ANALYZE
SELECT * FROM benchmark 
WHERE description LIKE 'u%dc%x'
LIMIT 50;

View Index Statistics

-- Show internal statistics
SELECT biscuit_index_stats('idx_biscuit'::regclass);

Output:

----------------------------------------------------
 Biscuit Index Statistics (FULLY OPTIMIZED)        +
 ==========================================        +
 Index: idx_biscuit                                +
 Active records: 1000002                           +
 Total slots: 1000002                              +
 Free slots: 0                                     +
 Tombstones: 0                                     +
 Max length: 44                                    +
 ------------------------                          +
 CRUD Statistics:                                  +
   Inserts: 0                                      +
   Updates: 0                                      +
   Deletes: 0                                      +
 ------------------------                          +
 Active Optimizations:                             +
   ✓ 1. Skip wildcard intersections                +
   ✓ 2. Early termination on empty                 +
   ✓ 3. Avoid redundant copies                     +
   ✓ 4. Optimized single-part patterns             +
   ✓ 5. Skip unnecessary length ops                +
   ✓ 6. TID sorting for sequential I/O             +
   ✓ 7. Batch TID insertion                        +
   ✓ 8. Direct bitmap iteration                    +
   ✓ 9. Parallel bitmap scan support               +
   ✓ 10. Batch cleanup on threshold                +
   ✓ 11. Skip sorting for bitmap scans (aggregates)+
   ✓ 12. LIMIT-aware TID collection                +

Use Cases

1. Full-Text Search Applications

-- E-commerce product search
CREATE INDEX idx_products ON products 
    USING biscuit(name, brand, description);

SELECT * FROM products 
WHERE name LIKE '%laptop%' 
  AND brand LIKE 'ABC%'
  AND description LIKE '%gaming%'
ORDER BY price DESC 
LIMIT 20;

2. Log Analysis

-- Search error logs
CREATE INDEX idx_logs ON logs 
    USING biscuit(message, source, level);

SELECT * FROM logs 
WHERE message LIKE '%ERROR%connection%timeout%'
  AND source LIKE 'api.%'
  AND timestamp > NOW() - INTERVAL '1 hour'
LIMIT 100;

3. Customer Support / CRM

-- Search tickets by multiple fields
CREATE INDEX idx_tickets ON tickets 
    USING biscuit(subject, description, customer_name);

SELECT * FROM tickets 
WHERE subject LIKE '%refund%'
  AND customer_name LIKE 'John%'
  AND status = 'open';

4. Code Search / Documentation

-- Search code repositories
CREATE INDEX idx_files ON code_files 
    USING biscuit(filename, content, author);

SELECT * FROM code_files 
WHERE filename LIKE '%.py'
  AND content LIKE '%def%parse%json%'
  AND author LIKE 'team-%';

5. Analytics with Aggregates

-- Fast COUNT queries (no sorting overhead)
CREATE INDEX idx_events ON events 
    USING biscuit(event_type, user_agent, referrer);

SELECT COUNT(*) FROM events 
WHERE event_type LIKE 'click%'
  AND user_agent LIKE '%Mobile%'
  AND referrer LIKE '%google%';

Configuration

Build Options

Enable CRoaring for better performance:

Index Options

Currently, Biscuit doesn’t expose tunable options. All optimizations are automatic.


Limitations and Trade-offs

What Biscuit Does NOT Support

  1. Regular expressions - Only LIKE / ILIKE patterns with % and _
  2. Locale-specific collations - String comparisons are byte-based
  3. Amcanorder = false - Cannot provide ordered scans directly (but see below)

ORDER BY + LIMIT Behavior

Biscuit doesn’t support ordered index scans (amcanorder = false), BUT:

PostgreSQL’s planner handles this efficiently: sql SELECT * FROM table WHERE col LIKE '%pattern%' ORDER BY score LIMIT 10;

Execution plan: Limit -> Sort (cheap, small result set) -> Biscuit Index Scan (fast filtering)

Why this works: - Biscuit filters candidates extremely fast - Result set is small after filtering - Sorting 100-1000 rows in memory is negligible (<1ms) - Net result: Still much faster than pg_trgm with recheck overhead in many cases

Memory Usage

Biscuit stores bitmaps in memory: - Use REINDEX to rebuild if index grows too large

Write Performance

  • INSERT: Similar to B-tree (must update bitmaps)
  • UPDATE: Two operations (remove old, insert new)
  • DELETE: Marks as tombstone, batch cleanup at threshold

Comparison with pg_trgm

Feature Biscuit pg_trgm (GIN)
Wildcard patterns ✔ Native ✔ Approximate
Recheck overhead ✔ None (deterministic) ✗ Required
Regex support ✗ No ✔ Yes
Similarity search ✗ No ✔ Yes
ILIKE support ✔ Full ✔ Native

When to use Biscuit: - Wildcard-heavy LIKE / ILIKE queries (%, _) - Multi-column pattern matching - Need exact results (no false positives) - COUNT(*) / aggregate queries - High query volume, can afford memory

When to use pg_trgm: - Fuzzy/similarity search (word <-> pattern) - Regular expressions - Memory-constrained environments - Write-heavy workloads


Development

Build from Source

git clone https://github.com/Crystallinecore/biscuit.git
cd biscuit

# Development build with debug symbols
make clean
CFLAGS="-g -O0 -DDEBUG" make

# Run tests
make installcheck

# Install
sudo make install

Testing

# Unit tests
make installcheck

# Manual testing
psql -d testdb

CREATE EXTENSION biscuit;

-- Create test table
CREATE TABLE test (id SERIAL, name TEXT);
INSERT INTO test (name) VALUES ('hello'), ('world'), ('test');

-- Create index
CREATE INDEX idx_test ON test USING biscuit(name);

-- Test queries
EXPLAIN ANALYZE SELECT * FROM test WHERE name LIKE '%ell%';

Debugging

Enable PostgreSQL debug logging:

SET client_min_messages = DEBUG1;
SET log_min_messages = DEBUG1;

-- Now run queries to see Biscuit's internal logs
SELECT * FROM test WHERE name LIKE '%pattern%';

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing)
  3. Make your changes with tests
  4. Submit a pull request

Areas for Contribution

  • [ ] Implement amcanorder for native sorted scans
  • [ ] Add statistics collection for better cost estimation
  • [ ] Support for more data types
  • [ ] Parallel index build
  • [ ] Index compression options

License

MIT License - See LICENSE file for details.


Author

Sivaprasad Murali - Email: sivaprasad.off@gmail.com - GitHub: @Crystallinecore


Acknowledgments

  • The PostgreSQL community for the extensible index access method (AM) framework
  • B-tree and pg_trgm indexes that shaped the design space for pattern matching in PostgreSQL
  • The CRoaring library for efficient compressed bitmap operations

Support


Happy pattern matching! Grab a biscuit 🍪 when others feel half-baked!