biscuit 1.0.6

This Release
biscuit 1.0.6
Date
Status
Stable
Other Releases
Abstract
A fast bitmap-based pattern matcher for PostgreSQL
Description
Biscuit provides efficient bitmap-based searching with deterministic results.
Released By
crystallinecore
License
MIT
Resources
Special Files
Tags

Extensions

biscuit 1.0.6

README

BISCUIT: Bitmap-Indexed String Comparison Using Intelligent Traversal

A PostgreSQL extension for accelerating wildcard pattern matching queries using bitmap indexing.

Updates in 1.0.6

  • Introduced a new bitmap that stores records having length greater than ‘n’, to reduce computational cost.

Technical Overview

Problem Statement

Traditional database indexes (B-tree, GiST, GIN) exhibit suboptimal performance for wildcard pattern matching queries, particularly when wildcards appear at the beginning of patterns. A query such as SELECT * FROM table WHERE column LIKE '%pattern%' typically requires a sequential scan.

Solution Architecture

BISCUIT implements a multi-dimensional bitmap index structure that decomposes strings into positional character mappings. Each character at each position is tracked using compressed bitmaps (Roaring Bitmaps), enabling efficient intersection operations for pattern matching.

Core Principle

Instead of scanning strings sequentially, BISCUIT:

  1. Pre-indexes all character positions (up to 256 characters)
  2. Uses bitmap intersections to find candidate records
  3. Filters results through tombstone bitmap (for deleted records)
  4. Returns matching records

Reality Check: This trades initial indexing time and memory for faster queries. It’s not magic—you’re paying upfront to make searches faster later.

Core Index Structures

  1. Positional Character Index: Maps each character to position-bitmap pairs

    • Forward positions: {0, 1, 2, ..., 255} from string start
    • Reverse positions: {-1, -2, -3, ..., -256} from string end
  2. Character Existence Cache: Union of all positional bitmaps per character for containment queries

  3. Length Index: Bitmap array indexed by string length

  4. Primary Key Hash Table: O(1) lookup structure mapping primary keys to internal indices

  5. Lazy Deletion Layer: Tombstone bitmap tracking deleted records with deferred cleanup


Key Features

  • Bitmap-based pattern matching: Optimized for wildcard searches
  • Hash table primary key lookup: Fast record access by primary key
  • Lazy deletion: Deferred cleanup of deleted records using tombstones
  • Incremental updates: Optimized path for similar string updates
  • Automatic maintenance: Optional triggers to keep index synchronized
  • Roaring Bitmap support: Uses CRoaring library when available

Limitations

  • In-memory only: Entire index must fit in RAM
  • Single column: Indexes one column at a time
  • Session-scoped: Index is rebuilt per session (not persistent)
  • String data only: Designed for text/varchar columns
  • Memory overhead: Bitmap indices can be memory-intensive for large datasets

Installation and Requirements

System Requirements

  • PostgreSQL Version: 11.0 or higher
  • Operating System: Linux, macOS, Windows (with MinGW)
  • Architecture: x86_64, ARM64
  • Memory: Minimum 512MB available RAM per index
  • Compiler: GCC 4.8+, Clang 3.9+, or MSVC 2017+

Optional Dependencies

  • CRoaring Library: For optimized Roaring Bitmap operations (recommended)
    • Installation: apt-get install libroaring-dev (Debian/Ubuntu)
    • Falls back to built-in implementation if unavailable

Installation

Prerequisites

  • PostgreSQL 16 or higher
  • C compiler (gcc/clang)
  • (Optional) CRoaring library for optimized bitmap operations

Build and Install

# With Roaring Bitmap support (recommended)
make HAVE_ROARING=1
sudo make install HAVE_ROARING=1

# Without Roaring (uses fallback implementation)
make
sudo make install

Enable Extension

CREATE EXTENSION biscuit;

Quick Start

-- One-command setup: builds index, creates query function, enables triggers
SELECT biscuit_setup('users', 'email', 'id');

-- Query with patterns
SELECT * FROM biscuit_match('%@gmail.com');
SELECT * FROM biscuit_match('admin%');
SELECT * FROM biscuit_match('%test%');

-- Get count only (faster than full query)
SELECT biscuit_match_count('%@gmail.com');

-- Check index status
SELECT biscuit_index_status();

Function Reference

Setup Functions

biscuit_setup(table_name, column_name, pk_column_name)

Complete initialization in one step. Builds the bitmap index, creates a type-safe biscuit_match() function, and enables automatic triggers.

SELECT biscuit_setup('users', 'email', 'id');

Parameters:

  • table_name (text): Name of the table to index
  • column_name (text): Name of the column to index (must be text/varchar)
  • pk_column_name (text): Name of the primary key column (default: ‘id’)

Returns: Success message with usage instructions

biscuit_build_index(table_name, column_name, pk_column_name)

Low-level function to build the bitmap index only, without creating query functions or triggers.

SELECT biscuit_build_index('users', 'email', 'id');

Returns: Boolean (true on success)

Query Functions

biscuit_match(pattern)

Returns complete table rows matching the pattern. This function is created by biscuit_setup() and is strongly-typed to match your table schema.

-- Get all matching rows
SELECT * FROM biscuit_match('%@gmail.com');

-- Filter and sort results
SELECT id, email FROM biscuit_match('%test%') 
WHERE created_at > '2024-01-01' 
ORDER BY id LIMIT 10;

Parameters:

  • pattern (text): Wildcard pattern using % (zero or more chars) and _ (exactly one char)

Returns: TABLE matching the indexed table’s schema

biscuit_match_count(pattern)

Returns only the count of matching records. This is more efficient than counting rows from biscuit_match().

SELECT biscuit_match_count('%@gmail.com');

Parameters:

  • pattern (text): Wildcard pattern

Returns: INTEGER count of matches

biscuit_match_keys(pattern)

Returns primary key and indexed value pairs as text. Useful for joining or when you only need these fields.

-- Get keys and values
SELECT * FROM biscuit_match_keys('%@gmail.com');

-- Cast primary key back to original type
SELECT pk::integer AS id, value FROM biscuit_match_keys('%pattern%');

Parameters:

  • pattern (text): Wildcard pattern

Returns: TABLE(pk text, value text)

Monitoring Functions

biscuit_index_status()

Returns detailed status information about the current index including table/column info, record counts, memory usage, CRUD statistics, and lazy deletion status.

SELECT biscuit_index_status();

Returns: TEXT formatted status report

biscuit_get_active_count()

Returns the number of live (non-deleted) records in the index.

SELECT biscuit_get_active_count();

Returns: INTEGER

biscuit_get_free_slots()

Returns the number of reusable slots from deleted records.

SELECT biscuit_get_free_slots();

Returns: INTEGER

biscuit_get_tombstone_count()

Returns the number of soft-deleted records awaiting cleanup.

SELECT biscuit_get_tombstone_count();

Returns: INTEGER

Maintenance Functions

biscuit_cleanup()

Manually triggers tombstone cleanup. Normally cleanup happens automatically when the tombstone threshold is reached, but you can force it with this function.

SELECT biscuit_cleanup();

Returns: TEXT cleanup report

biscuit_enable_triggers()

Activates automatic index maintenance triggers on the indexed table.

SELECT biscuit_enable_triggers();

Returns: TEXT confirmation message

biscuit_disable_triggers()

Disables automatic index updates. Useful for bulk operations to avoid per-row overhead. Remember to rebuild the index afterward.

SELECT biscuit_disable_triggers();

-- Perform bulk operations
INSERT INTO users ...;
UPDATE users ...;

-- Rebuild index
SELECT biscuit_build_index('users', 'email', 'id');
SELECT biscuit_enable_triggers();

Returns: TEXT confirmation message

biscuit_version()

Returns the version of the BISCUIT extension.

SELECT biscuit_version();

Returns: TEXT version string

Pattern Syntax

BISCUIT supports SQL LIKE pattern syntax:

  • % - Matches zero or more characters
  • _ - Matches exactly one character

Pattern Examples

-- Prefix match
SELECT * FROM biscuit_match('john%');           -- Starts with 'john'

-- Suffix match
SELECT * FROM biscuit_match('%@gmail.com');     -- Ends with '@gmail.com'

-- Contains match
SELECT * FROM biscuit_match('%test%');          -- Contains 'test'

-- Exact length
SELECT * FROM biscuit_match('____');            -- Exactly 4 characters

-- Single wildcard
SELECT * FROM biscuit_match('a_c');             -- 'a', any char, 'c'

-- Complex patterns
SELECT * FROM biscuit_match('admin%2024');      -- Starts 'admin', ends '2024'
SELECT * FROM biscuit_match('%user%@%');        -- Contains 'user', then '@'

CRUD Operations

Once triggers are enabled, the index automatically maintains itself:

-- INSERT: New records are automatically indexed
INSERT INTO users (id, email) VALUES (999, 'new@example.com');

-- UPDATE: Index is updated (uses incremental update when possible)
UPDATE users SET email = 'updated@example.com' WHERE id = 1;

-- DELETE: Records are soft-deleted with tombstones (O(1) operation)
DELETE FROM users WHERE id = 2;

-- Queries immediately reflect all changes
SELECT * FROM biscuit_match('%example.com');

Incremental Updates

The extension uses an incremental update optimization when:

  • Old and new values have the same length
  • String is at least 3 characters
  • Fewer than 20% of characters changed
  • Maximum 3 characters changed

This provides faster updates for similar strings (e.g., typo corrections, case changes).

Lazy Deletion

Deletions use tombstone marking for O(1) performance:

  1. Record is marked as tombstoned (bitmap operation)
  2. Primary key is removed from hash table
  3. Slot is added to free list for reuse
  4. Actual cleanup happens in batches when threshold is reached

Performance Considerations

When to Use BISCUIT

Good use cases:

  • Frequent wildcard searches on a string column
  • Dataset size fits comfortably in available memory
  • Pattern matching is a bottleneck in your application
  • Mixed CRUD operations with pattern queries

Not recommended for:

  • Very large datasets (multi-GB text columns)
  • Infrequent pattern matching queries
  • Exact match lookups (use regular B-tree index)
  • Limited memory environments

Bulk Operations

For bulk inserts/updates/deletes, temporarily disable triggers:

SELECT biscuit_disable_triggers();

-- Perform bulk operations
COPY users FROM 'data.csv';
UPDATE users SET email = lower(email);
DELETE FROM users WHERE inactive = true;

-- Rebuild and re-enable
SELECT biscuit_build_index('users', 'email', 'id');
SELECT biscuit_enable_triggers();

Memory Usage

Monitor memory consumption:

SELECT biscuit_index_status();  -- Shows memory usage

Memory usage depends on:

  • Number of records
  • String length distribution
  • Character diversity
  • Number of unique lengths

Tombstone Management

Monitor tombstone accumulation:

-- Check tombstone count
SELECT biscuit_get_tombstone_count();

-- Manual cleanup if needed
SELECT biscuit_cleanup();

Automatic cleanup triggers when tombstone count reaches threshold (default: 1000).

Architecture

Index Structure

BISCUIT maintains several bitmap indices:

  • Position index: Character presence at each position (forward)
  • Negative position index: Character presence from end (backward)
  • Length index: Records grouped by string length
  • Character cache: All positions where each character appears
  • Primary key hash table: O(1) lookup from PK to index position

Lazy Deletion

The lazy deletion system:

  • Marks deleted records in a tombstone bitmap
  • Filters tombstones during query execution
  • Batches actual cleanup operations
  • Reuses deleted slots for new inserts

Query Optimization

Pattern matching is optimized through:

  • Fast path for simple patterns (single character, prefix, suffix)
  • Length filtering to reduce candidate set
  • Windowed position matching for complex patterns
  • Bitmap intersection/union operations

Limitations and Caveats

  1. Memory requirement: Entire index must fit in RAM
  2. Session scope: Index is not persistent across PostgreSQL restarts
  3. Single column: One index per column (no composite indices)
  4. No crash recovery: Index must be rebuilt after crashes
  5. Character limit: Strings longer than 256 characters are truncated
  6. No case-insensitive: Pattern matching is case-sensitive
  7. Primary key required: Must have a unique primary key column

Troubleshooting

Index not found

ERROR: No index found

Solution: Build the index first:

SELECT biscuit_setup('table_name', 'column_name', 'pk_column');

Out of memory

ERROR: out of memory

Solution: Dataset too large for available memory. Consider:

  • Increasing PostgreSQL memory limits
  • Indexing a subset of data
  • Using partitioned tables

Slow queries after many deletes

Solution: High tombstone count may slow queries. Trigger cleanup:

SELECT biscuit_cleanup();

Trigger conflicts

Solution: If you have other triggers, ensure proper ordering or disable BISCUIT triggers during maintenance windows.

License

BISCUIT is released under the PostgreSQL License.

Version

Current version: 1.0.6-Biscuit

Use SELECT biscuit_version(); to check your installed version.

Summary

BISCUIT makes wildcard queries faster by trading memory and build time for query performance.

It’s not magic—it’s a bitmap index that:

  • Pre-computes character positions
  • Uses fast bitmap operations instead of string scanning
  • Still requires O(n) time to return n results
  • Requires memory proportional to your data size

Contributors

BISCUIT is developed and maintained by Sivaprasad Murali .

Support and Contact

Issues: https://github.com/crystallinecore/biscuit/issues

Discussions: https://github.com/crystallinecore/biscuit/discussions

When pg_trgm feels half-baked, grab a BISCUIT 🍪