pg_kazsearch

Release: pg_kazsearch 2.0.0
Status: Stable
Abstract: Kazakh full-text search dictionary for PostgreSQL
Description: The first PostgreSQL text search extension for the Kazakh language. BFS suffix stripping with vowel harmony validation, consonant mutation repair, 21K-stem lexicon from Apertium-kaz, and stopword filtering. Built with pgrx.
Released By: darkhanakh
License: LGPL 3.0

Extensions

pg_kazsearch 2.0.0
Kazakh stemmer and stopword dictionary for PostgreSQL FTS


pg_kazsearch

License: LGPL v3 | PostgreSQL: 16–18

The first PostgreSQL full-text search extension for the Kazakh language.

Kazakh is heavily agglutinative: a single word like мектептерімізде carries plural, possessive, and locative suffixes that must all be stripped to reach the root мектеп. No existing PostgreSQL or Elasticsearch analyzer handles this. pg_kazsearch fills that gap with a Rust extension (via pgrx) that plugs directly into PostgreSQL’s text search pipeline.

CREATE EXTENSION pg_kazsearch;

SELECT to_tsvector('kazakh_cfg', 'президенттің жарлығы');
-- 'жарлық':2 'президент':1
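To make the suffix chain in мектептерімізде concrete, here is a minimal Rust sketch that builds the word from its root and three suffixes (тер plural, іміз possessive, де locative) and then peels them off again in reverse order. The function names `agglutinate` and `strip_in_order` are hypothetical and are not part of pg_kazsearch; the real stemmer has to discover the suffixes itself rather than being told which ones to remove.

```rust
/// Builds a word by appending suffixes to a root, left to right.
/// Hypothetical helper for illustration only.
fn agglutinate(root: &str, suffixes: &[&str]) -> String {
    let mut word = String::from(root);
    for suffix in suffixes {
        word.push_str(suffix);
    }
    word
}

/// Removes the given suffixes from the end of `word`, last suffix first.
/// Returns None as soon as one of them does not match.
fn strip_in_order<'a>(word: &'a str, suffixes: &[&str]) -> Option<&'a str> {
    let mut rest = word;
    for suffix in suffixes.iter().rev() {
        rest = rest.strip_suffix(*suffix)?;
    }
    Some(rest)
}
```

Calling `agglutinate("мектеп", &["тер", "іміз", "де"])` yields the surface form from the paragraph above, and `strip_in_order` on that form with the same suffix list returns the root.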

Install

Pre-built package (Debian/Ubuntu)

Download the .deb for your PostgreSQL version from GitHub Releases:

# Example: PostgreSQL 18 on amd64 (adjust the file name to the package version listed on the Releases page)
wget https://github.com/darkhanakh/pg-kazsearch/releases/latest/download/postgresql-18-pg-kazsearch_0.1.0_amd64.deb
sudo dpkg -i postgresql-18-pg-kazsearch_0.1.0_amd64.deb

Then in psql:

CREATE EXTENSION pg_kazsearch;

Docker

Use the pre-built image as a drop-in replacement for postgres:

# docker-compose.yml
services:
  db:
    image: ghcr.io/darkhanakh/pg-kazsearch:18

Or add to your existing Dockerfile:

FROM ghcr.io/darkhanakh/pg-kazsearch:18 AS kazsearch
FROM postgres:18

COPY --from=kazsearch /usr/share/postgresql/18/extension/pg_kazsearch* /usr/share/postgresql/18/extension/
COPY --from=kazsearch /usr/lib/postgresql/18/lib/pg_kazsearch* /usr/lib/postgresql/18/lib/
COPY --from=kazsearch /usr/share/postgresql/18/tsearch_data/kaz_* /usr/share/postgresql/18/tsearch_data/

From source

# Requires: Rust toolchain, cargo-pgrx, postgresql-server-dev
cargo install --locked cargo-pgrx --version "=0.17.0"
cargo pgrx init --pg18 $(which pg_config)

git clone https://github.com/darkhanakh/pg-kazsearch.git
cd pg-kazsearch
cargo pgrx install --release -p pg_kazsearch

# Install lexicon and stopwords (may require sudo, depending on who owns the share directory)
cp data/tsearch_data/kaz_stems.dict "$(pg_config --sharedir)/tsearch_data/"
cp data/tsearch_data/kaz_stopwords.stop "$(pg_config --sharedir)/tsearch_data/"

Usage

The extension creates everything automatically — a text search template, dictionaries, and a ready-to-use configuration called kazakh_cfg:

CREATE EXTENSION pg_kazsearch;

-- Stem individual words
SELECT ts_lexize('pg_kazsearch_dict', 'алмаларымыздағы');
-- {алма}

-- Build tsvectors
SELECT to_tsvector('kazakh_cfg', 'мектептеріміздегі оқушылардың');
-- 'мектеп':1 'оқушы':2

-- Add FTS to a table
ALTER TABLE articles ADD COLUMN fts tsvector
    GENERATED ALWAYS AS (
        setweight(to_tsvector('kazakh_cfg', title), 'A') ||
        setweight(to_tsvector('kazakh_cfg', body), 'B')
    ) STORED;

CREATE INDEX idx_fts ON articles USING GIN (fts);

-- Search
SELECT title FROM articles
WHERE fts @@ websearch_to_tsquery('kazakh_cfg', 'президенттің жарлығы')
ORDER BY ts_rank_cd(fts, websearch_to_tsquery('kazakh_cfg', 'президенттің жарлығы')) DESC
LIMIT 10;

Tuning weights

Penalty weights are tunable at runtime without restarting PostgreSQL:

ALTER TEXT SEARCH DICTIONARY pg_kazsearch_dict (w_deriv = 3.5, w_short_char = 100.0);

Benchmarks

Tested on 2,999 Kazakh news articles with 9,048 evaluation queries:

[Figures: Retrieval Quality, Relative Improvement, Query Latency]

Metric          pg_kazsearch   pg_trgm   Improvement
Recall@10       0.784          0.635     +23%
MRR@10          0.712          0.566     +26%
nDCG@10         0.729          0.582     +25%
Query latency   0.5 ms         1.4 ms    2.8x faster
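
For readers unfamiliar with the metrics in the table, the following Rust sketch shows one common way to compute them for a single query. It assumes binary relevance and a log2 rank discount for nDCG; the benchmark's exact definitions are not stated here, so treat these as textbook versions rather than the evaluation code. All function names are hypothetical. The reported MRR@10 would be the mean of the per-query reciprocal rank over all evaluation queries.

```rust
/// Recall@k: share of the relevant documents that appear in the top k results.
fn recall_at_k(ranked: &[u32], relevant: &[u32], k: usize) -> f64 {
    if relevant.is_empty() {
        return 0.0;
    }
    let hits = ranked.iter().take(k).filter(|d| relevant.contains(*d)).count();
    hits as f64 / relevant.len() as f64
}

/// Reciprocal rank within the top k: 1 / (1-based rank of the first relevant hit),
/// or 0.0 when no relevant document is in the top k.
fn reciprocal_rank_at_k(ranked: &[u32], relevant: &[u32], k: usize) -> f64 {
    ranked
        .iter()
        .take(k)
        .position(|d| relevant.contains(d))
        .map_or(0.0, |i| 1.0 / (i as f64 + 1.0))
}

/// nDCG@k with binary gains: DCG of the ranking divided by the DCG of an ideal ranking.
fn ndcg_at_k(ranked: &[u32], relevant: &[u32], k: usize) -> f64 {
    let dcg: f64 = ranked
        .iter()
        .take(k)
        .enumerate()
        .filter(|(_, d)| relevant.contains(*d))
        .map(|(i, _)| 1.0 / (i as f64 + 2.0).log2())
        .sum();
    let ideal: f64 = (0..relevant.len().min(k))
        .map(|i| 1.0 / (i as f64 + 2.0).log2())
        .sum();
    if ideal == 0.0 {
        0.0
    } else {
        dcg / ideal
    }
}
```

Each function takes the ranked document IDs returned for a query and the set of IDs judged relevant for it.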

Stemmer examples

Input              Output      Stripped
мектептерімізде    мектеп      plural + possessive + locative
президенттерінің   президент   plural + possessive + genitive
өзгеруі            өзгеру      verbal noun possessive
берді              бер         past tense
экономикалық       экономика   derivational adjective

Architecture

┌────────────────────────────────────────────────────┐
│                  Cargo Workspace                   │
│                                                    │
│  core/         Pure Rust stemmer (no PG deps)      │
│  pg_ext/       pgrx PostgreSQL extension           │
│  cli/          CLI tool (kazsearch)                │
│  elastic/      Elasticsearch plugin (planned)      │
└────────────────────────────────────────────────────┘

The stemmer algorithm:

  • BFS suffix stripper — breadth-first search over layered morphological rules (predicate, case, possessive, plural, derivational for nouns; person, tense, negation, voice for verbs), with vowel harmony validation
  • Penalty scoring — candidates scored by syllable count, suffix weakness, derivational depth, and lexicon hits
  • Lexicon — 21,863 POS-tagged stems from Apertium-kaz for overstemming protection
  • Stem repair — consonant mutation reversal (б→п, г→к, ғ→қ), vowel elision restoration, lexicon-based vowel append
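
The steps above can be sketched in a few lines of Rust. This is a toy model under stated assumptions, not the code in core/: the suffix list and lexicon are tiny hypothetical samples, layering and vowel harmony checks are omitted, and penalty scoring is replaced by "return the first candidate found in the lexicon". It only illustrates how breadth-first stripping, a lexicon check, and consonant mutation reversal fit together.

```rust
use std::collections::{HashSet, VecDeque};

// Hypothetical toy suffix inventory; the real rules are layered and far larger.
const SUFFIXES: [&str; 9] = ["лар", "тер", "ымыз", "іміз", "да", "де", "тің", "ғы", "ы"];

/// Hypothetical four-word lexicon standing in for the Apertium-kaz stem list.
fn toy_lexicon() -> HashSet<&'static str> {
    ["мектеп", "президент", "жарлық", "алма"].iter().copied().collect()
}

/// Reverses a stem-final consonant mutation (б→п, г→к, ғ→қ), if one is present.
fn repair(stem: &str) -> Option<String> {
    let last = stem.chars().last()?;
    let fixed = match last {
        'б' => 'п',
        'г' => 'к',
        'ғ' => 'қ',
        _ => return None,
    };
    let mut out = stem[..stem.len() - last.len_utf8()].to_string();
    out.push(fixed);
    Some(out)
}

/// Looks a candidate up in the lexicon, first as-is and then after repair.
fn lookup(candidate: &str, lexicon: &HashSet<&str>) -> Option<String> {
    if lexicon.contains(candidate) {
        return Some(candidate.to_string());
    }
    repair(candidate).filter(|r| lexicon.contains(r.as_str()))
}

/// Breadth-first suffix stripping: candidates with fewer stripped suffixes are tried first.
/// Falls back to the unchanged word when no candidate is in the lexicon.
fn stem(word: &str, lexicon: &HashSet<&str>) -> String {
    let mut queue: VecDeque<String> = VecDeque::new();
    let mut seen: HashSet<String> = HashSet::new();
    queue.push_back(word.to_string());
    while let Some(candidate) = queue.pop_front() {
        if let Some(hit) = lookup(&candidate, lexicon) {
            return hit;
        }
        for suffix in SUFFIXES.iter() {
            if let Some(rest) = candidate.strip_suffix(*suffix) {
                // Keep at least two characters and skip candidates already queued.
                if rest.chars().count() >= 2 && seen.insert(rest.to_string()) {
                    queue.push_back(rest.to_string());
                }
            }
        }
    }
    word.to_string()
}
```

With the toy lexicon, `stem("жарлығы", &toy_lexicon())` strips ы, finds that жарлығ is not a known stem, and repairs the final ғ to қ to reach жарлық.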

CLI

The kazsearch CLI works standalone without PostgreSQL:

cargo build -p kazsearch-cli --release

# Stem a word
kazsearch stem алмаларымыздағы
# алмаларымыздағы    алма

# Morphological analysis
kazsearch analyze мектептеріміздегі

# Benchmark
kazsearch bench wordlist.txt

# Validate lexicon
kazsearch lexicon validate data/tsearch_data/kaz_stems.dict

Development

# Start dev environment
just up

# Build and install extension into running container
just build

# Reload extension (DROP + CREATE)
just reload

# Run core tests
just test-core

# Smoke test via SQL
just test-ext

# Build CLI
just cli

Contributing

  1. Fork the repo and create a feature branch
  2. Make your changes — stemmer logic lives in core/src/, extension glue in pg_ext/src/lib.rs
  3. Run cargo test -p kazsearch-core --test stem_tests to verify stemmer correctness
  4. Run just up && just reload && just test-ext to verify the extension works end-to-end
  5. Open a PR

Key things to know:

  • Penalty weights in core/src/explore.rs are empirically tuned via CMA-ES — changing one can affect many test cases
  • Layer guards encode real morphotactic constraints, not heuristics
  • Vowel harmony (back/front) is mandatory for suffix validation
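
As a rough illustration of the vowel harmony check, the Rust sketch below classifies vowels as back or front and accepts a suffix only when its first classified vowel agrees with the last classified vowel of the stem. It is a simplification under stated assumptions: the type and function names are hypothetical, and letters such as и and у are treated as neutral, which may differ from how core/src/ handles them.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Harmony {
    Back,
    Front,
}

/// Classifies a single Kazakh vowel; consonants and neutral letters return None.
fn vowel_class(c: char) -> Option<Harmony> {
    match c {
        'а' | 'о' | 'ұ' | 'ы' => Some(Harmony::Back),
        'ә' | 'е' | 'ө' | 'ү' | 'і' => Some(Harmony::Front),
        _ => None,
    }
}

/// Harmony class of a stem, taken from its last classified vowel.
fn stem_harmony(stem: &str) -> Option<Harmony> {
    stem.chars().rev().find_map(vowel_class)
}

/// Harmony class of a suffix, taken from its first classified vowel.
fn suffix_harmony(suffix: &str) -> Option<Harmony> {
    suffix.chars().find_map(vowel_class)
}

/// Accepts the suffix when both classes agree, or when either side has no classified vowel.
fn suffix_agrees(stem: &str, suffix: &str) -> bool {
    match (stem_harmony(stem), suffix_harmony(suffix)) {
        (Some(a), Some(b)) => a == b,
        _ => true,
    }
}
```

Under this check мектеп accepts the front-vowel plural тер but rejects the back-vowel variant тар, while алма accepts лар and rejects лер.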

References

  • Krippes, K.A. (1993). Kazakh (Qazaq-) Grammatical Sketch with Affix List. ERIC.
  • Washington, J., Salimzyanov, I., Tyers, F. (2014). Finite-state morphological transducers for three Kypchak languages. LREC.
  • Makhambetov, O. et al. (2015). Data-driven morphological analysis and disambiguation for Kazakh. CICLing.

License

LGPL v3.