pg_kazsearch 1.0.0

This Release

pg_kazsearch 1.0.0

Date

2026-04-02

Status

Stable

Latest Stable

pg_kazsearch 2.0.0 — 2026-04-03

Abstract

Kazakh full-text search dictionary for PostgreSQL

Description

The first PostgreSQL text search extension for the Kazakh language. It combines breadth-first (BFS) suffix stripping with vowel harmony validation, consonant mutation repair, a 21K-stem lexicon from Apertium-kaz, and stopword filtering.

Released By

darkhanakh

License

MIT

Extensions

pg_kazsearch 1.0.0: Kazakh stemmer and stopword dictionary for PostgreSQL FTS

README

pg_kazsearch

The first PostgreSQL full-text search extension for the Kazakh language.

Kazakh is heavily agglutinative: a single word like мектептерімізде ("in our schools") carries plural, possessive, and locative suffixes that must all be stripped to reach the root мектеп ("school"). No existing PostgreSQL or Elasticsearch analyzer handles this. pg_kazsearch fills that gap with a C extension that plugs directly into PostgreSQL’s text search pipeline.

What it does

-- Install and configure
CREATE EXTENSION pg_kazsearch;
CREATE TEXT SEARCH CONFIGURATION kazakh_cfg (PARSER = pg_catalog.default);
ALTER TEXT SEARCH CONFIGURATION kazakh_cfg
    ALTER MAPPING FOR word, hword, hword_part
    WITH pg_kazsearch_stop, pg_kazsearch_dict, simple;

-- Index your table (coalesce keeps the vector non-NULL when title or body is NULL)
ALTER TABLE articles ADD COLUMN fts_vector tsvector
    GENERATED ALWAYS AS (
        setweight(to_tsvector('kazakh_cfg', coalesce(title, '')), 'A') ||
        setweight(to_tsvector('kazakh_cfg', coalesce(body, '')), 'B')
    ) STORED;
CREATE INDEX idx_fts ON articles USING GIN (fts_vector);

-- Search in Kazakh
SELECT title FROM articles
WHERE fts_vector @@ websearch_to_tsquery('kazakh_cfg', 'президенттің жарлығы')
ORDER BY ts_rank_cd(fts_vector, websearch_to_tsquery('kazakh_cfg', 'президенттің жарлығы')) DESC
LIMIT 10;

The stemmer normalizes both query and document terms so президенттің (president’s) matches президент, мектептерімізде matches мектеп, and өзгеруі matches өзгеру.

Stemmer quality

Tested on 2,999 Kazakh news articles (tengrinews.kz) with 9,048 evaluation queries:

Metric	pg_kazsearch	pg_trgm (trigram)
Recall@10	0.784	0.635
MRR@10	0.712	0.566
nDCG@10	0.729	0.582
Query latency	0.5 ms	1.4 ms

pg_kazsearch beats trigram search by about 15 percentage points on Recall@10 (0.784 vs. 0.635).
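For readers unfamiliar with the three ranking metrics in the table, the sketch below shows their standard definitions for a single query. It assumes binary relevance (a document is either relevant or not); the README does not say how eval/run_eval.py computes them, so the function names and the binary-relevance choice here are illustrative assumptions. The reported figures would be averages of such per-query values.

```python
import math


def recall_at_k(ranked, relevant, k=10):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)


def mrr_at_k(ranked, relevant, k=10):
    """Reciprocal rank of the first relevant document in the top k, else 0."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k=10):
    """Binary-relevance nDCG: DCG of this ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal = sum(
        1.0 / math.log2(rank + 1)
        for rank in range(1, min(len(relevant), k) + 1)
    )
    return dcg / ideal if ideal else 0.0


# Hypothetical query: two relevant documents, one of them retrieved at rank 3.
ranked = ["d3", "d7", "d1", "d9"]
relevant = {"d1", "d4"}
print(recall_at_k(ranked, relevant), mrr_at_k(ranked, relevant), ndcg_at_k(ranked, relevant))
```

For this hypothetical query the values are Recall@10 = 0.5, MRR@10 = 1/3, and nDCG@10 ≈ 0.307.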

Stemmer examples

Input	Output	Morphology stripped
мектептерімізде	мектеп	plural + possessive + locative
президенттерінің	президент	plural + possessive + genitive
өзгеруі	өзгеру	verbal noun possessive
берді	бер	past tense
экономикалық	экономика	derivational adjective
алматыға	алматы	dative case (proper noun)
көмек	көмек	protected (lexicon-known root)
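The rows above can be reproduced with a deliberately small breadth-first sketch in Python. The suffix list and the six-word lexicon are hand-picked for these examples and are assumptions, not the extension's actual rule tables. The sketch also omits vowel harmony checks and penalty scoring: it simply prefers the longest candidate found in the lexicon and leaves unknown words unchanged.

```python
from collections import deque

# Hypothetical, hand-picked suffix inventory (case, possessive, plural,
# past tense, derivational). The real rule tables are far larger.
SUFFIXES = ["де", "да", "ға", "ге", "нің", "ның", "іміз", "ымыз", "і", "ы",
            "тер", "тар", "лер", "лар", "дер", "дар", "ді", "ды", "лық", "лік"]

# Tiny stand-in for kaz_stems.dict.
LEXICON = {"мектеп", "президент", "өзгеру", "бер", "экономика", "алматы", "көмек"}


def candidates(word, min_len=2):
    """Breadth-first search over every form reachable by peeling known suffixes."""
    seen = {word}
    queue = deque([word])
    while queue:
        current = queue.popleft()
        for suffix in SUFFIXES:
            rest = current[:-len(suffix)]
            if current.endswith(suffix) and len(rest) >= min_len and rest not in seen:
                seen.add(rest)
                queue.append(rest)
    return seen


def stem(word):
    """Return a known root unchanged, else the longest lexicon hit, else the word."""
    word = word.lower()
    if word in LEXICON:  # protected root, e.g. көмек
        return word
    hits = [c for c in candidates(word) if c in LEXICON]
    return max(hits, key=len) if hits else word


for w in ["мектептерімізде", "президенттерінің", "өзгеруі", "берді",
          "экономикалық", "алматыға", "көмек"]:
    print(w, "->", stem(w))
```

Breadth-first exploration matters because one surface form can be segmented in several ways; collecting all candidates first and ranking them afterwards avoids committing to the first suffix that happens to match.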

Architecture

The extension consists of:

BFS suffix stripper (kaz_explore.c) — breadth-first search over layered suffix rules (predicate, case, possessive, plural, derivational for nouns; person, tense, negation, voice for verbs), with vowel harmony validation and phonological guards
Penalty scoring (kaz_explore.c) — candidates scored by syllable count, suffix weakness, derivational depth, and lexicon hits to pick the best stem
Lexicon (kaz_stems.dict) — 21,863 POS-tagged stems extracted from Apertium-kaz’s morphological transducer, filtered to root forms only (nouns, verbs, adjectives, place names)
Stopwords (kaz_stopwords.stop) — 53 Kazakh function words filtered before stemming
Vowel harmony (kaz_text.c) — back/front vowel classification with glide exclusion (у/и/ю treated as consonants for harmony) and tail-based fallback for loanwords
Stem repair (kaz_explore.c, pg_kazsearch.c) — consonant mutation reversal (б→п, г→к, ғ→қ), vowel elision restoration, and lexicon-based vowel append for proper nouns
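Two of the components above, vowel harmony validation and consonant mutation repair, are easy to illustrate in isolation. The Python sketch below is a simplification under stated assumptions: the vowel sets cover only the core native vowels, the class is taken from the last harmonic vowel (one reading of the tail-based fallback), and the function names are hypothetical rather than taken from kaz_text.c or kaz_explore.c.

```python
# Assumed vowel classes; у/и/ю are deliberately absent so they are skipped
# like consonants when determining harmony.
BACK_VOWELS = set("аоұы")
FRONT_VOWELS = set("әеөүі")

# Reverse of the stem-final voicing that occurs before vowel-initial suffixes.
DEVOICE = {"б": "п", "г": "к", "ғ": "қ"}


def harmony_class(text):
    """Classify text as 'back' or 'front' by its last harmonic vowel."""
    for ch in reversed(text):
        if ch in BACK_VOWELS:
            return "back"
        if ch in FRONT_VOWELS:
            return "front"
    return None  # no harmonic vowel found


def suffix_agrees(stem, suffix):
    """Accept a suffix when its class matches the class of the stem's tail."""
    stem_cls, suffix_cls = harmony_class(stem), harmony_class(suffix)
    return stem_cls is None or suffix_cls is None or stem_cls == suffix_cls


def repair_mutation(stem, lexicon):
    """Undo б→п, г→к, ғ→қ at the stem end when that yields a known root."""
    if not stem or stem in lexicon:
        return stem
    restored = stem[:-1] + DEVOICE.get(stem[-1], stem[-1])
    return restored if restored in lexicon else stem


print(suffix_agrees("мектеп", "де"), suffix_agrees("мектеп", "да"))
print(repair_mutation("кітаб", {"кітап"}))  # кітабы minus the possessive ы
```

Checking harmony before accepting a strip prunes false segmentations early: a front-vowel stem such as мектеп cannot have taken the back-vowel locative да. The repair step handles forms like кітабы, where removing the possessive leaves кітаб rather than the dictionary form кітап.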

Quick start

# Prerequisites: Docker
git clone https://github.com/darkhanakh/pg-kazsearch.git
cd pg-kazsearch

make up          # start PostgreSQL with the extension
make reload      # build, install, and configure kazakh_cfg
make test-ext    # smoke test stemmer + tsvector

# Load the evaluation corpus (optional)
python3 eval/load_corpus.py --input data/corpus/articles.jsonl

# Run the evaluation
python3 eval/run_eval.py --trgm-sample 500

Project structure

Directory	Contents
`src/pg_kazsearch/`	C extension: stemmer dictionary, suffix rules, BFS explorer, text utilities, lexicon loader
`data/tsearch_data/`	Stem dictionary (`kaz_stems.dict`) and stopword list (`kaz_stopwords.stop`)
`scripts/`	`build_lexicon.py` — extracts POS-tagged lemmas from Apertium-kaz
`eval/`	Evaluation pipeline: scraper, corpus loader, query generator, FTS vs trigram eval
`docker/`	Dockerfile and init SQL for local development
`prototype/`	Python stemmer prototypes (v1-v3) used during research phase
`benchmark/`	Performance and parity benchmarks

Lexicon

The stem dictionary is built from Apertium-kaz, a linguistically vetted morphological transducer for Kazakh. Only entries with explicit POS continuation classes are included:

N1/N5/N6 — common nouns (13,900+)
V-TV/V-IV — transitive/intransitive verbs (3,500+)
A1/A2 — base adjectives (3,200+)
NP-TOP/NP-ORG — place names and organizations (1,800+)
ADV/NUM — adverbs and numerals (900+)

Derived adjectives (A3/A4), personal names (NP-ANT/NP-COG), and inflected forms are excluded to keep the dictionary clean for stemmer disambiguation.

Rebuild with:

python3 scripts/build_lexicon.py
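The inclusion rules above amount to a filter on continuation classes. The Python sketch below shows one way such a filter could look. It assumes lexc-style entries of the form `lemma:surface CLASS ; ! "gloss"`; the sample lines are made up for illustration and are not copied from Apertium-kaz, and the regular expression, lowercasing, and function name are assumptions rather than the behavior of scripts/build_lexicon.py.

```python
import re

# Continuation classes the README lists as included.
KEEP = {"N1", "N5", "N6", "V-TV", "V-IV", "A1", "A2",
        "NP-TOP", "NP-ORG", "ADV", "NUM"}

# lemma, optional ":surface", whitespace, continuation class, then ";".
ENTRY = re.compile(r"^(?P<lemma>[^\s:!]+)(?::\S*)?\s+(?P<cls>[A-Za-z0-9-]+)\s*;")


def extract_stems(lines):
    """Map each kept lemma (lowercased) to the first kept class seen for it."""
    stems = {}
    for line in lines:
        match = ENTRY.match(line.strip())
        if match and match.group("cls") in KEEP:
            stems.setdefault(match.group("lemma").lower(), match.group("cls"))
    return stems


# Illustrative entries only; class assignments are hypothetical.
sample = [
    'мектеп:мектеп N1 ; ! "school"',
    'бер:бер V-TV ; ! "give"',
    'экономикалық:экономикалық A4 ; ! "economic"',
    'Алматы:Алматы NP-TOP ; ! "Almaty"',
    'Абай:Абай NP-ANT ; ! "Abai"',
    '! a comment line',
]
print(extract_stems(sample))
```

In this sample the derived adjective (A4) and the personal name (NP-ANT) are dropped, matching the exclusions described above, while the place name is kept in lowercase.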

License

Code: MIT
Lexicon data derived from Apertium-kaz (GPL-3.0) and KazNU morphology resources (CC BY-SA).
