Contents
kazsearch — Kazakh Stemmer for Elasticsearch
Analysis plugin that adds a kazsearch_stem token filter for Kazakh full-text search.
BFS suffix-stripping with vowel harmony enforcement — the first Kazakh stemmer for Elasticsearch.
Compatibility
| Plugin version | Elasticsearch | Java | Architecture |
|---|---|---|---|
| 0.1.0 | 8.17.x | 21+ | linux/aarch64, linux/x86_64, darwin/aarch64 |
Quick Start
The install script auto-downloads the latest release from GitHub:
# Download latest release + install into local ES:
./elastic/install.sh
# Or a specific version:
./elastic/install.sh --version 2.1.0
# Or Docker (no ES install needed):
./elastic/install.sh --docker
The release zip is self-contained — Java jar + native Rust library for both x86_64 and aarch64.
Install Methods
Method 1: Auto-download (recommended)
# Downloads latest from GitHub Releases, installs into ES_HOME
./elastic/install.sh
sudo systemctl restart elasticsearch
Method 2: Manual download
Grab analysis-kazsearch-<version>.zip from GitHub Releases, then:
bin/elasticsearch-plugin install file:///path/to/analysis-kazsearch-2.1.0.zip
sudo systemctl restart elasticsearch
Method 3: Docker
# From release:
./elastic/install.sh --docker --version 2.1.0
# Or from source:
./elastic/install.sh --docker
# Then run:
docker run -d --name es-kazsearch \
-p 9200:9200 \
-e discovery.type=single-node \
-e xpack.security.enabled=false \
kazsearch-elastic:2.1.0
Method 4: Build from source
Requirements: Rust toolchain, Java 21, Gradle.
./elastic/install.sh --build
sudo systemctl restart elasticsearch
Usage
1. Create an index with the Kazakh analyzer
PUT /articles
{
"settings": {
"analysis": {
"filter": {
"kaz_stem": { "type": "kazsearch_stem" }
},
"analyzer": {
"kazakh": {
"type": "custom",
"tokenizer": "standard",
"filter": ["lowercase", "kaz_stem"]
}
}
}
},
"mappings": {
"properties": {
"title": { "type": "text", "analyzer": "kazakh" },
"body": { "type": "text", "analyzer": "kazakh" }
}
}
}
2. Test the analyzer
curl -X POST "localhost:9200/articles/_analyze" \
-H 'Content-Type: application/json' \
-d '{"analyzer": "kazakh", "text": "мектептерде оқушылар"}'
Result: ["мектеп", "оқушы"] — suffixes stripped, stems returned.
3. Index documents
curl -X POST "localhost:9200/articles/_doc" \
-H 'Content-Type: application/json' \
-d '{"title": "Мектептерде жаңа оқу жылы", "body": "Оқушылар жаңа оқулықтарды алды."}'
4. Search with any inflected form
# Searching "мектептегі" (at school) finds docs containing
# "мектептерде" (in schools), "мектепте" (at school), etc.
curl -X POST "localhost:9200/articles/_search" \
-H 'Content-Type: application/json' \
-d '{"query": {"match": {"body": "мектептегі оқушыларды"}}}'
What It Does
Kazakh is agglutinative — a single word can stack 5–6 suffixes:
мектептерінде = мектеп + тер (plural) + і (possessive) + нде (locative)
"in their schools"
The plugin stems all inflected forms to a common root, so search works regardless of which grammatical form appears in the document or query:
| Query form | Document form | Both stem to |
|---|---|---|
| мектептерде | мектептің | мектеп |
| оқушыларды | оқушылар | оқушы |
| дәрігерлерге | дәрігерлердің | дәрігер |
| технологияларды | технологиясын | технология |
Recommended Index Pattern
For best results, index with both a stemmed and a standard (unstemmed) sub-field:
{
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "kazakh",
"fields": {
"exact": { "type": "text", "analyzer": "standard" }
}
}
}
}
}
This lets you boost exact matches while still getting stemmed recall:
{
"query": {
"multi_match": {
"query": "Қазақстан экономикасы",
"fields": ["title^3", "title.exact^5", "body"]
}
}
}
Architecture
┌─────────────────────────────────────────────┐
│ Elasticsearch 8.17 │
│ │
│ ┌──────────────────────────────────────┐ │
│ │ analysis-kazsearch plugin (Java) │ │
│ │ │ │
│ │ KazakhStemTokenFilter │ │
│ │ └─► KazakhStemmerNative (JNI) │ │
│ │ └─► libkazsearch_elastic.so │ │
│ │ └─► kazsearch-core │ │
│ │ (pure Rust) │ │
│ └──────────────────────────────────────┘ │
└─────────────────────────────────────────────┘
The stemmer logic lives in kazsearch-core (pure Rust, no ES/PG dependencies). The same core powers both the PostgreSQL extension and this Elasticsearch plugin.
Troubleshooting
Plugin fails to load with UnsatisfiedLinkError:
The native library for your platform is missing from the zip. Rebuild with ./scripts/build_elastic_native.sh on a machine matching your ES server’s OS/arch, then gradle bundlePlugin.
No stemming effect (tokens unchanged):
Make sure your index analyzer uses the kazsearch_stem filter. Test with _analyze API first.
ES won’t start after install:
Check ES version matches (8.17.x). Check elasticsearch.log for the exact error. Run bin/elasticsearch-plugin remove analysis-kazsearch to uninstall.