# kmeans 1.0.0

## Synopsis

``````SELECT kmeans(ARRAY[x, y, z], 10) OVER (), * FROM samples;
SELECT kmeans(ARRAY[x, y, z], 2, ARRAY[0.5, 0.5, 1.0, 1.0]) OVER (), * FROM samples;
SELECT kmeans(ARRAY[x, y, z], 2, ARRAY[ARRAY[0.5, 0.5], ARRAY[1.0, 1.0]]) OVER (PARTITION BY group_key), * FROM samples
``````

## Description

This module implements K-means clustering algorithm as a user-defined window function in PostgreSQL. K-means clustering is a simple way to classify data set by N-dimension vector.

It is known that the algorithm depends on how to define initial mean vectors. So the module provides two singature, one for auto-initialization and the other for user-given initial vectors.

The `kmeans` function calculates and returns class number which starts from 0, by using input vectors. The first argument is the vector, represented by array of float8. You must give the same-length 1-dimension array across all the input rows. The second argument is an integer K in "K-means", the number of class you want to partition. The third argument is optional, an array of vector which represents initial mean vectors. Since the algorithm is sensitive with initial mean vectors, and although the function calculates them by input vectors in the first form, you may sometimes want to give fixed mean vectors. The vectors can be passed as 1-d array or 2-d array of float8. In case of 1-d, the length of it must match k * lengthof(vector).

If the third argument, initial centroids, is not given, it scans all the input vectors and stores min/max for each element of vector. Then decide initial hypothetical vector elements by the next formula:

``````init = (max - min) * (i + 1) / (k + 1) + min
``````

where `i` varies from 0 to k-1. This means that the vector elements are decided as the liner interpolation from min to max divided by k. Then one of the input vectors nearest to this hypothetical vector is picked up as the centroid. Note that input vector is not picked more than twice as far as possible, to gain good result.

See input/kmeans.source as an example to call the function.

## History

### v1.1.0 (2011-07-22)

• Fix NULL input case, thanks to the report from Mike Toews
• Improve initialization of centroids

### v1.0.0 (2011-05-20)

• Initial version

## Support

This library is stored in an open GitHub repository. Feel free to fork and contribute! Please file bug reports via GitHub Issues.