# kmeans

k-means clustering algorithm implementation written in Go

## What It Does

[k-means clustering](https://en.wikipedia.org/wiki/K-means_clustering) partitions
a multi-dimensional data set into `k` clusters, where each data point belongs
to the cluster with the nearest mean. That mean serves as a prototype of the
cluster.

![kmeans animation](https://github.com/muesli/kmeans/blob/master/kmeans.gif)
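
The "nearest mean" rule can be illustrated with a minimal, self-contained
sketch. It is not this library's code: the helper names `squaredDistance` and
`nearestMean` are hypothetical, and plain `[]float64` slices stand in for the
library's own types.

```go
package main

import (
	"fmt"
	"math"
)

// squaredDistance returns the squared Euclidean distance between two
// points of equal dimension.
func squaredDistance(a, b []float64) float64 {
	var sum float64
	for i := range a {
		diff := a[i] - b[i]
		sum += diff * diff
	}
	return sum
}

// nearestMean returns the index of the mean closest to point p.
// It returns -1 if means is empty. (Hypothetical helper for illustration.)
func nearestMean(p []float64, means [][]float64) int {
	best, bestDist := -1, math.Inf(1)
	for i, m := range means {
		if d := squaredDistance(p, m); d < bestDist {
			best, bestDist = i, d
		}
	}
	return best
}

func main() {
	means := [][]float64{{0.2, 0.2}, {0.8, 0.8}}
	fmt.Println(nearestMean([]float64{0.1, 0.3}, means)) // 0
	fmt.Println(nearestMean([]float64{0.9, 0.6}, means)) // 1
}
```

Comparing squared distances avoids a square root per comparison and yields the
same nearest mean.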

## When Should I Use It?

- You have numeric, multi-dimensional data sets
- You don't have labels for your data
- You know exactly how many clusters you want to partition your data into

## Example

```go
package main

import (
	"fmt"
	"log"
	"math/rand"

	"github.com/muesli/clusters"
	"github.com/muesli/kmeans"
)

func main() {
	// Set up a random two-dimensional data set (float64 values between 0.0 and 1.0)
	var d clusters.Observations
	for x := 0; x < 1024; x++ {
		d = append(d, clusters.Coordinates{
			rand.Float64(),
			rand.Float64(),
		})
	}

	// Partition the data points into 16 clusters
	km := kmeans.New()
	partitions, err := km.Partition(d, 16)
	if err != nil {
		log.Fatal(err)
	}

	// Print each cluster's center and the data points assigned to it
	for _, c := range partitions {
		fmt.Printf("Centered at x: %.2f y: %.2f\n", c.Center[0], c.Center[1])
		fmt.Printf("Matching data points: %+v\n\n", c.Observations)
	}
}
```

## Complexity

If `k` (the number of clusters) and `d` (the number of dimensions) are fixed,
the k-means problem can be solved exactly in time O(n<sup>dk+1</sup>), where
`n` is the number of entities to be clustered.

The running time of the iterative algorithm is O(nkdi), where `n` is the number
of `d`-dimensional vectors, `k` the number of clusters and `i` the number of
iterations needed until convergence. On data that does have a clustering
structure, the number of iterations until convergence is often small, and
results only improve slightly after the first dozen iterations. The algorithm
is therefore often considered to be of "linear" complexity in practice,
although its worst-case running time is superpolynomial when performed until
convergence.
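
To get a feel for the O(nkdi) term, the sketch below multiplies the four
factors using the values from the example above (`n` = 1024, `k` = 16,
`d` = 2). The iteration count of 10 is an assumption made purely for
illustration, and `distanceOps` is a hypothetical helper, not part of the
library.

```go
package main

import "fmt"

// distanceOps estimates the number of per-coordinate operations needed for
// i iterations over n points with d dimensions and k clusters.
// (Hypothetical helper; it simply evaluates n*k*d*i.)
func distanceOps(n, k, d, i int) int {
	return n * k * d * i
}

func main() {
	// n, k and d come from the example above; i = 10 is an assumed value.
	fmt.Println(distanceOps(1024, 16, 2, 10)) // 327680
}
```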

## Options

You can greatly reduce the running time by adjusting the required delta
threshold. With the following options the algorithm finishes when fewer than 5%
of the data points shifted their cluster assignment in the last iteration:

```go
km, err := kmeans.NewWithOptions(0.05, nil)
```

The default setting for the delta threshold is 0.01 (1%).
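
The stopping rule described above can be sketched as follows. This is an
illustration under stated assumptions, not the library's actual
implementation: `changedFraction` and `converged` are hypothetical helpers,
and cluster assignments are represented as plain `[]int` slices holding one
cluster index per data point.

```go
package main

import "fmt"

// changedFraction returns the share of points whose cluster index differs
// between two consecutive iterations. Both slices must have equal length.
func changedFraction(prev, curr []int) float64 {
	if len(prev) == 0 {
		return 0
	}
	changed := 0
	for i := range prev {
		if prev[i] != curr[i] {
			changed++
		}
	}
	return float64(changed) / float64(len(prev))
}

// converged reports whether the share of points that changed cluster is
// below deltaThreshold (for example 0.05 for 5%).
func converged(prev, curr []int, deltaThreshold float64) bool {
	return changedFraction(prev, curr) < deltaThreshold
}

func main() {
	prev := make([]int, 100) // all 100 points start in cluster 0
	curr := make([]int, 100)
	for i := 0; i < 3; i++ {
		curr[i] = 1 // 3 of 100 points move to cluster 1
	}
	fmt.Println(converged(prev, curr, 0.05)) // true: 3% is below 5%
	fmt.Println(converged(prev, curr, 0.01)) // false: 3% is not below 1%
}
```

A larger threshold stops earlier and trades some accuracy for speed.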

If you are working with two-dimensional data sets, kmeans can generate
beautiful graphs (like the one above) for each iteration of the algorithm:

```go
km, err := kmeans.NewWithOptions(0.01, kmeans.SimplePlotter{})
```

Careful: this will generate PNGs in your current working directory.

You can write your own plotters by implementing the `kmeans.Plotter` interface.

## Development

[![GoDoc](https://godoc.org/github.com/golang/gddo?status.svg)](https://godoc.org/github.com/muesli/kmeans)
[![Build Status](https://travis-ci.org/muesli/kmeans.svg?branch=master)](https://travis-ci.org/muesli/kmeans)
[![Coverage Status](https://coveralls.io/repos/github/muesli/kmeans/badge.svg?branch=master)](https://coveralls.io/github/muesli/kmeans?branch=master)
[![Go ReportCard](http://goreportcard.com/badge/muesli/kmeans)](http://goreportcard.com/report/muesli/kmeans)