Kmeans😋

two gaussians dataset	kmeans with 2 clusters	kmeans with 5 clusters

Minimal, readable, simple code for k-means algorithm in Pytorch.

Just several lines of code to understand the k-means algorithm intuitively.

# assume X is a tensor of shape (n_samples, d_features)
# and we want to cluster these samples into v_clusters clusters
X = torch.rand(n_samples, d_features)

# random select v_clusters rows from X as initial codebooks, (v_clusters, d_features)
codebooks = X[torch.randint(num_samples, (v_clusters,), dtype=torch.long)] 

for i in range(num_iterations):
    # Euclidean distance, (n_samples, v_clusters)
    distance = torch.cdist(X, codebooks, p=2) 

    # assign each sample to the nearest codebook, (n_samples,)
    codes = distance.argmin(dim=-1) 

    # update codebooks with the mean of samples in the same cluster
    for k in range(v_clusters):
        codebooks[k] = X[codes == k].mean(dim=0)

# codes: the cluster index or label of each sample, (n_samples,)
# codebooks: the final codebooks, (v_clusters, d_features)

For more details, please refer to the kmeans.ipynb, here also contains the animation of processing k-means algorithm.

pip install numpy torch einops

Note:

utils.py contains the code to generate the two gaussians dataset and the animation code.

more.py contains some of the product_quantization demo code when I was struggling with fairseq wav2vec2.0😭😭😭.

donglinkang2021/kmeans_torch

Kmeans😋

More