Clustering 4 Ever

Welcome to the LIPN Big Data Clustering Library, which gathers clustering algorithms and quality indexes.

Additional content about clustering algorithms is available here.

Don't hesitate to ask questions or make recommendations in our Gitter.

API documentation

SparkNotebook

Basic usage of the implemented algorithms is demonstrated with SparkNotebooks in the Spark-Clustering-Notebook organization.

Include it in your project

Add the following lines to your build.sbt:

libraryDependencies += "clustering4ever" % "clustering4ever_2.11" % "0.2.3"
resolvers += Resolver.bintrayRepo("clustering4ever", "Clustering4Ever")

You can also include only specific parts:

core
clusteringscala
clusteringspark

Distributed algorithms, through Spark

Clustering algorithms

Scalar data

Batch

K-Means
- Implementation allowing the choice of the dissimilarity measure.
- Complexity O(k.n.t)
- Warning* -> works only with Euclidean distance at the moment
Self Organizing Maps
Mean Shift
- Complexity
  - Initial complexity O(n²)
  - Improved complexity O(n) under some conditions
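
The O(k.n.t) complexity quoted for K-Means can be read directly off a plain Lloyd-style loop: t iterations, each comparing n points against k centres. The following is a minimal local sketch under that assumption, not the library's Spark implementation; KMeansSketch and all of its members are hypothetical names, and it uses the Euclidean distance only.

```scala
object KMeansSketch {
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def squaredEuclidean(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Component-wise mean of a non-empty cluster.
  def mean(cluster: Seq[Point]): Point =
    cluster.transpose.map(_.sum / cluster.size).toVector

  // Index of the centre closest to p: k distance evaluations.
  def closest(p: Point, centers: Seq[Point]): Int =
    centers.indices.minBy(i => squaredEuclidean(p, centers(i)))

  // t iterations, each assigning n points against k centres: O(k.n.t).
  // A centre that receives no point keeps its previous position.
  def run(data: Seq[Point], init: Seq[Point], t: Int): Seq[Point] =
    (1 to t).foldLeft(init) { (centers, _) =>
      val assigned = data.groupBy(p => closest(p, centers))
      centers.indices.map(i => assigned.get(i).map(mean).getOrElse(centers(i)))
    }
}
```

Calling KMeansSketch.run(data, init, t) returns the k centres after t iterations; the initial centres are supplied by the caller, so no initialisation strategy is implied here.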

Streaming

GStream

Binary data

K-Modes
- Complexity O(k.n.t)
- Implementation allowing the choice of the dissimilarity measure.
- Warning* -> works only with Hamming distance at the moment
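
With the Hamming distance, the prototype of a cluster of binary vectors can be obtained by a per-component majority vote, which minimises the summed Hamming distance to the cluster members. The sketch below illustrates these two building blocks only; KModesSketch and its members are hypothetical names and do not reflect the library's API.

```scala
object KModesSketch {
  type BinaryVector = Vector[Int] // each component is assumed to be 0 or 1

  // Hamming distance: number of positions where the two vectors differ.
  def hamming(a: BinaryVector, b: BinaryVector): Int =
    a.zip(b).count { case (x, y) => x != y }

  // Mode of a non-empty cluster: per-component majority vote (ties go to 1).
  def mode(cluster: Seq[BinaryVector]): BinaryVector =
    cluster.transpose.map(col => if (2 * col.sum >= cluster.size) 1 else 0).toVector
}
```

Plugging hamming and mode into a Lloyd-style loop in place of the Euclidean distance and the mean gives the usual K-Modes iteration.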

Mixed data

Self Organizing Maps
- Mixed topological Map

* We deliberately chose not to implement distances other than Hamming and Euclidean for the Spark versions of K-Modes and K-Means, for the reasons explained under their pure Scala counterparts.

Preprocessing algorithms

Gradient ascent
Feature selection

Pure Scala algorithms

Clustering algorithms

A good complementary Scala clustering library is Smile.

Scalar data

Jenks Natural Breaks
- A one-dimensional clustering method
K-Means
- Complexity O(k.n.t)
- Implementation allowing the choice of the dissimilarity measure.
- Warning -> with a distance other than Euclidean, an O(n²) similarity matrix is computed for each cluster to find the best prototype; depending on cluster size, this can be much slower than the Euclidean case
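
To illustrate where the O(n²) cost in the warning comes from: when no closed-form prototype such as the mean exists for the chosen distance, one option is to pick the cluster member with the smallest summed distance to all other members, which requires comparing every pair. This is a minimal sketch of that idea, assuming a member-based prototype; MedoidSketch, medoid and manhattan are hypothetical names, not the library's API.

```scala
object MedoidSketch {
  // Returns the member of a non-empty cluster that minimises the summed
  // distance to all members. Every pair is compared, hence O(n²) for size n.
  def medoid[A](cluster: Seq[A])(distance: (A, A) => Double): A =
    cluster.minBy(candidate => cluster.map(other => distance(candidate, other)).sum)

  // Manhattan distance, used here only as an example of a non-Euclidean measure.
  def manhattan(a: Vector[Double], b: Vector[Double]): Double =
    a.zip(b).map { case (x, y) => math.abs(x - y) }.sum
}
```

For example, MedoidSketch.medoid(points)(MedoidSketch.manhattan) selects the prototype of points under the Manhattan distance; any other (A, A) => Double function can be passed instead.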

Binary data

K-Modes
- Complexity O(k.n.t)
- Implementation allowing the choice of the dissimilarity measure.
- Warning -> with a distance other than Hamming, an O(n²) similarity matrix is computed for each cluster to find the best prototype; depending on cluster size, this can be much slower than the Hamming case

Quality Indexes

External indexes

Internal indexes