C4E, the LIPN Big Data Clustering Library gathering algorithms and quality measures.

Primary language: Scala. License: Apache License 2.0.

Clustering 4 Ever

Welcome to the LIPN Big Data Clustering Library gathering algorithms and quality indexes.

You will find additional content about clustering algorithms here.

Don't hesitate to ask questions or make recommendations on our Gitter.

Basic usage of the implemented algorithms is demonstrated with SparkNotebooks in the Spark-Clustering-Notebook organization.

Include it in your project

Add the following to your build.sbt:

  • "clustering4ever" % "clustering4ever_2.11" % "0.2.3" to your libraryDependencies
  • resolvers += Resolver.bintrayRepo("clustering4ever", "Clustering4Ever")
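
Put together, the two entries above could look like the following build.sbt fragment. This is a minimal sketch using only the coordinates given above. The `_2.11` suffix in the artifact name suggests the project should be built with Scala 2.11.x; the exact Scala version to set is not stated here and is left to you.

```scala
// build.sbt (fragment)

// Repository hosting the Clustering4Ever artifacts, as listed above
resolvers += Resolver.bintrayRepo("clustering4ever", "Clustering4Ever")

// Full library; the artifact name is tied to the Scala 2.11 binary version
libraryDependencies += "clustering4ever" % "clustering4ever_2.11" % "0.2.3"
```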

You can also depend on specific parts only:

  • core
  • clusteringscala
  • clusteringspark

Distributed algorithms, through Spark

Clustering algorithms

Scalar data

Batch
  • K-Means
    • Implementation allowing the choice of the dissimilarity measure.
    • Complexity O(k.n.t)
    • Warning* -> works only with Euclidean distance at the moment
  • Self Organizing Maps
  • Mean Shift
    • Complexity
      • Initial complexity O(n²)
      • Improved complexity O(n) under some conditions
Streaming

Binary data

  • K-Modes
    • Complexity O(k.n.t)
    • Implementation allowing the choice of the dissimilarity measure.
    • Warning* -> works only with Hamming distance at the moment

Mixed data

  • Self Organizing Maps
    • Mixed topological Map

* We deliberately chose not to implement distances other than Hamming and Euclidean in the Spark versions of K-Modes and K-Means, for the reasons explained under their pure Scala counterparts.

Preprocessing algorithms

  • Gradient ascent
  • Feature selection

Pure Scala algorithms

Clustering algorithms

A good complementary Scala clustering library is Smile.

Scalar data

  • Jenks Natural Breaks
    • A one-dimensional clustering method
  • K-Means
    • Complexity O(k.n.t)
    • Implementation allowing the choice of the dissimilarity measure.
    • Warning -> with a distance other than Euclidean, a similarity matrix is computed for each cluster in O(n²) to find the best prototype; depending on cluster size, this can be much slower than the Euclidean case
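
To illustrate the warning above, here is a minimal sketch in plain Scala of the two ways a cluster prototype can be obtained. It is not the library's API: `PrototypeSketch`, `meanPrototype`, `medoidPrototype` and the `Dissimilarity` alias are hypothetical names chosen for this example. With the Euclidean distance the prototype is the coordinate-wise mean, computed in one pass. With an arbitrary dissimilarity, the sketch assumes the prototype is the cluster member with the smallest total dissimilarity to all other members, which needs every pairwise value and is therefore O(n²) per cluster.

```scala
object PrototypeSketch {
  type Point = Vector[Double]
  type Dissimilarity = (Point, Point) => Double

  val euclidean: Dissimilarity = (a, b) =>
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  val manhattan: Dissimilarity = (a, b) =>
    a.zip(b).map { case (x, y) => math.abs(x - y) }.sum

  // Euclidean case: coordinate-wise mean, a single pass over the cluster.
  def meanPrototype(cluster: Seq[Point]): Point =
    cluster.transpose.map(col => col.sum / col.size).toVector

  // Generic case (assumption of this sketch): pick the member minimising the
  // sum of dissimilarities to all members, i.e. n * n evaluations of d.
  // Throws on an empty cluster.
  def medoidPrototype(cluster: Seq[Point], d: Dissimilarity): Point =
    cluster.minBy(p => cluster.map(q => d(p, q)).sum)
}
```

For example, `PrototypeSketch.medoidPrototype(cluster, PrototypeSketch.manhattan)` always returns one of the input points, whereas `meanPrototype` generally returns a point that is not in the cluster.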

Binary data

  • K-Modes
    • Complexity O(k.n.t)
    • Implementation allowing the choice of the dissimilarity measure.
    • Warning -> with a distance other than Hamming, a similarity matrix is computed for each cluster in O(n²) to find the best prototype; depending on cluster size, this can be much slower than the Hamming case
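
The Hamming case is cheap because the prototype of a binary cluster can be obtained by a per-dimension majority vote, without comparing points pairwise. The following plain Scala sketch shows the idea. The names `KModesSketch`, `hamming`, `modePrototype` and `assign` are hypothetical, and resolving ties towards 1 is an arbitrary choice made for this example, not something taken from the library.

```scala
object KModesSketch {
  type BinaryPoint = Vector[Int] // each component is 0 or 1

  // Number of positions where the two vectors differ.
  def hamming(a: BinaryPoint, b: BinaryPoint): Int =
    a.zip(b).count { case (x, y) => x != y }

  // Mode of the cluster: majority value in each dimension.
  // Ties go to 1 (arbitrary choice for this sketch).
  def modePrototype(cluster: Seq[BinaryPoint]): BinaryPoint =
    cluster.transpose.map(col => if (2 * col.sum >= col.size) 1 else 0).toVector

  // Index of the closest prototype under the Hamming distance.
  def assign(p: BinaryPoint, prototypes: Seq[BinaryPoint]): Int =
    prototypes.indices.minBy(i => hamming(p, prototypes(i)))
}
```

One K-Modes iteration then alternates `assign` over all points with `modePrototype` over each resulting cluster, which is where the O(k.n.t) figure comes from for k prototypes, n points and t iterations.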

External indexes

  • Mutual Information (scala & spark)
  • Normalized Mutual Information (scala & spark)
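
As a reminder of what these two indexes measure, here is a minimal plain Scala sketch that compares a predicted labelling with a reference labelling. It does not reproduce the library's API: `MutualInformationSketch` and its method names are hypothetical. Natural logarithms are used, and the normalisation by sqrt(H(a) * H(b)) is only one of several common conventions, chosen here as an assumption.

```scala
object MutualInformationSketch {
  private def counts[A](xs: Seq[A]): Map[A, Int] =
    xs.groupBy(identity).map { case (k, v) => k -> v.size }

  // Shannon entropy (natural log) from label counts over n points.
  private def entropy(cs: Iterable[Int], n: Double): Double =
    cs.toSeq.map { c =>
      val p = c / n
      -p * math.log(p)
    }.sum

  // I(a; b) = sum over label pairs of p(x, y) * log(p(x, y) / (p(x) * p(y)))
  def mutualInformation(a: Seq[Int], b: Seq[Int]): Double = {
    require(a.size == b.size && a.nonEmpty, "labelings must be non-empty and of equal size")
    val n = a.size.toDouble
    val ca = counts(a)
    val cb = counts(b)
    counts(a.zip(b)).toSeq.map { case ((x, y), c) =>
      val pxy = c / n
      pxy * math.log(pxy * n * n / (ca(x).toDouble * cb(y)))
    }.sum
  }

  // Normalised variant; returns 0.0 when one labelling has a single
  // cluster (degenerate case, convention chosen for this sketch).
  def normalizedMutualInformation(a: Seq[Int], b: Seq[Int]): Double = {
    val mi = mutualInformation(a, b)
    val n = a.size.toDouble
    val denom = math.sqrt(entropy(counts(a).values, n) * entropy(counts(b).values, n))
    if (denom == 0.0) 0.0 else mi / denom
  }
}
```

Both measures ignore the actual label values: two labellings that define the same partition under different cluster ids get a normalised score of 1, and two independent labellings get 0.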

Internal indexes

  • Davies-Bouldin (scala)
  • Silhouette (scala)
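
Internal indexes need only the data and the cluster labels. As an illustration, here is a minimal plain Scala sketch of the mean Silhouette score with the Euclidean distance. The names `SilhouetteSketch` and `silhouette` are hypothetical and this is not the library's implementation. For each point, a is the mean distance to the other members of its cluster, b is the smallest mean distance to another cluster, and the score is (b - a) / max(a, b); points in singleton clusters score 0.

```scala
object SilhouetteSketch {
  type Point = Vector[Double]

  def euclidean(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Mean Silhouette over all points, computed naively in O(n^2).
  def silhouette(points: Seq[Point], labels: Seq[Int]): Double = {
    require(points.size == labels.size, "one label per point")
    require(labels.distinct.size >= 2, "at least two clusters are needed")
    val byCluster: Map[Int, Seq[Point]] =
      points.zip(labels).groupBy(_._2).map { case (l, ps) => l -> ps.map(_._1) }

    val scores = points.zip(labels).map { case (p, l) =>
      val own = byCluster(l)
      if (own.size == 1) 0.0
      else {
        // The distance from p to itself is 0, so divide by (size - 1).
        val a = own.map(euclidean(p, _)).sum / (own.size - 1)
        val b = (byCluster - l).values
          .map(c => c.map(euclidean(p, _)).sum / c.size)
          .min
        val m = math.max(a, b)
        if (m == 0.0) 0.0 else (b - a) / m
      }
    }
    scores.sum / scores.size
  }
}
```

Scores close to 1 indicate compact, well separated clusters, while negative scores suggest that many points sit closer to another cluster than to their own.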