This crate is based on ideas and algorithms originally developed in C# in this repository:
https://github.com/paulchernoch/HilbertTransformation
For an understanding of the theory, see SLASH.pdf in the clusterphobia
github repo:
https://github.com/paulchernoch/clusterphobia
Clusterphobia is intended to provide clustering tools for scalable, high-dimensional, unassisted, flat, single classification of data.
- Scalable: A bedrock feature of the library is the use of the Hilbert Curve (from the hilbert crate) to speed up k-nearest neighbor searches. This algorithm is linear in the number of dimensions, whereas many others grow exponentially with the number of dimensions. Parts of the algorithm have running times proportional to N Log N, where N is the number of points being clustered.
- High-dimensional - The algorithm will work with up to ten thousand dimensions.
- Unassisted - Clustering is performed on unlabeled data.
- Flat - The algorithms group items into flat, mutually exclusive sets of categories, not hierarchies.
- Single - Each item is assigned a single category. Items cannot be part of multiple categories.
Two main approaches will be used to cluster data:
- bottom-up, single-link agglomerative clustering
- bottom-up, density-based clustering
Once complete, the crate will have these main areas of functionality:
- Data preparation : Adjust raw data until the coordinate values are commensurate, non-negative integers. The adjustments include scaling, shifting and weighting each dimension. For documents modeled as bags-of-words, a novel, randomized transformation will be used. (Some of the transformations are supplied by the Hilbert crate.)
- Point representation : Convert the prepared data into
Points
suitable for the clustering algorithms. (ThePoint
structure is defined in the Hilbert crate and employs an optimized Euclidean distance formula, critical for nearest neighbor searches.) - Linkage analysis - The
Point
data is analyzed (using the Hilbert curve) and the linkage distance and density threshold are derived. These values are crucial for the clustering algorithms. (Two points separated by less than the linkage distance are assumed to belong to the same cluster. If a set number of points fall inside a region whose radius does not exceed the density threshold, thePoint
density is considered high enough to force a clustering.) - Cluster representation - A
Clustering
holds oneCluster
for each category in the partitioning scheme. It supports operations like adding points to clusters, recategorizing them, and merging clusters. TheClustering
is the principal output of the clustering process. - Cluster Similarity - To test and iteratively refine choices of dimensions, scaling factors, thresholds and tunable parameters, you need to be able to compare a perfect, assisted solution to the unassisted clustering. This crate provides an optimized implementation of the B-Cubed measure with a linear (not a quadratic) runtime.
- Clustering algorithms - These algorithms will arrange a collection of
Points
into aClustering
. - Problem data - The same test data sets show up in numerous papers, because they contain features that defeat many clustering algorithms, such as spirals, clusters with noise or partial overlaps. The algorithms in this crate will be tuned to be able to address as many as possible of these problems.
The following features are ready for use:
- Some data preparation transforms (from the hilbert crate), including
IntegerDataRange
andFloatDataRange
. - Hilbert Curve Transformation, permutations and sorting (from the hilbert crate)
Point
struct with optimized distance formula (from the hilbert crate)Clustering
struct which can be used to build and modify classification schemes.BCubed
struct which can represent a similarity measure and compute the similarity between two clusters (essential for unit tests and tuning).
The literature is teaming with algorithms for cluster similarity. None handle all the edge cases perfectly. A measure that handles many commons use cases well and is not too expensive to run is called B-Cubed, named after initials taken from the names of its creators. Since initially proposed, others have published modifications to it.
The B-Cubed measure was proposed in this 1998 paper:
[1] A. Bagga and B. Baldwin. Entity-based cross-document coreferencing using the vector space model. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics - Volume 1, ACL โ98, pages 79โ85, 1998.
The following paper compared many clustering algorithms and found B-Cubed the best according to four formal constraints:
- Cluster Homogeneity
- Cluster Completeness
- Rag Bag
- Cluster Size vs quantity
[2] A comparison of Extrinsic Clustering Evaluation Metrics based on Formal Constraints by Enrique Amigo, Julio Gonzalo, Javier Artiles, Felisa Verdejo of the Departamento de Lenguajes y Sistemas Informaticos UNED, Madrid, Spain, May 11, 2009
The above is available here: http://nlp.uned.es/docs/amigo2007a.pdf
A subsequent paper identified a use case where B-Cubed fared poorly: unbalanced datasets where one cluster dominates:
[3] Adapted B-CUBED Metrics to Unbalanced Datasets by Jose G. Moreno and Gaรซl Dias, both of Normandie University in France.
This third paper proposed a refined version of B-Cubed, but the added complexity adds significantly to processing time, so those refinements are not employed here. The definition of the algorithm used here is taken from section 2.1 of this last paper, where it combines the Precision and Recall values into a single number using the F-measure formula (a harmonic average). (The refined version is in section 2.2.)
๐ฝ = F-measure (final similarity measure)
โ = Precision (a measure of homogeneity)
โ = Recall (a measure of completeness)
ฮฑ = Weighting factor (defaults to 0.5)
โ = Number of points
k = Number of categories (varies between the ฯ and ฯ* Clusterings)
i = category index
ฯแตข = cluster solution for the ith category
ฯ*แตข= gold standard for the ith category
gโ = tests whether two items share the same category in the clustering
g*โ= tests whether two items share the same category in the gold standard
๐ ฮฑ ๐ - ฮฑ
โโโโโ โ โโโโโ + โโโโโ
๐ฝ โ โ
bยณ bยณ bยณ
k
โ ๐ โฒ ๐ โฒ โฒ
bยณ โ โโโ โณ โโโโโ โณ โณ g*โ(xโฑผ,xโ)
โ i=1 |ฯแตข| xโฑผโฯแตข xโโฯแตข
k
โ ๐ โฒ ๐ โฒ โฒ
bยณ โ โโโ โณ โโโโโ โณ โณ gโ(xโฑผ,xโ)
โ i=1 |ฯ*แตข| xโฑผโฯ*แตข xโโฯ*แตข
( ๐ โบ โl:xแตขโฯโ โง xโฑผโฯโ
gโ(xแตข,xโฑผ) โ <
( ๐, otherwise
( ๐ โบ โl:xแตขโฯ*โ โง xโฑผโฯ*โ
g*โ(xแตข,xโฑผ) โ <
( ๐, otherwise
If seeing the triply-nested summation symbols makes you balk and say, "that will take forever to run", that was my reaction as well. The inner two loops perform a sort of second moment computation, trying to see what proportion of items that share the same cluster according to one scheme share the matching clusters in the other scheme, and vice versa. Here is an example.
1 2 3 4
A A B B
โโโโโโโโโโโ
1 Aโ โโ โ
2 Aโ โโ โ
3 Bโ โโ โ
4 Bโ โโ โ
โโโโโโโโโโโ
In this example:
- The first clustering has items 1,2,3,4 all in the same cluster.
- The second clustering puts 1&2 in cluster A and 3&4 in cluster B.
- The grid shows which cluster in the second clustering each pont goes, and pairs of points that continue to be clustered together are marked with checks (โ).
Measuring the effect of the second clustering on the first clustering's items consists in counting the checkmarks in the grid. The full B-Cubed measure normalizes the measure for each cluster by its cluster size.
The optimization that I found was that if two items end up in the same cluster, there will be four check marks, but if three match, there will be nine checks, etc. Thus if you had a list of all the categories in the second clustering that points map into and a count of how many items end up in each category, then you merely need to sum the squares of those counts.
I figured out another trick - you could do this in a single pass! Keeping a running sum of the squares of counts in each category, if you see a count was one and increase it to two, then you add the difference between one and four, which is three. If going from two to three, the increase in the sum of squares is 9 - 4 = 5. The difference is always 2 * previous_count + 1
.
Anyway, that is how I reduced a quadratic algorithm to a linear one.
The next functionality that will be implemented is Linkage analysis.