Cluster of size 0 (with EM algorithm)
bastian-wur opened this issue · 8 comments
Hi everyone,
currently I'm trying to cluster some artificial data. While I'm horribly failing at it (due to noise, I think), I stumbled about some weird occurrence in one of my results.
I clustered my data with the EM algorithm. In multiple instances I get clusters of size 0.
How can that happen?
I don't think this is intended behavior, right?
I've attached 2 results, EM algorithm with k set to 7 and 8, with the former having 4 empty clusters and the latter 1 empty cluster.
I've also attached the input table which I've used to get these results.
My Elki version is 7.2 from December 4, cloned from here.
It's running with java version "1.8.0_144" on an Ubuntu 14.04 (cannot update that, it's a cluster).
Hope someone can have a look at this :).
Regards,
Bastian
EM7.tar.gz
EM8.tar.gz
absolute_counts_per_contig.csv.percentages_per_row.tar.gz
This is the intended and correct behavior in certain cases.
EM is a soft clustering algorithm. The mapping to discrete cluster assignments always loses information. If a point is 0.51 in cluster A, and 0.49 in cluster B, then cluster B can appear to be empty in a hard assignment, when it in fact explains .49% of the total distribution. You can set a flag to retain the soft assignment, but you'll need to handle this in your own code.
It is typical for EM to find one cluster with a low density and high variance if there is noise in the data. Because that is the only way to explain the background noise with a Gaussian model. If you have a lot of duplicates, the default initialization strategy may also be unable to separate these clusters, as they will always produce the exact same densities. K-means++ initialization will then work better.
But there may be an issue in the current EM, the results don't look quite right to me, although the unit tests sill pass. But I don't have time to further investigate this myself. So please also try an earlier version, such as the latest release (there is no 7.2 yet, you are using the github head) and report if that works better for you.
The numerics change that I was thinking of wasn't committed to the main branch yet. So it cannot have introduced a new issue yet.
It would be cool if you could investigate this further (does it work better with k-means++ initialization, does it also occur in release 0.7.1, or is there a good way to avoid that EM also learns "background" clusters with high variance but low density - which makes sense from the need to explain background noise in noisy data). Thank you.
For what exactly should I be looking for in the results? (or what exactly do you think looks wrong right now?)
I just ran the same data with 0.7.1 and k-means++ initialization, no difference, getting again 4 empty clusters for k=7. Have not checked for all different k, but also k=8 contains again empty clusters.
Some more info about the data:
In theory there shouldn't be any clusters with high variance but low density. In theory all should be dense with low variance (although for the variance I cannot be sure, since my data is inherently noise; but in theory it shouldn't).
The data matrix is also rather dense, only 2 of the 10 clusters have 0 values in 1 and 2 of the 5 dimensions. Some of the clusters are probably hard to separate, since they only contain differences in 1 out of the 5 dimensions.
I am referring to the EM models, not so much to your data.
If you have noisy data, I believe it is expected behavior of EM to use some "clusters" to explain the background noise. E.g., if you have most data from a standard normal N(0;1), but e.g. 10% of the data is uniform on [-5;5]. Then a two-Gaussian model will likely have one Gaussian like N(0;1.1) for >95% of the data, and a second Gaussian with a larger variance and <5% of the data, because it cannot explain the uniform background except by a high variance, low density "background" Gaussian. Because they only account for 5% of the mass, they may end up appearing as 0 point clusters, but they do contribute to the soft explanation of the model as being a bit heavier tailed than a pure normal distribution.
For example, if I increase the density of the "mouse" data set by a factor of 5 and use k=4, I get the following clusters:
The green cluster is not empty, but a low-density, high-variance cluster that contains the noise points surrounding the mouse. This is a very good result (ARI 0.9629), with k=3 I only get ARI 0.9530 because it does not put the noise points into a separate cluster, as labeled in the generated data.
But in this example, the noise cluster is still dense enough to be non-empty. It best explains the outside points.
In general, I believe one cannot guarantee EM clusters to be non-empty. When fitting a two-gaussian model on a single Gaussian distribution (with outliers removed), I would expect one Gaussian to eventually become 0. But I am not sure if this should happen on your data.
I experimented a bit, and there is some numeric issue arising from the data. Maybe because of duplicate attribute values (0.0 and exactly 0.20 are quite common). But it shouldn't break EM that easily.
A workaround is to add a tiny error to each value. For example, using the filter
-dbc.filter transform.PerturbationFilter -perturbationfilter.percentage 0.0001
and you appear to get a much more plausible result. This seems to be an instance of the so-called "singularity" issue with Gaussian Mixture Modeling, that we apparently do not yet handle correctly. :-(
But there might be no good solution, in fact. Consider the following case: one cluster gets all the points with x2=0, and nothing else. This component then has variance 0, and the multivariate Gaussian degenerates to a "distribution" that has infinite height and zero width. By adding random noise with the perturbation filter, you can prevent this from happening.
I am closing the bug because the literature on EM clustering indicates that such instabilities are to be expected with Maximum-Likelihood Estimation. See, e.g., the textbooks by Christopher Bishop: "Pattern Recognition and Machine Learning" and Kevin P. Murphy "Machine Learning: A Probabilistic Perspective".
I hope that in the next few days I will have time to push some commits to the master branch that will
- warn if they detect an instability
- add a MAP estimation to EM, that is more robust (but needs more parameters, I still need to work on the parameterization options, while I have a working code that handles your data okay)
- maybe automatically switch to MAP estimation if they detect such an instability.
Nevertheless, the PerturbationFilter approach is so far, in my opinion, giving the best results on your particular data set. Because this extra little bit of deviation added solves the cause of the instabilities (that originate from your synthetic data simulation process) originating from exact duplicate attribute values. For the purpose of clustering such data, adding a small perturbation (aka "jitter") to the data is a very reasonable approach (albeit jittering is most common to reduce overplotting in scatterplots).
P.S. on normalized histograms, divergence based distance functions such as χ² may be more meaningful than anything Euclidean like EM. I would suggest to try
-algorithm clustering.hierarchical.extraction.ClustersWithNoiseExtraction
-algorithm AnderbergHierarchicalClustering
-algorithm.distancefunction probabilistic.HellingerDistanceFunction
-hierarchical.linkage CompleteLinkage
-extract.k 8 -extract.minclsize 10
which, in my opinion, gives much better results on this data.
Sorry for replying so late, was very busy.
Yes, it seems this algorithm is not suitable for what I want. I used it only because I know it's a very commonly used algorithm, and did not read up about how it works (in contrast to e.g. k-means or dbscan, where I have a good idea). Your explanation makes very much sense when I look at my data, I now see how it happens.
I prefer to not use hierarchical algorithms, because my data is normally way too big to make a sense out of the hierarchy, although I could maybe try it (but that's rather on the bottom of the list).
Thanks for the help :).