Initialize Error - For num of clusters >= 10 - KPrototypes errors with "Clustering algorithm could not initialize. Consider assigning the initial clusters manually."
asmitakulkarni opened this issue · 7 comments
Hi @nicodv - I am using k-prototypes to cluster data that has about 40 columns, most of them categorical and only 3 or 4 numeric.
The algorithm runs fine for a small number of clusters (<=5 is ok). However, if I increase the number of clusters to anything >=10, it gives me the initialization error. I'm not sure why the behavior changes when the number of clusters increases.
I think it encountered an empty cluster to begin with and couldn't initialize. Why is that an issue only at higher numbers of clusters? Could it be because the data is sparse?
Is there a workaround to this that I could use?
Thanks in advance for the help!
Your data might not warrant a larger number of clusters.
Run the algorithm for k=1, 2, 3, ..., 10 clusters and note the total cost at the end. If this stops decreasing, there is no need to increase k. It's called the elbow method; you can look it up.
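The sweep described above can be sketched roughly as follows. This is a minimal illustration, not code from the library: `cost_sweep`, `relative_drops`, and `kprototypes_cost` are hypothetical helper names, and the sketch assumes that the initialization failure surfaces as a `ValueError` and that the fitted model exposes its final cost as `cost_`.

```python
from typing import Callable, Dict, Iterable, Optional


def cost_sweep(fit_cost: Callable[[int], float],
               ks: Iterable[int]) -> Dict[int, Optional[float]]:
    """Call fit_cost(k) for each k and record the final cost.

    A k whose fit raises ValueError (assumed here to be how the
    initialization failure shows up) is recorded as None instead
    of aborting the whole sweep.
    """
    costs: Dict[int, Optional[float]] = {}
    for k in ks:
        try:
            costs[k] = fit_cost(k)
        except ValueError:
            costs[k] = None
    return costs


def relative_drops(costs: Dict[int, Optional[float]]) -> Dict[int, float]:
    """Fractional cost decrease between consecutive successfully fitted k."""
    fitted = [(k, c) for k, c in sorted(costs.items()) if c is not None]
    return {
        k2: (c1 - c2) / c1
        for (k1, c1), (k2, c2) in zip(fitted, fitted[1:])
        if c1 > 0
    }


def kprototypes_cost(X, categorical, k: int) -> float:
    """Hypothetical wrapper: fit KPrototypes once and return its cost."""
    # Imported lazily so the helpers above work without kmodes installed.
    from kmodes.kprototypes import KPrototypes

    model = KPrototypes(n_clusters=k, init="Cao", n_init=5, random_state=0)
    model.fit(X, categorical=categorical)
    return model.cost_
```

With your own data matrix `X` and list of categorical column indices `cat_idx`, something like `costs = cost_sweep(lambda k: kprototypes_cost(X, cat_idx, k), range(1, 16))` followed by `relative_drops(costs)` would show where the cost stops improving and at which k the fit starts failing.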
Facing the same issue with 10k rows and 70 columns. The cost decreases until K=12, and from K=13 onward it throws the error.
To be clear: this is a feature, not a bug. kmodes is telling you that what you are doing likely does not make sense given the data you're presenting it. And because every data set is different, it's up to you, the data scientist, to figure out why. :)
See relevant entry in FAQ: https://github.com/nicodv/kmodes/blob/master/README.rst#faq
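One cheap check to run before raising k, offered as a sketch and not as something the FAQ prescribes: count the distinct rows in the data. Identical rows always land in the same cluster, so with fewer distinct rows than clusters at least one cluster must stay empty. `distinct_row_report` is a hypothetical helper name, and the check is mostly informative for data dominated by categorical columns.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Sequence


def distinct_row_report(rows: Iterable[Sequence[Any]], k: int) -> Dict[str, Any]:
    """Compare the number of distinct rows with the requested cluster count."""
    # Tuples are hashable, so each row can be counted as a whole.
    counts = Counter(tuple(row) for row in rows)
    return {
        "n_rows": sum(counts.values()),
        "n_distinct": len(counts),
        # False means k non-empty clusters are impossible for this data.
        "enough_for_k": len(counts) >= k,
    }
```

Passing this check does not guarantee that initialization succeeds; it only rules out one case in which it cannot.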
Is there a workaround or do I need to patch it myself?
I am using the soybean large dataset from UCI, which has 15 classes in the target column. I am running a comparison between different clustering methods for a thesis, therefore I want to group into 15 clusters and compare with the actual targets...
Hello @nicodv, I'm running into this same issue and have read the FAQ, but would like to ask for your confirmation: if I understand correctly, does this error occur because, in the internals of the library, you perform the elbow method and automatically cut off the maximum n_clusters at the elbow?
If so, I agree with @jaanisfehling: I think it would be useful for users to be able to set the number of clusters independently of the achievable result, simply because in some use cases you might need a very specific number, even if it doesn't yield the best performance.
Thank you and regards!
@CarlaFernandez, the elbow method is not part of this library. The choice of k is entirely up to you.