comparative analysis of clustering performance #37
stat-hejia opened this issue · 3 comments
Thanks for the reply. (#37) I downloaded some PBMC data as my benchmark data set, with the cell types as my true labels. As you say, ARI and Silhouette are unsuitable here. But because of the diversity, there seem to be too many clusters to evaluate. I don't know how to use the measures you recommend, or how many clusters (k) my result has. I followed the link you provided, but I have only learned R and Python, so I can't understand purity.hs. Do you provide R or Python code, or another way, for my reference?
I don't have R or Python code for this process, but I'm sure there are many libraries out there to measure entropy, purity, and NMI. I don't know what you mean by too many to evaluate. If that is a problem, you can use different cutoffs to control how far the tree goes, and the evaluation should all be automated in any case.
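For reference, here is a minimal Python sketch of the three measures, using only the standard library. It is not a port of purity.hs and may differ from it in details. The function names, the toy labels, and the choice of arithmetic-mean normalization for NMI are all assumptions made for illustration. It expects two equal-length sequences: the true cell type and the assigned cluster for each cell.

```python
from collections import Counter
from math import log


def _group_by_cluster(true_labels, cluster_ids):
    """Collect the true labels that fall into each cluster."""
    groups = {}
    for label, cluster in zip(true_labels, cluster_ids):
        groups.setdefault(cluster, []).append(label)
    return groups


def entropy(labels):
    """Shannon entropy (natural log) of a sequence of labels."""
    n = len(labels)
    return -sum((k / n) * log(k / n) for k in Counter(labels).values())


def purity(true_labels, cluster_ids):
    """Fraction of cells that carry the majority true label of their cluster."""
    groups = _group_by_cluster(true_labels, cluster_ids)
    majority = sum(max(Counter(g).values()) for g in groups.values())
    return majority / len(true_labels)


def conditional_entropy(true_labels, cluster_ids):
    """H(true | cluster): size-weighted mean label entropy inside each cluster."""
    n = len(true_labels)
    groups = _group_by_cluster(true_labels, cluster_ids)
    return sum(len(g) / n * entropy(g) for g in groups.values())


def nmi(true_labels, cluster_ids):
    """Mutual information divided by the mean of the two entropies (an assumed choice)."""
    h_true = entropy(true_labels)
    h_cluster = entropy(cluster_ids)
    mutual_info = h_true - conditional_entropy(true_labels, cluster_ids)
    denom = (h_true + h_cluster) / 2
    return mutual_info / denom if denom > 0 else 1.0


# Hypothetical toy example: six cells, three cell types, three clusters.
true_labels = ["B", "B", "T", "T", "NK", "NK"]
cluster_ids = [0, 0, 1, 1, 1, 2]

print("purity :", purity(true_labels, cluster_ids))
print("entropy:", conditional_entropy(true_labels, cluster_ids))
print("NMI    :", nmi(true_labels, cluster_ids))
```

If scikit-learn is available, `sklearn.metrics.normalized_mutual_info_score` computes NMI directly and can serve as a cross-check.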
Do you mean that when I compare clustering performance and scalability, I cannot use the leaves as my final clusters? Maybe I need to look at the resulting tree to see where the splitting was best, then select that level as my final clusters?
You can use the leaves, but be aware that there are many statistically based ways to prune (new leaf definitions) and explore the tree at nodes closer to the root.
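As a rough illustration of exploring the tree closer to the root, the sketch below cuts a tree at several depths and scores each cut. It assumes a hypothetical representation in which every cell carries its path from the root as a tuple of branch choices (0 = left, 1 = right). This is not the tool's actual output format, and the depth cutoff here is a simple stand-in for the statistically based pruning mentioned above.

```python
from collections import Counter, defaultdict


def cut_tree(paths, max_depth):
    """Truncate each root-to-leaf path so nodes at max_depth become the leaves."""
    return [tuple(path[:max_depth]) for path in paths]


def purity(true_labels, cluster_ids):
    """Fraction of cells that carry the majority true label of their cluster."""
    counts = defaultdict(Counter)
    for label, cluster in zip(true_labels, cluster_ids):
        counts[cluster][label] += 1
    return sum(max(c.values()) for c in counts.values()) / len(true_labels)


# Hypothetical toy data: one path and one true cell type per cell.
paths = [(0, 0), (0, 0), (0, 1), (1, 0, 0), (1, 0, 1), (1, 1)]
true_labels = ["B", "B", "T", "NK", "NK", "Mono"]

# Score the clustering obtained at each depth cutoff.
summary = []
for depth in range(4):
    clusters = cut_tree(paths, depth)
    summary.append((depth, len(set(clusters)), purity(true_labels, clusters)))

for depth, k, score in summary:
    print(f"depth={depth}  k={k}  purity={score:.3f}")
```

Purity can only stay the same or rise as the cut moves toward the leaves, so it is worth reading it alongside a measure such as NMI that also accounts for the number of clusters.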