dm13450/dirichletprocess

Initialization to one cluster with all data points.

Closed this issue · 2 comments

Hello,

Thank you for the package; it has been a great resource for me to learn about Dirichlet processes! :)

I have a quick question for you: if we initialize all the data points to one cluster, would it not be difficult for the data points to get out of that cluster, given that we weight the probabilities of cluster assignments by the number of points in each cluster? If so, would initializing every point as its own singleton cluster be useful?

Thanks again!
Matteo

Thanks for using my package and I'm glad it has been a help!

And yeah, you are correct in your thinking. You can think of putting all the points in one group as a local mode of the posterior density, so breaking out of that configuration depends on how flat the density is around that point. Lots of things can affect this flatness: the choice of prior, the alpha value, and how you have represented your data.
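To make the size weighting concrete, here is a minimal Python sketch of the prior (Chinese restaurant process) part of the assignment probabilities. It is only an illustration, not code from the package: the function name `crp_prior_weights` is made up, and in a real sampler these weights are multiplied by each cluster's likelihood for the point before normalizing.

```python
def crp_prior_weights(cluster_sizes, alpha):
    """Prior probabilities that one held-out point joins each existing
    cluster (proportional to its size) or opens a new one (proportional
    to alpha). `cluster_sizes` excludes the held-out point."""
    total = sum(cluster_sizes) + alpha
    existing = [n / total for n in cluster_sizes]
    new = alpha / total
    return existing, new


# 100 points all in one cluster: hold one out, 99 remain in the cluster.
existing_small_alpha, new_small_alpha = crp_prior_weights([99], alpha=1.0)

# Same configuration with a larger alpha: a new cluster is more likely a priori.
existing_large_alpha, new_large_alpha = crp_prior_weights([99], alpha=10.0)

print(new_small_alpha, new_large_alpha)
```

With everything in one cluster the prior strongly favours staying put, so whether a point leaves depends on how much better the likelihood of a new cluster is, which is where the prior choice and alpha come in.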

Starting each data point in its own cluster can have issues too. In high dimensions, clusters will struggle to appear, and in low dimensions it might take longer for the sampling to converge.

In fact, this gives you a check of convergence: start two Dirichlet processes, one with all points in one cluster and another with every point in its own cluster. If they reach the same answer, i.e. the same cluster assignments, you can be reasonably sure that the sampler has converged to the same region of the posterior.
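As a rough sketch of that check, here is a small, self-contained Python example. It does not use the package: it is a toy collapsed Gibbs sampler for a 1-D Gaussian mixture with known variance, and every name and parameter value in it (`gibbs_sweeps`, `same_partition`, `sigma2`, `tau2`, the data) is an assumption chosen for illustration.

```python
import math
import random


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def gibbs_sweeps(data, labels, alpha=1.0, sigma2=1.0, mu0=0.0, tau2=25.0,
                 sweeps=50, seed=0):
    """Toy collapsed Gibbs sampler: x ~ N(mu_k, sigma2), mu_k ~ N(mu0, tau2)."""
    rng = random.Random(seed)
    labels = list(labels)
    stats = {}  # cluster label -> [count, sum of points]
    for x, z in zip(data, labels):
        s = stats.setdefault(z, [0, 0.0])
        s[0] += 1
        s[1] += x
    next_label = max(stats) + 1
    for _ in range(sweeps):
        for i, x in enumerate(data):
            # Remove point i from its current cluster.
            s = stats[labels[i]]
            s[0] -= 1
            s[1] -= x
            if s[0] == 0:
                del stats[labels[i]]
            # Weight = cluster size (or alpha) times predictive density.
            keys, weights = [], []
            for k, (n, total) in stats.items():
                v = 1.0 / (1.0 / tau2 + n / sigma2)    # posterior variance of mu_k
                m = v * (mu0 / tau2 + total / sigma2)  # posterior mean of mu_k
                keys.append(k)
                weights.append(n * normal_pdf(x, m, v + sigma2))
            keys.append(next_label)
            weights.append(alpha * normal_pdf(x, mu0, tau2 + sigma2))
            z = rng.choices(keys, weights=weights)[0]
            if z == next_label:
                next_label += 1
            labels[i] = z
            s = stats.setdefault(z, [0, 0.0])
            s[0] += 1
            s[1] += x
    return labels


def canonical(labels):
    """Relabel clusters by order of first appearance."""
    first_seen = {}
    return [first_seen.setdefault(z, len(first_seen)) for z in labels]


def same_partition(a, b):
    """True if two label vectors group the points identically."""
    return canonical(a) == canonical(b)


# Two well-separated groups on a fixed grid (made-up data).
data = [-6.0 + 0.1 * j for j in range(20)] + [4.0 + 0.1 * j for j in range(20)]

from_one = gibbs_sweeps(data, [0] * len(data), seed=1)
from_singletons = gibbs_sweeps(data, list(range(len(data))), seed=2)

print("clusters from one-cluster start:", len(set(from_one)))
print("clusters from singleton start:", len(set(from_singletons)))
print("same final partition:", same_partition(from_one, from_singletons))
```

Comparing a single final draw from each run is crude, since individual draws can differ by a stray small cluster even after convergence. In practice you would compare summaries over many iterations, such as the distribution of the number of clusters or how often pairs of points are grouped together.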

Hope this helps!

Awesome, it really does help, thanks! 👍