dm13450/dirichletprocess

clustering high-dimensional data?

Opened this issue · 6 comments

Hi @dm13450 I am trying to get dirichletprocess working for clustering a high-dimensional data set.
For example https://web.stanford.edu/~hastie/ElemStatLearn/datasets/zip.train.gz has 256 features.
Using dirichletprocess::DirichletProcessMvnormal would result in a 256 x 256 covariance matrix per cluster, right?
This results in VERY SLOW inference on my computer.
One way to speed that up would be to use a constrained covariance matrix, say spherical.
Is that something I should implement myself, or is there an existing or recommended way to accomplish this?
Thanks
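
For a sense of scale, here is a rough base R sketch of the parameter counts involved and of a spherical log density that never forms a 256 x 256 matrix. This is not part of dirichletprocess; `log_dspherical` is a hypothetical helper and the data are simulated.

```r
# Free covariance parameters per cluster for d features.
d <- 256
n_full      <- d * (d + 1) / 2  # symmetric d x d matrix
n_diagonal  <- d                # one variance per feature
n_spherical <- 1                # a single shared variance

# Log density of N(mu, sigma2 * I) for every row of x.
# Hypothetical helper, written only to illustrate the idea.
log_dspherical <- function(x, mu, sigma2) {
  d <- ncol(x)
  sq_dist <- rowSums(sweep(x, 2, mu)^2)  # squared distance of each row to mu
  -0.5 * (d * log(2 * pi * sigma2) + sq_dist / sigma2)
}

set.seed(1)
x  <- matrix(rnorm(5 * d), nrow = 5)  # simulated data, 5 observations
ll <- log_dspherical(x, mu = rep(0, d), sigma2 = 1)
```

The cost per observation is linear in the number of features, whereas a full covariance matrix needs a factorisation that is cubic in it.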

Yeah this is a problem with the multivariate normal mixtures.
I use the mvtnorm package for the multivariate density and draws. Its functions aren't vectorised, so some performance is lost there. I've briefly looked at other packages, but haven't found a drop-in replacement yet.
Your constrained covariance matrix is a good idea. I've no experience with that, though, so I would have to let you implement it!
Alternatively, you could replace the lapply/vapply functions in mvnormal_normal_wishart.R with mclapply and throw some more cores at the problem.
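
A minimal sketch of that swap, assuming a per-cluster computation that is independent across clusters. `cluster_loglik` and `params` are hypothetical stand-ins, not the actual code in mvnormal_normal_wishart.R.

```r
library(parallel)

# Hypothetical stand-in for an expensive per-cluster computation.
cluster_loglik <- function(p, x) {
  sum(dnorm(x, mean = p$mu, sd = p$sd, log = TRUE))
}

set.seed(1)
x <- rnorm(1000)
params <- list(list(mu = 0,  sd = 1),
               list(mu = 2,  sd = 0.5),
               list(mu = -1, sd = 2))

# mclapply forks worker processes; forking is unavailable on Windows,
# where mc.cores must be 1.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

serial <- lapply(params, cluster_loglik, x = x)
forked <- mclapply(params, cluster_loglik, x = x, mc.cores = n_cores)
```

Note that `mclapply` always returns a list, so a `vapply` call would need something like `unlist()` around the result. Forking also has overhead, so this probably only pays off when each per-cluster call is expensive.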

Hey again. I was thinking this may be an interesting project for a GSOC student next year. Would you be interested in writing a project proposal and mentoring? https://github.com/rstats-gsoc/gsoc2022/wiki/table%20of%20proposed%20coding%20projects

Hey thanks for bringing this up. Am I right in understanding that there would need to be another mentor for me to submit this as a project?
If so, is there a way for me to find another mentor to pair up with?

If you have a co-developer for this R package, they would be a good choice of co-mentor. Otherwise I could do it (and guide you through the GSOC process, because I have done that many times).

Done. https://github.com/rstats-gsoc/gsoc2022/wiki/Improving-the-performance-of-multivariate-normal-models-in-dirichletprocess

Comments/feedback appreciated. Is there anything else you need me to do?

Looks like a good start, thanks for contributing!
You may want to revise/clarify the medium and/or hard tests (they seem a little vague to me; you may want to add details about what kinds of code / plots / etc. you would expect). I have found that the most useful tests are those which require the same skills as you would expect during GSOC, so you may want to add another test about mc2d or diagonal covariance matrices. (You can have more than three tests, for example Medium 1, Medium 2, etc.)
Also, I'm not sure there would be enough time, but it may be useful to write that our goal is to support all of the different kinds of spherical and diagonal covariance matrices as in mclust, https://rdrr.io/cran/mclust/man/mclustModelNames.html:

"EII"    spherical, equal volume
"VII"    spherical, unequal volume
"EEI"    diagonal, equal volume and shape
"VEI"    diagonal, varying volume, equal shape
"EVI"    diagonal, equal volume, varying shape
"VVI"    diagonal, varying volume and shape