BGU-CS-VIL/dpmmpython

Method for prediction given new data

Closed this issue · 11 comments

It would be desirable to have predict() and predict_proba() methods comparable to the corresponding methods in scikit-learn.

These methods would produce, respectively, the predicted label and the posterior probability of membership in each component (equivalently, the mixture over components) for new data. They would take as input a d-dimensional NumPy array, where d is the dimensionality of the data. predict() would output a scalar and predict_proba() would output a 1 x K NumPy array, where K is the number of components.
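For concreteness, this is the behaviour being requested, demonstrated with scikit-learn's own GaussianMixture (an illustration only; dpmmpython has no such methods yet):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Fit any mixture model, then query it on unseen points.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
gm = GaussianMixture(n_components=3, random_state=0).fit(X)

x_new = rng.normal(size=(1, 2))   # one new observation of dimension d=2
label = gm.predict(x_new)         # -> array([j]): hard component assignment
proba = gm.predict_proba(x_new)   # -> shape (1, K): membership probabilities
```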

Thanks, we intend to implement this feature in both the Julia package and the Python wrapper.

Yes, that should do it. Thanks.

This is a very quick and dirty implementation and should be refactored to do the computation in Julia with your mv_gaussian objects. I think the changes are obvious, but this was the quickest way to push working code (and to avoid spending too much time double-checking the math): https://github.com/nclarkjudd/dpmmpython/tree/predict

Looks good, just two things:

  1. I might be missing something, but you are not using the mixture weights in predict_proba?
  2. Working with the pdf might cause numerical errors, especially in higher dimensions; it is preferable to work with the log pdf.

  1. Whoops, yes, I skipped a rather important step there and have fixed it.
  2. Thanks for refreshing my memory on this. I've pushed a commit that begins to address it, but there are still unacceptable numerical errors. I'm sure there's a convenient way of factorizing the denominator in the equation for responsibilities to use log-sum-exp or a similar trick, but if I ever learned it, I've forgotten it (it's spelled out below).

I might as well just manipulate the terms of each pdf directly rather than seeking a lazy solution prone to floating point errors.
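For the record, the factorization in question is the standard log-sum-exp shift (a textbook identity, not code from the branch). With $\ell_k = \log \pi_k + \log \mathcal{N}(x \mid \mu_k, \Sigma_k)$, the responsibilities are

$$
\log r_k(x) = \ell_k - \log\sum_j e^{\ell_j}, \qquad
\log\sum_j e^{\ell_j} = m + \log\sum_j e^{\ell_j - m}, \quad m = \max_j \ell_j,
$$

so every exponentiated term lies in $(0, 1]$ and the denominator can neither underflow to zero nor overflow.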

I'd like to use invChol from mv_gaussian.jl.

More precisely, the question is: for some component k, is det(invChol)^2 equal to det(Σ), or to the determinant of the inverse of Σ?

It turns out scipy.special.logsumexp takes a weight argument (b) that scales the exponentials inside the sum, so manipulating the pdfs directly (rather than going through scipy.stats.multivariate_normal) is easier than I thought.

This should take care of all underflow problems.
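Concretely, the trick in isolation (a minimal standalone sketch with made-up numbers and hypothetical array names, not the code on the branch):

```python
import numpy as np
from scipy.special import logsumexp

# log_pdfs holds per-component log-densities for each point,
# weights holds the mixture weights.
log_pdfs = np.array([[-3.1, -0.4],
                     [-1.2, -5.7]])   # shape (n, K)
weights = np.array([0.3, 0.7])        # shape (K,)

# log p(x_i) = log sum_k w_k exp(log_pdfs[i, k]); the b= argument
# applies the weights without ever leaving log space.
log_evidence = logsumexp(log_pdfs, b=weights, axis=1, keepdims=True)

# Responsibilities: exponentiate only once, at the very end.
resp = np.exp(np.log(weights) + log_pdfs - log_evidence)
```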

det(invChol)^2 == det(invΣ)
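For illustration, that identity gives the log-determinant needed by the Gaussian log pdf straight from the diagonal (a sketch, assuming invChol is a triangular factor, which the identity above implies):

```python
import numpy as np

def log_det_sigma(inv_chol):
    # det(inv(Sigma)) = prod(diag(invChol))^2, hence
    # log det(Sigma) = -log det(inv(Sigma)) = -2 * sum(log(diag(invChol))).
    return -2.0 * np.sum(np.log(np.diag(inv_chol)))
```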

Yes, log sum exp should do the trick.

So it turns out that there's a different way of doing things that is probably more numerically stable. The "pure Python" implementation here works: https://github.com/nclarkjudd/dpmmpython/tree/master/dpmmpython

It's implemented as a new class that inherits from DPMMPython because it is more opinionated about what to save from the model results. I wanted to offer something very stand-alone but also very Pythonic.

I'm happy to submit a pull request if you want to pull it in as-is, but I expect you may want to refactor this. If I were you, I would do the math in Julia and then pass the results to the Python user, among other changes.
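For a sense of the shape, here is a paraphrased sketch of that design, not the branch's actual code (all attribute names are hypothetical): the subclass keeps only what prediction needs — mixture weights, means, and a triangular factor of each precision matrix.

```python
import numpy as np
from scipy.special import logsumexp
from dpmmpython.dpmmwrapper import DPMMPython

class PredictiveDPMM(DPMMPython):  # hypothetical name
    def __init__(self, weights, means, inv_chols):
        self.weights = np.asarray(weights)      # shape (K,)
        self.means = np.asarray(means)          # shape (K, d)
        self.inv_chols = np.asarray(inv_chols)  # shape (K, d, d)

    def _component_log_pdfs(self, X):
        # log N(x | mu_k, Sigma_k) for every row of X and component k,
        # using the precision factors so no covariance is inverted here.
        n, d = X.shape
        out = np.empty((n, len(self.weights)))
        for k, (mu, L) in enumerate(zip(self.means, self.inv_chols)):
            y = (X - mu) @ L  # rows give L^T (x - mu); squared norm = Mahalanobis
            half_log_det = np.sum(np.log(np.diag(L)))  # 0.5 * log det(inv(Sigma))
            out[:, k] = half_log_det - 0.5 * (np.sum(y * y, axis=1)
                                              + d * np.log(2.0 * np.pi))
        return out

    def predict_proba(self, X):
        log_r = np.log(self.weights) + self._component_log_pdfs(X)
        return np.exp(log_r - logsumexp(log_r, axis=1, keepdims=True))

    def predict(self, X):
        return np.argmax(self.predict_proba(X), axis=1)
```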

Thanks, I will leave this issue open for now. I do intend to implement something similar in Julia and pass the results to Python, as you suggest. So no need to create a pull request at the moment.

Finally had the time to add this.
The predict method is included in the Julia package version 0.1.12 (which will be updated in the package repository soon).
predict uses the predictive posterior for both the clusters and the weights. The Python interface has been updated as well, and an example will be added to the README.
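In symbols, my reading of that description (a gloss, not a quote from the implementation): a new point is scored under each component's posterior predictive density, weighted by the posterior mixture weights,

$$
p(z_{\mathrm{new}} = k \mid x_{\mathrm{new}}, X) \;\propto\; \mathbb{E}[\pi_k \mid X] \int p(x_{\mathrm{new}} \mid \theta_k)\, p(\theta_k \mid X)\, d\theta_k,
$$

where for a NIW prior the integral has a closed form as a multivariate Student-t density.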