dsteinberg/libcluster

Evaluate new datapoint on model?

Meekohi opened this issue · 5 comments

After fitting, is there a way to evaluate the probability of a new datapoint having come from the model?

Good question! It depends on which algorithm you are using. If you are using one of the mixture models (VDP, BGMM, etc.) it is pretty straightforward. I can give you an example / add a function to this library -- it is very much like naive Bayes. Are you using C++, Python or MATLAB?

Unfortunately the hierarchical models (G-LDA, SCM, MCM, etc.) are less straightforward because of the nature of their hierarchy. But it is still possible.
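Roughly, for the mixture models the quantity you want is the expected predictive density of the new point under the fitted clusters. Writing w_k for the mixture weights and mu_k, Sigma_k for the cluster parameters (just a sketch of the idea, not the exact library calls), this is approximately

p(x*) ≈ sum_k exp(E[log w_k]) * exp(E[log N(x* | mu_k, Sigma_k)])

i.e. a weighted sum of the per-cluster expected likelihoods, which is why it resembles naive Bayes.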

Thanks! I'm using BGMM and the C++ API directly. Currently I'm doing:

learnBGMM(trainingData, fgqZ, bgweights, bgclusters, PRIORVAL, false);
MatrixXd bgw = bgweights.Elogweight().exp(); // expected mixture weights
MatrixXd c(1, 3);
c << 1, 2, 3; // or whatever
double bgP = 0;
for(int i = 0; i < bgclusters.size(); i++)
{
  MatrixXd match = bgclusters[i].Eloglike(c);
  bgP += bgw(i)*exp(match(0)); // weighted sum of per-cluster likelihoods
}

but I'm not sure this is correct (and I'm sure there is a more elegant way to write it with Eigen, but I haven't figured it out).

Very close! Here is a version that handles multiple observations, and computes both the likelihood of those observations and their expected cluster labels.

I've also made sure it is numerically stable, since we usually get tiny probabilities for high dimensional Gaussians.

learnBGMM(trainingData, fgqZ, bgweights, bgclusters, PRIORVAL, false);

// Your test data (many observations)
MatrixXd X(N, D);
X << ... some data here ...;

// Initialise label and likelihood arrays
VectorXd Z(N);
MatrixXd logPk(N, bgclusters.size());

// Loop through the clusters to get per-cluster log-likelihood
for(int i = 0; i < bgclusters.size(); i++)
{
    // log(w) + log(N(c|mu_i, cov_i)) -- do this in log-space for numerical stability
    logPk.col(i) = bgweights.Elogweight()[i] + bgclusters[i].Eloglike(X).array();
}

VectorXd logP = logsumexp(logPk); // This is also for numerical stability, this
                                  // function lives in probutils.h

VectorXd P = logP.array().exp(); // likelihood for each observation, these may be
                                 // REALLY small values! Usually it's better to use log

// Get labels -- There may be a more vectorised way, check the eigen API ...
for(int n = 0; n < Z.size(); n++)
{
    int z;
    logPk.row(n).maxCoeff(&z);
    Z[n] = z;
}

I haven't compiled this, so you may need to do a bit of syntax checking. Let me know if this works for you or if you have any more issues.
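In case it helps to see what the stabilisation is doing, here is a minimal standalone sketch of a row-wise log-sum-exp in Eigen. This is just an illustration of the trick (the name logsumexp_rows is mine) -- use the version in probutils.h with the library, this is not its implementation:

#include <Eigen/Dense>
using Eigen::MatrixXd;
using Eigen::VectorXd;

// Row-wise log-sum-exp: for each row of logPk, compute log(sum_j exp(logPk(i, j)))
// without underflowing when the entries are very negative.
VectorXd logsumexp_rows(const MatrixXd& logPk)
{
    VectorXd maxs = logPk.rowwise().maxCoeff();  // per-row maximum
    MatrixXd shifted = logPk.colwise() - maxs;   // subtract it, so the largest term is exp(0) = 1
    VectorXd sums = shifted.array().exp().rowwise().sum().matrix();
    return maxs + sums.array().log().matrix();   // add the maximum back in log-space
}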

Looks successful, thanks a bunch for the help @dsteinberg!

You're welcome!