dm13450/dirichletprocess

Restart the chain from where it stops

Closed this issue · 8 comments

Hi Dean,
I was wondering if in the package, there is a function for restart the fitting of a DPMM object after it stops. For example, I fit a DPMM with 3000 iterations, then if I want to run the chain longer, I restart the chain from where it stopped, so the iterations number 3000.

To do so create a new function Fit2, that is very similar to Fit but take as input an already runned chain:

Fit2<- function(dpObj,its_start ,its_finish, updatePrior = FALSE, progressBar = interactive()) {
  
  if (progressBar){
    pb <- txtProgressBar(min=its_start, max=its_finish, width=50, char="-", style=3)
  }
  
  alphaChain <- dpObj$alphaChain
  likelihoodChain <- dpObj$likelihoodChain
  weightsChain <- dpObj$weightsChain
  clusterParametersChain <- dpObj$clusterParametersChain
  priorParametersChain <-   dpObj$priorParametersChain 
  labelsChain <-   dpObj$labelsChain 
  
  iteration <- its_start : its_finish
  for (i in iteration) {
    
    alphaChain[i] <- dpObj$alpha
    weightsChain[[i]] <- dpObj$pointsPerCluster / dpObj$n
    clusterParametersChain[[i]] <- dpObj$clusterParameters
    priorParametersChain[[i]] <- dpObj$mixingDistribution$priorParameters
    labelsChain[[i]] <- dpObj$clusterLabels
    
    
    likelihoodChain[i] <- sum(log(LikelihoodDP(dpObj)))
    
    dpObj <- ClusterComponentUpdate(dpObj)
    dpObj <- ClusterParameterUpdate(dpObj)
    dpObj <- UpdateAlpha(dpObj)
    
    if (updatePrior) {
      dpObj$mixingDistribution <- PriorParametersUpdate(dpObj$mixingDistribution,
                                                        dpObj$clusterParameters)
    }
    if (progressBar){
      setTxtProgressBar(pb, i)
    }
  }
  
  dpObj$weights <- dpObj$pointsPerCluster / dpObj$n
  dpObj$alphaChain <- alphaChain
  dpObj$likelihoodChain <- likelihoodChain
  dpObj$weightsChain <- weightsChain
  dpObj$clusterParametersChain <- clusterParametersChain
  dpObj$priorParametersChain <- priorParametersChain
  dpObj$labelsChain <- labelsChain
  
  if (progressBar) {
    close(pb)
  }
  return(dpObj)
}

I think it only work for the Fit.default.
Let me know if there is another way already implemented in the package, thank you.

What about if we call it FitContinue?
And should all the chains be preallocated to the correct length, rather than growing them with each iteration?

FirContinue is fine.
If I understand the second question correctly, the answer is no, or at least, I haven't done it but it still work.
For example, if you run the following code it work, even if the preallocated length is 100 and not 101:

its <- 100
my_chain <- numeric(its)
print(my_chain)
my_chain[101] <- 10
print(my_chain)

But if you want I can preallocate the correct length, it is not a problem.

I think there is a performance hit if you don't preallocate, as R will have to keep copying the array to make it bigger each time, so preallocating will make things faster.

Yeah, it is faster your way, I will change it.

Cheers mate, raise the PR and we can get that added in

Sorry, I was busy, tonight I will raise the pull request.
I have another question, I fitted 3 DPMM on a data set, each with random initialization of the number of clusters. They converged, now to estimate the posterior assignment, is there a function in the package that enables me to do that?
I search online and the most used is the MAP, but since I only really need the clusters assignments, I thought that I could just take the mode for each point, in the dp$clusterLabels. It makes sense to do this?
In case it is wrong, and your package haven't a function that implements the correct way, do you know another package that could help me?

Taking the mode makes sense or you can use ClusterLabelPredict in the package which will give a you posterior sample of the probable posterior assignment, if you do that multiple times and take the average it will be more Bayesian than just the MAP.

Thank you again for all the help.