drisso/zinbwave

Unsuccessful parameter recovery & unexpected distribution of estimated W values

gbiele opened this issue · 3 comments

I have difficulties when trying to recover parameters used to simulate data with zinbSim.
zinbFit does not recover W as well as I would have thought, that is not better than a simple PCA applied to the same (scaled) count data. This figure (link) shows the distributions of correlations of true and estimated W for 100 simulated data sets, were N = 1000, K = 2, J = 9 and fitting was done with default epsilon.
Increasing epsilon (up to 1000, larger values would not work as discusses here) somewhat improved the results in that fitted values in W looked more like coming from a normal distribution (with default epsilon values the W values sometimes looked like coming from a half normal).
Still, one unexpected pattern is that the average component score is generally around .9 for the first component and around 0 for the second component (example). I had expected components to have zero means.

My questions are:
Is what I am observing a general problem, or should I be able to get better results with by adjusting parameters that specify how the fitting is done? If so, which parameter settings are most likely to improve fit?
Maybe the expected difference between zinbwave and a PCA is just not big if one sets, as I have done, which_X_mu, which_X_pi, and which_V_pi to integer(0)?

Thanks in advance!
Guido

Code for simulation:

library(psych)
library(zinbwave)

N = 1000
K = 2
J = 9

S = 100
sims = vector(length = S, mode = "list")

for (s in 1:S) {
  set.seed(s)
  W = matrix(rnorm(N*K),ncol = K) + 3
  V = matrix(sort(rep(log(c(25,250,1000)),3)), nrow = J)
  X = matrix(1, ncol = 1, nrow = N)
  
  alpha_mu = matrix(rnorm(K*V),ncol = J, nrow = K)/15
  alpha_pi = matrix(rnorm(K*V),ncol = J, nrow = K)
  
  # generate alpha_pi values that result
  # in desired zero inflations for oberved data  
  mup = rep(c(-4,-2,0),3)
  for (v in 1:J) {
    y = scale(W %*% alpha_pi[,v]) + mup[v]
    lmfit = lm(y ~ -1 + W)
    alpha_pi[,v] = coefficients(lmfit)
  }
  
  mm = zinbModel(X = X,
                 V = V,
                 which_V_mu = 1L,
                 which_X_mu = integer(0),
                 which_X_pi = integer(0),
                 which_V_pi = integer(0),
                 W = W,
                 gamma_mu = matrix(1,nrow = 1,ncol = N),
                 alpha_mu = alpha_mu,
                 alpha_pi = alpha_pi,
                 zeta = log(rep(5,J)))
 my_counts = zinbSim(mm,seed = s)$counts
  
  
 seY = SummarizedExperiment(assays = list(counts=my_counts),
                             rowDat = data.frame(V),
                             colDat = data.frame(X))
  
 sf = zinbFit(seY,
               V = "~ -1  + V",
               K = 2,
               stop.epsilon.optimize = 1e-06,
               nb.repeat.initialize = 10,
               maxiter.optimize = 25,
               which_X_mu = integer(0),
               which_X_pi = integer(0),
               which_V_pi = integer(0),
               epsilon = 1000)

  sims[[s]] = list(seed = s,
                   W = W, V = V, X = X, alpha_mu = alpha_mu, alpha_pi = alpha_pi,
                   zinbModel = mm,
                   zinbSim = tmp,
                   seY = seY,
                   zinbFit = sf,
                   pca_fit = principal(t(my_counts),nfactors = 2,scores = T))
  save(sims,file = "zinb_sims.Rdata")
} 

I am seeing now that I maybe caused the problem myself by simulating W values that do not have zero means in the first place!
I'll try what happens if I simulate W with zero mean and instead allow an influence of V on pi.

I have now tried to recover W for 100 data sets where the generative model is in line with the zinbwave model, i.e. columns in W are normally distributed and have a mean of 0.

The good news is that if I specify a reasonable generative model (where columns in W have mean of zero) zinbwave on average recovers W very well and better than a simple PCA (see here). (I imagine that if one makes the generative model more complex zinbwave’s example would increase.)

The bad news is that zinbwave occasionally fails badly in recovering W, without giving any indication. See here for examples where zinbwave fails, and here for all correlations between true and estimated W.

One characteristic that failed fits appear to share is that they result bimodal distribution of at least one of the estimated columns in W.

Is it possible that this has something to do with how penalization constrains the formulation of matrices W? I made simple checks to see if changing epsilon would improve the fit, but this did not help.

My basic questions are, what approaches there could (a) reliably help to detect failed fits in when one does not know the true W values and (b) what options there are to improve model fit.

Thanks for the thorough benchmark! This is very useful!

One thing that would be useful to try to explore this behavior is the exact list of parameters that you're using. Or are these results related to the code you posted above?

If that's the case, one possible issue could be the small value of J. When estimating the gamma coefficients related to the V covariates, J is essentially the sample size. And a sample size of 9 could be not enough to correctly estimate the parameters in certain cases.

I wonder what happens if you remove the covariates in V and/or if you increase J.

An alternative explanation could be related to W itself. Perhaps there is (randomly) a correlation between the simulated V and the simulated alpha? Or between the simulated alpha and the simulated beta? In other words, it could be worth exploring in more details the simulations that didn't work to see if you see a pattern.