kgori/sigfit

Address model selection for multiple extraction

Closed this issue · 3 comments

Although the current version of extract_signatures() uses LOOIC (loo::loo) in order to choose the "best" number of signatures, when using a vector of nsignature values, it does not seem to be as strict as BIC. For the 119 breast cancers data set, with a range of 4–8 signatures, LOOIC recommends 8 signatures (the larger the number of signatures, the lower the LOOIC). It also normally complains that almost 100% of the Pareto K estimates are too high. According to the Stan people, in this case one should go for K-fold cross validation instead of LOO: https://groups.google.com/forum/#!topic/stan-users/E84lzW7Zqm0

Another option is computing BIC/AIC manually for the log_lik coming from each iteration and taking an average estimate.

I've tried making approximations to BIC and AIC (both probably wrong) using the average/maximum log_lik, and they always recommend the lowest number of signatures. Another problem is that the current log_lik in the models is returning a single log likelihood value for each sample, not 96. Perhaps this is behind this issue.

kgori commented

I also find that LOOIC tends towards higher numbers of signatures. I don't know anything about k-fold cross validation in Bayesian estimation, beyond skimming the thread you linked to. Maybe I'll read Vehtari (2015) and see if it's something we could implement.

On the other hand, it's trivial to add per-iteration BIC estimates in the generated quantities blocks of the NMF and EMu models - we just need to count the number of parameters being estimated, which is roughly G(S-1) + S(C-1), plus any multipliers used in the EMu model to sync up exposures and opportunities.

Another problem is that the current log_lik in the models is returning a single log likelihood value for each sample, not 96.

WAIC (and I assume LOOIC works the same way) is based on pointwise log-likelihoods, i.e. it needs a log-likelihood value for each observation. The NMF model treats each sample as an observation, so there are only as many observations as there are samples. The EMu model treats each category count as an observation, so there are 96*samples observations.

I think we should do BIC per iteration and see if it gives the right answer. Although it would be better to make LOOIC work for us. I will try LOOIC again with the new likelihood format in the sigfit_ext_emu model.
Also, the BIC equation in the EMu paper uses the number of genomes as the number of observations (although maybe the actual implementation is different):

BIC = 2 log P(X|μ) − n (Nc − 1) log M
in our notation:
BIC = 2 log P(counts | sigs, expos, opps) – n_signatures * 95 * log G

On the other hand, this equation fails to count the exposures as parameters (it should be G(S–1) + S(C–1), as you say).

For my 119-genome dataset, the BIC still increases with number of signatures (see below). I'm using the mean of the BIC samples for each nsignatures; based on the distributions, it's the same using the mean or the median.

NMF extraction (2500 samples)

nsignatures=4 nsignatures=5 nsignatures=6 nsignatures=7 nsignatures=8 
    -87316.46     -84539.26     -82442.21     -81864.57     -81639.82 

image

EMu extraction (2500 samples)

nsignatures=3 nsignatures=4 nsignatures=5 nsignatures=6 nsignatures=7 
    -96507.75     -89207.84     -86565.76     -84349.45     -83783.56

image

Original EMu

Trying to find  1 spectrum...LLH = -5.207613e+05 BIC = -1.041977e+06
Trying to find  2 spectra....LLH = -6.347389e+04 BIC = -1.278558e+05
Trying to find  3 spectra....LLH = -5.328368e+04 BIC = -1.079294e+05
Trying to find  4 spectra....LLH = -4.994169e+04 BIC = -1.016994e+05
Trying to find  5 spectra....LLH = -4.792634e+04 BIC = -9.812276e+04 **
Trying to find  6 spectra....LLH = -4.813509e+04 BIC = -9.899428e+04

I would guess that our likelihood is higher than the original EMu one, and less sensitive to penalisation (although I think we count more parameters in the penalisation).