Likelihood for methylation data
merlevede opened this issue · 7 comments
Hello,
I have a dataset with 4 data types, including methylation data.
Here (as in most cases), methylation data have a bimodal distribution.
I used gaussian as likelihood (in ModelOptions$likelihood). Don't you think it can be deleterious for the model?
Is there a way to use something more approriate?
Thank you,
Jane
Hi Jane,
if you have bisulfite sequencing data then the CpG sites will indeed be binomially distributed. We don't have this likelihood implemented and is a bit of a pain to do so..
However, if you have bulk data then you'll have enough reads to make a good estimation of a DNA methylation rate for every CpG site (met_reads / total_reads, also called the beta value). You can convert the beta-value to the M-value (which is more normally-ish distributed) with the following code:
m:=log2(((rate)+0.01)/(1-(rate)+0.01))
where rate
goes from 0 to 1.
Then you can just use the gaussian likelihood (just as we did in the CLL study in the paper).
Also, see the following work for some comparisons of beta-values and M-values: Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis
Thank you for your answer.
Yes, I have bulk data from hm27K.
I will try using M-value. This might help, as you can see in my previous post, the "strongest" factor is in methylation layer, and might an artefact since the distribution does not satisfy the requirements
In my experience, DNA methylation always has a very strong Factor that captures genome-wide differences in the global methylation state. Some samples have globally larger DNA methylation than others. A clear example is Factor 1 from the CLL data set in the paper. Patients with more severe phenotype are associated with globally low DNA methylation.
To check this with your model, just:
- Look at the distribution of loadings for the Factor: are most values skewed from 0?
- Correlate factor values wit the mean methylation rate for each sample
Thank you, I will look at that.
I ran MOFA using M values instead of beta values as you suggested, exactly as before otherwise.
I encountered the problem mentionned in your FAQ: "The model does not converge smoothly, and it oscillates between positive and negative deltaELBO values":
I stopped the iterations:
...
Iteration 2200: time=6.53 ELBO=-3339573.32, deltaELBO=645.0367, Factors=11
Iteration 2201: time=7.37 ELBO=-3340335.22, deltaELBO=-761.9044, Factors=11
Iteration 2202: time=6.32 ELBO=-3340309.89, deltaELBO=25.3275, Factors=11
Iteration 2203: time=6.86 ELBO=-3339503.70, deltaELBO=806.1983, Factors=11
Iteration 2204: time=6.61 ELBO=-3340314.82, deltaELBO=-811.1251, Factors=11
Iteration 2205: time=6.98 ELBO=-3340304.00, deltaELBO=10.8202, Factors=11
Iteration 2206: time=6.33 ELBO=-3340303.86, deltaELBO=0.1390, Factors=11
Iteration 2207: time=6.92 ELBO=-3340032.51, deltaELBO=271.3559, Factors=11
Iteration 2208: time=6.47 ELBO=-3340229.75, deltaELBO=-197.2444, Factors=11
Iteration 2209: time=6.87 ELBO=-3340281.51, deltaELBO=-51.7628, Factors=11
Iteration 2210: time=7.11 ELBO=-3339510.71, deltaELBO=770.8043, Factors=11
Iteration 2211: time=7.21 ELBO=-3340079.46, deltaELBO=-568.7471, Factors=11
Iteration 2212: time=7.18 ELBO=-3340237.04, deltaELBO=-157.5890, Factors=11
Iteration 2213: time=6.37 ELBO=-3340519.97, deltaELBO=-282.9281, Factors=11
Iteration 2214: time=6.55 ELBO=-3340365.80, deltaELBO=154.1701, Factors=11
Iteration 2215: time=6.18 ELBO=-3339101.50, deltaELBO=1264.2987, Factors=11
Iteration 2216: time=6.42 ELBO=-3339954.60, deltaELBO=-853.0998, Factors=11
Iteration 2217: time=6.12 ELBO=-3340210.51, deltaELBO=-255.9100, Factors=11
Iteration 2218: time=6.14 ELBO=-3340276.54, deltaELBO=-66.0250, Factors=11
Iteration 2219: time=6.28 ELBO=-3340532.68, deltaELBO=-256.1369, Factors=11
Iteration 2220: time=6.37 ELBO=-3340370.75, deltaELBO=161.9292, Factors=11
Iteration 2221: time=6.57 ELBO=-3340405.23, deltaELBO=-34.4877, Factors=11
Iteration 2222: time=6.55 ELBO=-3340346.07, deltaELBO=59.1615, Factors=11
Iteration 2223: time=6.66 ELBO=-3338903.14, deltaELBO=1442.9367, Factors=11
Iteration 2224: time=7.04 ELBO=-3339896.46, deltaELBO=-993.3262, Factors=11
Iteration 2225: time=6.20 ELBO=-3340406.93, deltaELBO=-510.4694, Factors=11
Iteration 2226: time=6.64 ELBO=-3340342.88, deltaELBO=64.0562, Factors=11
Iteration 2227: time=6.95 ELBO=-3340539.88, deltaELBO=-197.0048, Factors=11
Iteration 2228: time=6.49 ELBO=-3340384.15, deltaELBO=155.7289, Factors=11
Iteration 2229: time=6.96 ELBO=-3340327.80, deltaELBO=56.3493, Factors=11
Iteration 2230: time=6.23 ELBO=-3339090.72, deltaELBO=1237.0786, Factors=11
Iteration 2231: time=6.55 ELBO=-3339951.97, deltaELBO=-861.2463, Factors=11
Iteration 2232: time=6.84 ELBO=-3340209.64, deltaELBO=-257.6656, Factors=11
Iteration 2233: time=7.15 ELBO=-3340500.76, deltaELBO=-291.1278, Factors=11
Iteration 2234: time=6.37 ELBO=-3340361.84, deltaELBO=138.9267, Factors=11
Iteration 2235: time=6.32 ELBO=-3340332.29, deltaELBO=29.5427, Factors=11
Iteration 2236: time=6.23 ELBO=-3340537.16, deltaELBO=-204.8616, Factors=11
Iteration 2237: time=6.10 ELBO=-3339495.66, deltaELBO=1041.4963, Factors=11
Iteration 2238: time=6.12 ELBO=-3340073.54, deltaELBO=-577.8831, Factors=11
Iteration 2239: time=6.17 ELBO=-3340460.38, deltaELBO=-386.8411, Factors=11
Iteration 2240: time=6.12 ELBO=-3340348.11, deltaELBO=112.2691, Factors=11
Iteration 2241: time=6.16 ELBO=-3339918.04, deltaELBO=430.0778, Factors=11
Iteration 2242: time=6.35 ELBO=-3340197.44, deltaELBO=-279.4047, Factors=11
While when using the beta values, it converged in <1000 iterations and identify 9 factors...
Do you have any idea of what is happening here?
A few ideas:
- Did you add a pseudocount to avoid Inf when beta=0?
- Can you plot a histogram of the M-values?
- Is there any feature with no variance or that is full of missing values?
* Did you add a pseudocount to avoid Inf when beta=0?
Yes, I used your formula: m:=log2(((rate)+0.01)/(1-(rate)+0.01))
* Can you plot a histogram of the M-values?
* Is there any feature with no variance or that is full of missing values?
I have no features with NA rows but I have samples with NA columns
Finally, I restarted using ModelOptions$numFactors <- 11 and it could converge with 10 components