Range of total variance explained per view
merlevede opened this issue · 4 comments
Hello,
I ran MOFA on 5 datasets, 1 with 4 data types and 4 with 2 data types.
I got very different proportions of total variance explained per view, and I would like some feedback on that:
A/ 0,0,0.5,0.6
B/ 0.5,0.9
C/ 0.3,0.5
D/ 0,0.5
E/ 0.25,0
- I understand that when no variance is explained in a data type, that data type is useless. I would assume that this often reflects a problem in the analysis (such as an imbalanced number of features across data types) rather than a lack of biological interest. Do you agree?
- In one dataset, I got quite good total variance explained per view: 0.5 and 0.9 for the 2 layers. Is that good compared to what is usually found? Can I assume that the subsequent analysis will be "better" than when the total variance explained per view is lower, for example 0.2 and 0.4?
- In these 5 cases I tried to have a balanced design across the data types, but in several datasets I got no variance explained in one layer. Is this common? Would you recommend anything more (such as being more stringent in the number of features kept) to change that?
Thank you in advance for your feedback
Jane
- That is a good point, and both explanations are possible. A data set may be large, but if it contains no coordinated variation (for example, a simulated random matrix) the model will explain no variance in it. It may also happen that the data set contains biological signal that is very small compared with the other data sets, which would also lead to very low variance explained. To distinguish the two cases, you could run MOFA (or just PCA!) on this matrix alone and see whether you detect any meaningful factors.
- 90% variance explained is a lot. In my experience this is usually because there is a massive Factor 1 that captures "size factor effects". In RNA-seq data this usually occurs when the data are not properly normalised by library size. In DNA methylation data it can occur if the samples have very different global methylation rates.
- Can you give me more details? What are the data modalities and their corresponding dimensionalities (`plotDataOverview`)?
If you want to have a more interactive chat, feel free to reach me via the Slack channel. You can find the link on the main page of the repository.
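For reference, a minimal sketch of the single-view check from the first point above, in plain Python rather than MOFA itself. It compares the fraction of variance captured by the first principal component in a pure-noise matrix with that in a matrix driven by one shared latent factor. The function name `top_component_fraction`, the matrix sizes and the loading of 2.0 are all illustrative assumptions, not anything taken from MOFA.

```python
import random


def top_component_fraction(matrix, n_iter=200):
    """Fraction of total variance captured by the first principal component.

    `matrix` is a list of samples, each a list of feature values.
    """
    n, d = len(matrix), len(matrix[0])
    # Centre each feature (column) at zero.
    means = [sum(row[j] for row in matrix) / n for j in range(d)]
    centred = [[row[j] - means[j] for j in range(d)] for row in matrix]
    # Feature-by-feature sample covariance matrix.
    cov = [[sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(d)]
           for a in range(d)]
    total_variance = sum(cov[j][j] for j in range(d))
    # Power iteration to approximate the leading eigenvector.
    v = [1.0] * d
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Rayleigh quotient of the unit vector v approximates the top eigenvalue.
    cov_v = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
    top_eigenvalue = sum(v[a] * cov_v[a] for a in range(d))
    return top_eigenvalue / total_variance


rng = random.Random(0)
n_samples, n_features = 100, 10

# Case 1: independent noise, i.e. no coordinated variation across features.
noise_only = [[rng.gauss(0, 1) for _ in range(n_features)]
              for _ in range(n_samples)]

# Case 2: one latent factor shared by all features, plus noise.
with_signal = []
for _ in range(n_samples):
    z = rng.gauss(0, 1)  # latent factor value for this sample
    with_signal.append([2.0 * z + rng.gauss(0, 1) for _ in range(n_features)])

noise_fraction = top_component_fraction(noise_only)
signal_fraction = top_component_fraction(with_signal)
print(round(noise_fraction, 2), round(signal_fraction, 2))
```

If a real view behaves like the noise-only case when analysed alone, the zero variance explained is probably genuine. If it shows a clear leading component alone but not in the joint model, its signal is more likely being swamped by the other views.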
Thanks a lot for your comments.
1. That is a good point, and both explanations are possible. A data set may be large, but if it contains no coordinated variation (for example, a simulated random matrix) the model will explain no variance in it. It may also happen that the data set contains biological signal that is very small compared with the other data sets, which would also lead to very low variance explained. To distinguish the two cases, you could run MOFA (or just PCA!) on this matrix alone and see whether you detect any meaningful factors.
Nice idea, I will test that.
2. 90% variance explained is a lot. In my experience this is usually because there is a massive Factor 1 that captures "size factor effects". In RNA-seq data this usually occurs when the data are not properly normalised by library size. In DNA methylation data it can occur if the samples have very different global methylation rates.
This happened for a dataset with CNA and mRNA. CNA had 90% of variance explained by one factor.
VarianceExplained_MB.pdf
Here the input matrix for CNA consists of the values 0, 1, 2, 3 and 4.
3. Can you give me more details? What are the data modalities and their corresponding dimensionalities (`plotDataOverview`)?
Attached are the plots for datasets A, D and E
DataOverview_AML.pdf
VarianceExplained_AML.pdf
DataOverview_pLGG.pdf
VarianceExplained_pLGG.pdf
DataOverview_rhabdoid.pdf
VarianceExplained_rhabdoid.pdf
Clearly the CNA data set has a "size factor" problem captured by Factor 1. This means that some samples have, globally, more copy number variation than others. Although this could certainly be biological variation, it is more often technical variation that arises when the samples are not properly normalised by library size.
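One quick way to check this, sketched in plain Python under stated assumptions: correlate the per-sample totals of the input matrix with the Factor 1 values. The toy `cna` matrix and the `factor1` values below are made up for illustration; in practice the factor values would come from the fitted model. A correlation close to 1 or -1 would be consistent with Factor 1 acting as a size factor.

```python
def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy data: rows are samples, columns are CNA features.
cna = [
    [0, 1, 0, 0, 1],
    [2, 3, 2, 4, 3],
    [1, 1, 2, 1, 1],
    [4, 4, 3, 4, 4],
]
# Hypothetical Factor 1 values for the same four samples.
factor1 = [-1.2, 0.9, -0.4, 1.5]

# Per-sample total, the analogue of library size for this matrix.
sample_totals = [sum(row) for row in cna]
r = pearson(sample_totals, factor1)
print(round(r, 3))
```

This does not tell you whether the global difference is biological or technical, only whether Factor 1 is tracking it.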
You mention that the input matrix has count values 0, 1, 2, 3 and 4, right? Honestly I am not familiar with CNA data, but I would suggest three possible strategies:
(1) Normalise each sample by the total number of counts and use the Gaussian likelihood. You may want to log-transform to make the data look a bit more Gaussian-ish.
(2) Use the Poisson likelihood with the current values.
(3) Binarise and use the Bernoulli likelihood?
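As a rough sketch of what options (1) and (3) would do to the matrix before it goes into the model (option (2) uses the values unchanged), in plain Python. The helper names, the rescaling to the mean sample total, the `log1p` transform and the binarisation threshold are all assumptions made for illustration. In particular, which values should count as "altered" depends on how the CNA values are encoded, which I do not know.

```python
import math


def normalise_log(matrix):
    """Option (1): scale each sample to a common total, then log-transform.

    Each row (sample) is divided by its own total and multiplied by the mean
    total across samples, then passed through log(1 + x).
    """
    totals = [sum(row) for row in matrix]
    target = sum(totals) / len(totals)  # assumed common scale
    out = []
    for row, total in zip(matrix, totals):
        if total == 0:
            out.append([0.0 for _ in row])  # nothing to rescale
        else:
            out.append([math.log1p(v / total * target) for v in row])
    return out


def binarise(matrix, threshold=0):
    """Option (3): 1 if the value exceeds `threshold`, else 0.

    The threshold is an assumption; choose it to match the CNA encoding.
    """
    return [[1 if v > threshold else 0 for v in row] for row in matrix]


# Hypothetical toy data: rows are samples, columns are CNA features.
cna = [
    [0, 1, 0, 0, 1],
    [2, 3, 2, 4, 3],
]

gaussian_input = normalise_log(cna)   # for the Gaussian likelihood
poisson_input = cna                   # option (2): values left as they are
bernoulli_input = binarise(cna)       # for the Bernoulli likelihood
print(bernoulli_input)
```

After option (1) every sample has the same total on the untransformed scale, so a factor can no longer be driven purely by how large a sample's values are overall.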
Thank you again for your answer.
For the current data, I used the Poisson likelihood for CNA. I can try binarising or normalising this data.
It is a bit odd, though, that for some datasets CNA explains no variance at all. But I guess this is dataset dependent.
Since I used datasets from cBioPortal, they might not all have been processed the same way.