About the output.
b-niu opened this issue · 2 comments
Dear Professor Roth,
I have some questions about the output file.
- In the README, it's mentioned that,
cellular_prevalence_std - Standard error of the cellular_prevalence estimate.
I think, generally speaking, std means standard deviation. Is it standard error in the case of pyclone-vi?
- If we are going to calculate the 95% CI of the CCF of a mutation, should we calculate it as:
cellular_prevalence + 1.96 * cellular_prevalence_std
or
cellular_prevalence + 1.96 * cellular_prevalence_std / sqrt(size)
"size" is the size of each cluster_id.
- I have noticed that, in the output of pyclone-vi, every mutation_id in the same cluster_id shares the same CCF and std, which is different from pyclone's situation. Is that how it's designed?
Here are two examples.
PyClone-VI:
mutation_id sample_id cluster_id cellular_prevalence cellular_prevalence_std cluster_assignment_prob
chr10_120877246_A R010_TR 0 0.9840 0.0156 0.2947
chr10_1284136_C R010_TR 0 0.9840 0.0156 0.4006
chr10_97684075_A R010_TR 0 0.9840 0.0156 0.9579
chr11_118243762_T R010_TR 0 0.9840 0.0156 0.3202
chr11_31703559_A R010_TR 0 0.9840 0.0156 0.9918
chr11_541549_C R010_TR 0 0.9840 0.0156 0.3006
chr11_57564176_G R010_TR 0 0.9840 0.0156 0.6551
chr11_64507462_A R010_TR 0 0.9840 0.0156 0.9929
chr11_66373098_C R010_TR 0 0.9840 0.0156 0.6878
chr11_82877429_G R010_TR 0 0.9840 0.0156 0.9945
chr11_82877430_A R010_TR 0 0.9840 0.0156 0.9925
chr12_110891640_C R010_TR 0 0.9840 0.0156 0.8822
chr12_49421501_C R010_TR 0 0.9840 0.0156 0.6770
chr12_49421502_A R010_TR 0 0.9840 0.0156 0.6770
chr12_57624480_C R010_TR 0 0.9840 0.0156 0.9985
chr13_21439555_T R010_TR 0 0.9840 0.0156 0.2805
chr14_102461723_C R010_TR 0 0.9840 0.0156 0.6659
chr14_103109576_A R010_TR 0 0.9840 0.0156 0.8032
PyClone:
mutation_id sample_id cluster_id cellular_prevalence cellular_prevalence_std variant_allele_frequency
chr10_100401647_C R003_TL8 0 0.18071347441715985 0.016903199731462922 0.06382978723404255
chr10_101639648_C R003_TL8 0 0.17982105619361255 0.015457517029176877 0.05825242718446602
chr10_101640057_G R003_TL8 0 0.1818155689181625 0.02053535541170497 0.06666666666666667
chr10_103454403_C R003_TL8 0 0.17823996988723376 0.015462692424096456 0.04918032786885246
chr10_104128994_T R003_TL8 0 0.2127829659110728 0.07472412308509058 0.13333333333333333
chr10_104129015_T R003_TL8 0 0.2187090823368248 0.08078620686140944 0.13953488372093023
chr10_104836814_C R003_TL8 0 0.2546815776966093 0.10477168311907836 0.16666666666666666
chr10_105160212_C R003_TL8 0 0.18108585561346474 0.014686502764975476 0.07333333333333333
chr10_105885264_G R003_TL8 0 0.22907172278966256 0.08973299189257782 0.15
chr10_105885268_C R003_TL8 0 0.20204729995466708 0.06216311605096555 0.11904761904761904
chr10_105956703_G R003_TL8 1 0.4397005742926062 0.07706271883953303 0.25
chr10_105956709_C R003_TL8 1 0.3856536712062231 0.09494088610156436 0.21311475409836064
chr10_112581108_G R003_TL8 0 0.19207114875896375 0.04454604366819066 0.10344827586206896
chr10_11308595_G R003_TL8 0 0.18812766472747872 0.036868029346953456 0.09090909090909091
chr10_114910828_C R003_TL8 0 0.18290222729160732 0.022546855377296227 0.0759493670886076
chr10_115664633_C R003_TL8 0 0.18959340326205318 0.03908659880473149 0.1
chr10_118891744_C R003_TL8 0 0.17147225370889752 0.024161940801417633 0.03389830508474576
chr10_119798701_A R003_TL8 0 0.18078720837352374 0.01691467466343686 0.06451612903225806
chr10_12056042_T R003_TL8 0 0.18471136799246454 0.0274178450380843 0.08333333333333333
chr10_12056078_T R003_TL8 0 0.1845903842023367 0.026972653638206962 0.08333333333333333
chr10_123256076_G R003_TL8 0 0.2194958894536334 0.0823895430972849 0.14285714285714285
- It is the standard deviation, i.e. the square root of the variance. It is computed from the posterior distribution of the CCF for the cluster.
- It would be cellular_prevalence ± 1.96 * cellular_prevalence_std, without dividing by sqrt(size). That assumes the variables follow a Gaussian, which they likely don't. The posterior may be multimodal, for example, though that is rare if the cluster has more than two mutations assigned. It is probably better to think of the standard deviation as a relative measure of confidence for comparing estimates between clusters.
- This is expected. The CCF quoted is the mean value of the cluster the mutation is assigned to. This differs from PyClone, where we compute the mean value of the CCF across the MCMC samples. The latter better represents uncertainty over clustering, but I suspect it makes little difference in practice.
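For anyone who still wants a rough interval, here is a minimal sketch of the Gaussian approximation discussed above. It is not part of PyClone-VI. The helper names (`approx_interval`, `cluster_intervals`) are hypothetical, the input is assumed to be tab-delimited with the column names shown in the example above, and clipping the bounds to [0, 1] is my own choice because a CCF cannot fall outside that range. Since the posterior is likely not Gaussian, treat the result as a relative indication only.

```python
import csv
import io

Z_95 = 1.96  # two-sided 95% quantile of the standard normal


def approx_interval(mean, std, z=Z_95):
    """Gaussian-approximation interval, clipped to the valid CCF range [0, 1]."""
    lower = max(0.0, mean - z * std)
    upper = min(1.0, mean + z * std)
    return lower, upper


def cluster_intervals(tsv_text):
    """Return {(sample_id, cluster_id): (lower, upper)} for PyClone-VI style output.

    Assumes tab-delimited text with the columns sample_id, cluster_id,
    cellular_prevalence and cellular_prevalence_std.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    intervals = {}
    for row in reader:
        key = (row["sample_id"], row["cluster_id"])
        interval = approx_interval(
            float(row["cellular_prevalence"]),
            float(row["cellular_prevalence_std"]),
        )
        # Mutations in the same cluster share the same mean and std,
        # so a differing interval for the same key signals unexpected input.
        if intervals.setdefault(key, interval) != interval:
            raise ValueError(f"inconsistent values within cluster {key}")
    return intervals


# Two rows copied from the PyClone-VI example earlier in this thread.
header = [
    "mutation_id", "sample_id", "cluster_id",
    "cellular_prevalence", "cellular_prevalence_std", "cluster_assignment_prob",
]
rows = [
    ["chr10_120877246_A", "R010_TR", "0", "0.9840", "0.0156", "0.2947"],
    ["chr10_1284136_C", "R010_TR", "0", "0.9840", "0.0156", "0.4006"],
]
example = "\n".join("\t".join(fields) for fields in [header] + rows)
result = cluster_intervals(example)
```

Note that no division by sqrt(size) appears: the reported std already describes the cluster-level posterior, so the interval is computed once per cluster and sample rather than per mutation.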
Thank you, professor. My questions have been answered.
Best regards,
Bing