/MutationTimeR

An R package to time somatic mutations

Primary LanguageR

MutationTimeR

MutationTimeR is an R package to time somatic mutations relative to clonal and subclonal copy number states and calculate the relative timing of copy number gains. Time is measured as a fraction of point mutations; this is termed mutation time. Mutation time is proportional to real time if the number of mutations acquired per bp and per year is constant.

MutationTimeR has been used by the PCAWG consortium to calculate mutation times of 2,778 whole genome sequencing samples. Please see M. Gerstung, C. Jolly, I. Leshchiner, S. Dentro, S. Gonzalez et al., The Evolutionary History of 2,658 Cancers, Nature. 578, pages 122-128(2020).

Installation

MutationTimeR runs in most current R versions. You can install and load MutationTimeR using

devtools::install_github("mg14/mg14")
devtools::install_github("gerstung-lab/MutationTimeR")

Input data

MutationTimeR requires a vcf object containing point mutations (SNVs, MNVs, or indels) and a GRanges object with the copy number segments of the sample.

library("MutationTimeR")
#vcf <- readVcf("myvcf.vcf") # Point mutations, needs `geno` entries `AD` and `DP` or `info` columns t_alt_count t_ref_count.
#bb <- GRanges(, major_cn= , minor_cn=, clonal_frequency=purity) # Copy number segments, needs columns  major_cn, minor_cn and clonal_frequency of each segment
#clusters <- data.frame(cluster= , n_ssms=, proportion=) # Optional data.frame with subclonal cluster locations (VAF proportion) and size (number of variants n_ssms)

Here we cut this short by

data(MutationTimeR)

The vcf object containing all point mutations is:

vcf
## class: ExpandedVCF 
## dim: 5499 1 
## rowRanges(vcf):
##   GRanges with 4 metadata columns: REF, ALT, QUAL, FILTER
## info(vcf):
##   DataFrame with 2 columns: t_alt_count, t_ref_count
## info(header(vcf)):
##                Number Type    Description     
##    t_alt_count 1      Integer Tumour alt count
##    t_ref_count 1      Integer Tumour ref count
## geno(vcf):
##   SimpleList of length 3: AD, DP, FT
## geno(header(vcf)):
##       Number Type    Description                                             
##    AD 2      Integer Allelic depths (number of reads in each observed allele)
##    DP 1      Integer Total read depth                                        
##    FT 1      String  Variant filters

The GRanges with allele-specific copy number data

bb
## GRanges object with 80 ranges and 3 metadata columns:
##        seqnames              ranges strand |  major_cn  minor_cn
##           <Rle>           <IRanges>  <Rle> | <integer> <integer>
##    [1]        1     17500-144915331      * |         1         1
##    [2]        1 144915332-147959385      * |         2         1
##    [3]        1 147959386-149718742      * |         2         1
##    [4]        1 149718743-249250620      * |         2         1
##    [5]        2     17500-242979500      * |         1         1
##    ...      ...                 ...    ... .       ...       ...
##   [76]       20   25006082-25168069      * |         1         1
##   [77]       20   25168070-29532764      * |         1         0
##   [78]       20   29532765-63025519      * |         1         1
##   [79]       21          1-48129894      * |         1         1
##   [80]       22   12200000-51304565      * |         1         0
##        clonal_frequency
##               <numeric>
##    [1]             0.54
##    [2]             0.54
##    [3]             0.54
##    [4]             0.54
##    [5]             0.54
##    ...              ...
##   [76]             0.54
##   [77]             0.54
##   [78]             0.54
##   [79]             0.54
##   [80]             0.54
##   -------
##   seqinfo: 22 sequences from an unspecified genome; no seqlengths

Lastly, the prevalence and number of subclonal is

clusters
##   cluster n_ssms proportion
## 1       0   5518       0.54
## 2       1    845       0.23

Running MutationTimeR

To run MutationTimeR simply use

mt <- mutationTime(vcf, bb, clusters=clusters, n.boot=10)

Annotation of point mutations

MutationTimer produces two main outputs. The first is the probability for each individual point mutation from the original vcf object to belong to different copy number levels:

head(mt$V)
## DataFrame with 6 rows and 15 columns
##       MutCN MutDeltaCN     MajCN     MinCN  MajDerCN  MinDerCN       CNF
##   <numeric>  <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## 1         1          0         1         1         1         1      0.54
## 2         1          0         1         1         1         1      0.54
## 3         1          0         1         1         1         1      0.54
## 4         1          0         1         1         1         1      0.54
## 5         1          0         1         1         1         1      0.54
## 6         1          0         1         1         1         1      0.23
##            CNID            pMutCN     pGain              pSingle
##   <IntegerList>         <numeric> <numeric>            <numeric>
## 1             1 0.999999992268111         0    0.999999992268111
## 2             1   0.9999753674765         0      0.9999753674765
## 3             1 0.999937495120268         0    0.999937495120268
## 4             1 0.999999348396561         0    0.999999348396561
## 5             1 0.999999960786733         0    0.999999960786733
## 6             1 0.999795151536053         0 0.000204848463947305
##                   pSub pAllSubclones         pMutCNTail         CLS
##              <numeric>        <List>          <numeric>    <factor>
## 1 7.73188863450679e-09                0.976558542742796 clonal [NA]
## 2  2.4632523499745e-05                0.653374217923694 clonal [NA]
## 3 6.25048797320028e-05                0.336557312762548 clonal [NA]
## 4 6.51603439185877e-07                0.926037697462508 clonal [NA]
## 5 3.92132672834029e-08                0.940677539480767 clonal [NA]
## 6    0.999795151536053               0.0814516989630261   subclonal

These probabilities are the basis for the following simple clonal states (early clonal/late clonal/clonal/subclonal/NA)

table(mt$V$CLS)
## 
## clonal [early]  clonal [late]    clonal [NA]      subclonal 
##            198             97           4356            848

To add this annotation to the vcf use

vcf <- addMutTime(vcf, mt$V)

Timings of copy number gains

The proportion of point mutations in different copy number states, in turn, defines the molecular timing of gains. These are stored in the following DataFrame:

head(mt$T)
## DataFrame with 6 rows and 10 columns
##                                                                        timing_param
##                                                                              <List>
## 1                                                    1:1:0.27:...,2:1:0.115:...,...
## 2 1:1:0.21259842519685:...,1:2:0.425196850393701:...,2:1:0.0905511811023622:...,...
## 3 1:1:0.21259842519685:...,1:2:0.425196850393701:...,2:1:0.0905511811023622:...,...
## 4 1:1:0.21259842519685:...,1:2:0.425196850393701:...,2:1:0.0905511811023622:...,...
## 5                                                    1:1:0.27:...,2:1:0.115:...,...
## 6                                                    1:1:0.27:...,2:1:0.115:...,...
##                type              time           time.lo           time.up
##            <factor>         <numeric>         <numeric>         <numeric>
## 1                NA                NA                NA                NA
## 2 Mono-allelic Gain                 1 0.485179575172229                 1
## 3 Mono-allelic Gain                NA                NA                NA
## 4 Mono-allelic Gain 0.773534222763912 0.666501932404523 0.815455093388117
## 5                NA                NA                NA                NA
## 6                NA                NA                NA                NA
##    time.2nd time.2nd.lo time.2nd.up time.star n.snv_mnv
##   <numeric>   <numeric>   <numeric>  <factor> <integer>
## 1        NA          NA          NA        NA       148
## 2        NA          NA          NA       ***         8
## 3        NA          NA          NA        NA         0
## 4        NA          NA          NA       ***       128
## 5        NA          NA          NA        NA       304
## 6        NA          NA          NA        NA        22

The relevant columns are time with 95% confidence intervals time.lo and time.hi as well as the counterparts time2/time2.lo/time2.hi for the second gain in cases where one allele has more than one gained copy. The field time.star indicates a tiers: *** indicates gains +1. ** gains +2, which are found to be slightly less reliable and need certain assumptions about their temporal sequence. * are subclonal gains which are hit an miss.

The DataFrame can be added to the copy number GRanges object for convenience.

mcols(bb) <- cbind(mcols(bb),mt$T)

Plot output

Timing annotated VCF and copy number can be plotted using the following command:

plotSample(vcf,bb)

This shows the observed and expected variant allele frequencies of point mutations on the top. This is very useful to spot inconsistencies with purity and copy number configuration. As a rule of thumb the states (horizontal bars) should run right through the middle of the clouds of point mutations. Colours indicate the timing category: Blue = clonal [other], purple = clonal [late], green = clonal [early], red = subclonal.

The middle plot shows the copy number as stacked barplots. Subclonal CN is indicated by fractional bars. Dark grey is major, light grey minor allele.

The bottom plot shows the estimated mutation time of primary and secondary gains (shaded). Boxes denote 95% CIs. The histogram at the right shows the distribution of timing events. Blue = mono-allelic gains (N:1), pink = CN-LOH/gain+loss (N:0) and green = bi-allelic gains (N:2).