/MutationalPatterns

R package for extracting and visualizing mutational patterns in SNV data

Primary LanguageRMIT LicenseMIT

MutationalPatterns

The MutationalPatterns R package provides a comprehensive set of flexible functions for easy finding and plotting of mutational patterns in Single Nucleotide Variant (SNV) data.

Please give credit and cite MutationalPatterns R Package when you use it for your data analysis. For information on how to cite this package in your publication execute:

citation("MutationalPatterns")

Getting started

Installation

This package is dependent on R version 3.2.4

Install and load devtools package

install.packages("devtools")
library(devtools)

Install and load MutationalPatterns package

options(unzip = 'internal')
install_github("CuppenResearch/MutationalPatterns")
library(MutationalPatterns)

Reference genome

  1. List all available reference genomes (BSgenome)
available.genomes()
  1. Download and load your reference genome of interest
ref_genome = "BSgenome.Hsapiens.UCSC.hg19"
source("http://bioconductor.org/biocLite.R")
biocLite(ref_genome)
library(ref_genome, character.only = T)

Load SNV data

This package is for the analysis of patterns in SNV data only, therefore the vcf files should not contain indel positions.

Find package SNV example data

vcf_files = list.files(system.file("extdata", package="MutationalPatterns"), full.names = T)

Load a single vcf file

vcf = read_vcf(vcf_files[1], "sample1")

Load a list of vcf files

sample_names = c("colon1", "colon2", "colon3", "intestine1", "intestine2", "intestine3", "liver1", "liver2", "liver3")
vcfs = read_vcf(vcf_files, sample_names)

Include relevant metadata in your analysis, e.g. donor id, cell type, age, tissue type, mutant or wild type

tissue = c("colon", "colon", "colon", "intestine", "intestine", "intestine", "liver", "liver", "liver")

Analyses

Mutation types

Retrieve base substitutions from vcf object as "REF>ALT"

get_muts(vcfs[[1]])

Retrieve base substitutions from vcf and convert to the 6 types of base substitution types that are distinguished by convention: C>A, C>G, C>T, T>A, T>C, T>G. For example, if the reference allele is G and the alternative allele is T (G>T), this functions returns the G:C>T:A mutation as a C>A mutation.

get_types(vcfs[[1]])

Retrieve the context (1 base upstream and 1 base downstream) of the positions in the vcf object from the reference genome.

get_mut_context(vcfs[[1]], ref_genome)

Retrieve the types and context of the base substitution types for all positions in the vcf object. For the base substitutions that are converted to the conventional base substitution types, the reverse complement of the context is returned.

get_type_context(vcfs[[1]], ref_genome)

Count mutation type occurences for one vcf object

type_occurences = mut_type_occurences(vcfs[1], ref_genome)

Count mutation type occurences for all samples in a list of vcf objects

type_occurences = mut_type_occurences(vcfs, ref_genome)

Mutation spectrum

Plot mutation spectrum over all samples. Plottes is the mean relative contribution of each of the 6 base substitution types. Error bars indicate standard deviation over all samples. The n indicates the total number of mutations in the set.

plot_spectrum(type_occurences)

Plot mutation spectrum with distinction between C>T at CpG sites

plot_spectrum(type_occurences, CT = T)

Plot spectrum without legend

plot_spectrum(type_occurences, CT = T, legend = F)

spectra1

Plot spectrum for each tissue separately

plot_spectrum(type_occurences, by = tissue, CT = T)

Specify 7 colors for spectrum plotting

my_colors = c("pink", "orange", "blue", "lightblue", "green", "red", "purple")
plot_spectrum(type_occurences, CT = T, legend = T, colors = my_colors)

spectra2

96 Mutation Profile

Make 96 trinucleodide mutation count matrix

mut_matrix = make_mut_matrix(vcf_list = vcfs, ref_genome = ref_genome)

Plot 96 profile of three samples

plot_96_profile(mut_matrix[,c(1,4,7)])

96_mutation_profile

Extract Signatures

Estimate optimal rank for NMF mutation matrix decomposition

estimate_rank(mut_matrix, rank_range = 2:5, nrun = 50)

estim_rank

Extract and plot 3 signatures

nmf_res = extract_signatures(mut_matrix, rank = 3)
# provide signature names (optional)
colnames(nmf_res$signatures) = c("Signature A", "Signature B" , "Signature C")
# plot signatures
plot_96_profile(nmf_res$signatures)

signatures

Plot signature contribution

# provide signature names (optional)
rownames(nmf_res$contribution) = c("Signature A", "Signature B" , "Signature C")
# plot signature contribution
plot_contribution(nmf_res$contribution, coord_flip = T)

contribution

Compare reconstructed mutation profile with original mutation profile

plot_compare_profiles(mut_matrix[,1], nmf_res$reconstructed[,1], profile_names = c("Original", "Reconstructed"))

originalVSreconstructed

Fit 96 mutation profiles to known signatures

Download signatures from pan-cancer study Alexandrov et al.

url = "http://cancer.sanger.ac.uk/cancergenome/assets/signatures_probabilities.txt"
cancer_signatures = read.table(url, sep = "\t", header = T)
# reorder (to make the order of the trinucleotide changes the same)
cancer_signatures = cancer_signatures[order(cancer_signatures[,1]),]
# only signatures in matrix
cancer_signatures = as.matrix(cancer_signatures[,4:33])

Fit mutation matrix to cancer signatures. This function finds the optimal linear combination of mutation signatures that most closely reconstructs the mutation matrix by solving nonnegative least-squares constraints problem

fit_res = fit_to_signatures(mut_matrix, cancer_signatures)
# select signatures with some contribution
select = which(rowSums(fit_res$contribution) > 0)
# plot contribution
plot_contribution(fit_res$contribution[select,], coord_flip = T)

signatures

Compare reconstructed mutation profile of sample 1 using cancer signatures with original profile

plot_compare_profiles(mut_matrix[,1], fit_res$reconstructed[,1], profile_names = c("Original", "Reconstructed \n cancer signatures"))

contribution

Rainfall plot

A rainfall plot visualizes mutation types and intermutation distance. Rainfall plots can be used to visualize the distribution of mutations along the genome or a subset of chromosomes. The y-axis corresponds to the distance of a mutation with the previous mutation and is log10 transformed. Drop-downs from the plots indicate clusters or "hotspots" of mutations.

Make rainfall plot of sample 1 over all autosomal chromosomes

# define autosomal chromosomes
chromosomes = seqnames(get(ref_genome))[1:22]
# make rainfall plot
plot_rainfall(vcfs[[1]], title = names(vcfs[1]), ref_genome = ref_genome, chromosomes = chromosomes, cex = 1.5)

rainfall1

Make rainfall plot of sample 1 over chromosome 1

chromosomes = seqnames(get(ref_genome))[1]
plot_rainfall(vcfs[[1]], title = names(vcfs[1]), ref_genome = ref_genome, chromosomes = chromosomes[1], cex = 2)

rainfall2

Things to do

1 Genomic distribution

  • genomic regions bed input
  • surveyed area??

2 Strand bias analysis

  • get gene bodies for each ref genome