ICGC PCAWG-11 consensus mutation assignment

Overview

This repo contains code used to assign mutations (SNV, indel and SV) to consensus mutation clusters from subclonal reconstruction methods.

Dependencies

The following software packages were used to develop the code and to run the pipeline on the PCAWG dataset. Installing them via Bioconductor normally takes a few minutes.

R (version 3.1.0)

Internally, the pipeline calls MutationTimeR:

MutationTimeR

R libraries (all installed via Bioconductor)

Bioconductor (version 3.4)
BiocInstaller (version 1.24.0)
ggplot2
gridExtra
grid

How to run

R --no-save --no-restore --vanilla -f run.R --args \
-l [path to where the repository is downloaded] \
--sam [samplename] \
-o [output directory] \
--snv [PCAWG consensus SNV VCF file] \
--cna [PCAWG consensus copy number profile] \
--struct [sample subclonal architecture] \
--pur [tumour purity] \
--ploi [tumour ploidy] \
--sex [donor sex] \
--summ [PCAWG-11 summary table file] \
--ind [PCAWG consensus indel VCF file] \
--sv [PCAWG consensus SV VCF file] \
--sv_vaf [output file from SVclone with VAF values and copy number mapping for each SV] \
--iswgd [provide this option when the sample has undergone a whole genome doubling]
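
As a minimal sketch of how the call above might be scripted per sample, the snippet below assembles the argument list and appends --iswgd only when the sample is flagged as whole genome doubled. The paths, the sample name and the variable names (repo_dir, samplename, outdir, is_wgd) are hypothetical placeholders, and it is an assumption that --iswgd is a bare switch that takes no value. The command is printed rather than executed.

```shell
# Hypothetical placeholder values; substitute the real ones for your sample.
repo_dir="/path/to/repository"
samplename="sample01"
outdir="output/sample01"
is_wgd="yes"   # "yes" if the sample has undergone a whole genome doubling

# Start the argument list with flags taken from the command above.
# Only three inputs are shown; the remaining ones are added the same way.
set -- -l "$repo_dir" --sam "$samplename" -o "$outdir"

# Assumption: --iswgd is a bare switch that is passed only for WGD samples.
if [ "$is_wgd" = "yes" ]; then
  set -- "$@" --iswgd
fi

# Dry run: print the assembled command instead of executing it.
cmd="R --no-save --no-restore --vanilla -f run.R --args $*"
echo "$cmd"
```

Printing the command first makes it easy to check the arguments for each sample before launching the actual run.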

Once the pipeline has completed, raw cancer cell fraction estimates for each mutation can be generated by running the following command:

R --no-save --no-restore --vanilla -f run.R --args \
-l [path to where the repository is downloaded] \
-i [assignment pipeline output directory] \
-o [ccf output directory]

The probabilities of mutations being gained can be obtained by running the following command:

R --no-save --no-restore --vanilla -f run_prob_gained.R --args \
-l [path to where the repository is downloaded] \
-i [assignment pipeline output directory] \
-o [ccf output directory]

Produced output

Running the pipeline produces the following output files:

 [samplename]_subclonal_structure.txt : The consensus subclonal structure after assignment of mutations
 [samplename]_cluster_assignments.txt : Mutation to cluster assignment probabilities as established by the pipeline
 [samplename]_mutation_timing.txt : Classification of each mutation as clonal (early, late, not known) or subclonal, based on the cluster assignment with the highest probability
 [samplename]_assignment.RData : All output stored in an RData archive (requires loading of dependencies)
 [samplename]_pcawg11_output.RData : All PCAWG-11 output (no dependencies required)
 [samplename]_summary_table_entry.txt : A PCAWG-11 summary table entry for this sample, containing the columns provided in the input table plus summary output of this pipeline
 [samplename]_final_assignment.png : A figure showing the data and assignment, with comparison to regular binomial assignment

The text files contain the following data:

[samplename]_subclonal_structure.txt

Column	Description
cluster	Cluster id
fraction_total_cells	Fraction of total cells that the cluster represents (sometimes referred to as cellular prevalence)
fraction_cancer_cells	Fraction of tumour cells that the cluster represents (sometimes referred to as cancer cell fraction)
n_snvs	Estimated number of SNVs belonging to this cluster
n_indels	Estimated number of indels belonging to this cluster
n_svs	Estimated number of SVs belonging to this cluster
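
As a hedged illustration of how this file might be summarised on the command line, the sketch below prints the cancer cell fraction and the total mutation count per cluster. It assumes the file is tab-delimited with a header row using the column names listed above; the toy values are invented for illustration and are not real pipeline output.

```shell
# Toy table, invented for illustration only (not real pipeline output).
tmpdir=$(mktemp -d)
structure="$tmpdir/toy_subclonal_structure.txt"
{
  printf 'cluster\tfraction_total_cells\tfraction_cancer_cells\tn_snvs\tn_indels\tn_svs\n'
  printf '1\t0.70\t1.00\t900\t80\t20\n'
  printf '2\t0.28\t0.40\t450\t40\t10\n'
} > "$structure"

# Look columns up by header name, then print cluster id, cancer cell
# fraction and the total number of mutations assigned to each cluster.
summary=$(awk -F'\t' '
  NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
  {
    total = $(col["n_snvs"]) + $(col["n_indels"]) + $(col["n_svs"])
    printf "%s\t%s\t%d\n", $(col["cluster"]), $(col["fraction_cancer_cells"]), total
  }
' "$structure")
echo "$summary"
rm -r "$tmpdir"
```

Looking columns up by name rather than by position keeps the sketch working if the column order differs from the table above.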

[samplename]_cluster_assignments.txt

Column	Description
chromosome	Chromosome on which the mutation is found
position	Position at which the mutation is found
mut_type	Type of mutation (SNV, Indel, SV)
cluster_*	Probability of belonging to each cluster. The column header contains a cluster id that refers to the above subclonal structure file
chromosome2	Chromosome of second end-point (SV only)
position2	Position of second end-point (SV only)
svid	SV id (SV only)
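
The hard assignment mentioned earlier (the cluster with the highest probability) can be sketched from this file as follows. This is an illustration only, not the pipeline's own code: it assumes a tab-delimited file with a header row, that the SV-only columns hold NA for other mutation types, and it uses invented toy values. Ties are resolved in favour of the first cluster column, which is a choice made here for simplicity.

```shell
# Toy table, invented for illustration only (not real pipeline output).
tmpdir=$(mktemp -d)
assignments="$tmpdir/toy_cluster_assignments.txt"
{
  printf 'chromosome\tposition\tmut_type\tcluster_1\tcluster_2\tchromosome2\tposition2\tsvid\n'
  printf '1\t12345\tSNV\t0.90\t0.10\tNA\tNA\tNA\n'
  printf '2\t67890\tIndel\t0.25\t0.75\tNA\tNA\tNA\n'
  printf '3\t1000\tSV\t0.60\t0.40\t5\t2000\tsv_1\n'
} > "$assignments"

# Collect the cluster_* columns from the header, then report for each
# mutation the id of the cluster with the highest probability.
hard_calls=$(awk -F'\t' '
  NR == 1 {
    for (i = 1; i <= NF; i++) {
      col[$i] = i
      # substr(..., 9) drops the 8-character "cluster_" prefix
      if ($i ~ /^cluster_/) { n++; idx[n] = i; id[n] = substr($i, 9) }
    }
    next
  }
  {
    best = 1
    for (k = 2; k <= n; k++)
      if ($(idx[k]) + 0 > $(idx[best]) + 0) best = k
    printf "%s\t%s\t%s\t%s\n", $(col["chromosome"]), $(col["position"]), $(col["mut_type"]), id[best]
  }
' "$assignments")
echo "$hard_calls"
rm -r "$tmpdir"
```

The reported ids can be matched against the cluster column of the subclonal structure file.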

[samplename]_mutation_timing.txt

Column	Description
chromosome	Chromosome on which the mutation is found
position	Position at which the mutation is found
mut_type	Type of mutation (SNV, Indel, SV)
timing	Hard timing assignment of the mutation. This is the best guess when a single call is required; using the probabilities is advised instead
chromosome2	Chromosome of second end-point (SV only)
position2	Position of second end-point (SV only)
svid	SV id (SV only)
prob_clonal_early	Probability of mutation being clonal and early
prob_clonal_late	Probability of mutation being clonal and late
prob_subclonal	Probability of mutation being subclonal
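
Because using the probabilities is advised over the hard timing call, one possible approach is to sum the probability columns to obtain the expected number of mutations in each timing category. The sketch below assumes a tab-delimited file with a header row; the toy rows are invented, and the strings in the timing column are placeholders rather than the labels the pipeline actually writes.

```shell
# Toy table, invented for illustration only (not real pipeline output).
# The timing column holds a placeholder, not a real pipeline label.
tmpdir=$(mktemp -d)
timing="$tmpdir/toy_mutation_timing.txt"
{
  printf 'chromosome\tposition\tmut_type\ttiming\tchromosome2\tposition2\tsvid\tprob_clonal_early\tprob_clonal_late\tprob_subclonal\n'
  printf '1\t12345\tSNV\tplaceholder\tNA\tNA\tNA\t0.80\t0.10\t0.10\n'
  printf '2\t67890\tIndel\tplaceholder\tNA\tNA\tNA\t0.20\t0.70\t0.10\n'
  printf '3\t1000\tSV\tplaceholder\t5\t2000\tsv_1\t0.05\t0.05\t0.90\n'
} > "$timing"

# Sum each probability column to get expected mutation counts per category.
# Note: a non-numeric entry such as NA counts as 0 in awk arithmetic.
expected_counts=$(awk -F'\t' '
  NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
  {
    early += $(col["prob_clonal_early"])
    late  += $(col["prob_clonal_late"])
    subcl += $(col["prob_subclonal"])
  }
  END {
    printf "clonal_early\t%.2f\n", early
    printf "clonal_late\t%.2f\n", late
    printf "subclonal\t%.2f\n", subcl
  }
' "$timing")
echo "$expected_counts"
rm -r "$tmpdir"
```

Unlike counting hard calls, these sums carry the assignment uncertainty of each mutation into the per-category totals.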

