icgc_consensus_clustering_assignment

ICGC PCAWG-11 pipeline used for assignment of SNVs, Indels and SVs to consensus mutation clusters

Primary language: R
License: GNU Affero General Public License v3.0 (AGPL-3.0)

ICGC PCAWG-11 consensus mutation assignment

Overview

This repo contains code used to assign mutations (SNV, indel and SV) to consensus mutation clusters from subclonal reconstruction methods.

Dependencies

The following software packages were used to develop the code and to run the pipeline on the PCAWG dataset. Installing them via Bioconductor should normally take a few minutes.

R (version 3.1.0)

Internally, the pipeline calls MutationTimeR

MutationTimeR

R libraries (all installed via Bioconductor)

Bioconductor (version 3.4)
BiocInstaller (version 1.24.0)
ggplot2
gridExtra
grid

How to run

R --no-save --no-restore --vanilla -f run.R --args \
-l [path to where the repository is downloaded] \
--sam [samplename] \
-o [output directory] \
--snv [PCAWG consensus SNV VCF file] \
--cna [PCAWG consensus copy number profile] \
--struct [sample subclonal architecture] \
--pur [tumour purity] \
--ploi [tumour ploidy] \
--sex [donor sex] \
--summ [PCAWG-11 summary table] \
--ind [PCAWG consensus indel file] \
--sv [PCAWG consensus SV VCF file] \
--sv_vaf [output file from SVclone with VAF values and copy number mapping for each SV] \
--iswgd [provide this option when the sample has undergone a whole genome doubling]
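
When many samples have to be processed, the call above can be assembled programmatically. The following is a minimal sketch in Python, not part of this repository: the helper `build_assignment_command`, all file paths and the value passed to `--sex` are hypothetical, and `--iswgd` is assumed to be a bare flag without a value. Only the options listed above are used, and nothing is executed unless the returned list is handed to `subprocess.run`.

```python
from typing import List


def build_assignment_command(
    repo_dir: str, samplename: str, outdir: str, snv: str, cna: str,
    struct: str, purity: float, ploidy: float, sex: str, summary_table: str,
    indel: str, sv: str, sv_vaf: str, is_wgd: bool = False,
) -> List[str]:
    """Assemble the run.R call as an argument list (hypothetical helper)."""
    cmd = [
        "R", "--no-save", "--no-restore", "--vanilla", "-f", "run.R", "--args",
        "-l", repo_dir,
        "--sam", samplename,
        "-o", outdir,
        "--snv", snv,
        "--cna", cna,
        "--struct", struct,
        "--pur", str(purity),
        "--ploi", str(ploidy),
        "--sex", sex,
        "--summ", summary_table,
        "--ind", indel,
        "--sv", sv,
        "--sv_vaf", sv_vaf,
    ]
    # Assumption: --iswgd is a switch that is simply present or absent.
    if is_wgd:
        cmd.append("--iswgd")
    return cmd


# Placeholder inputs, invented for illustration only.
example = build_assignment_command(
    repo_dir="/path/to/repo", samplename="sample1", outdir="output",
    snv="sample1.snv.vcf", cna="sample1.cna.txt", struct="sample1.struct.txt",
    purity=0.8, ploidy=2.1, sex="female", summary_table="summary_table.txt",
    indel="sample1.indel.vcf", sv="sample1.sv.vcf", sv_vaf="sample1.sv_vaf.txt",
    is_wgd=True,
)
print(" ".join(example))
```

Building an argument list rather than a single string avoids shell quoting problems when paths contain spaces.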

Once the pipeline has completed, raw cancer cell fraction estimates for each mutation can be generated by running the following command:

R --no-save --no-restore --vanilla -f run.R --args \
-l [path to where the repository is downloaded] \
-i [assignment pipeline output directory] \
-o [ccf output directory]

The probability of each mutation having been gained can be obtained by running the following command:

R --no-save --no-restore --vanilla -f run_prob_gained.R --args \
-l [path to where the repository is downloaded] \
-i [assignment pipeline output directory] \
-o [ccf output directory]

Produced output

Running the pipeline produces the following output files:

 [samplename]_subclonal_structure.txt : The consensus subclonal structure after assignment of mutations
 [samplename]_cluster_assignments.txt : Mutation to cluster assignment probabilities as established by the pipeline
 [samplename]_mutation_timing.txt : Classification of each mutation as clonal (early, late or not known) or subclonal, based on the cluster assignment with the highest probability
 [samplename]_assignment.RData : All output stored in an RData archive (requires loading of dependencies)
 [samplename]_pcawg11_output.RData : All PCAWG-11 output (no dependencies required)
 [samplename]_summary_table_entry.txt : A PCAWG-11 summary table entry for this sample, containing the columns provided in the input table plus the summary output of this pipeline
 [samplename]_final_assignment.png : A figure showing the data and assignment, with comparison to regular binomial assignment

The text files contain the following data:

[samplename]_subclonal_structure.txt

 Column : Description
 cluster : Cluster id
 fraction_total_cells : Fraction of total cells that the cluster represents (sometimes referred to as cellular prevalence)
 fraction_cancer_cells : Fraction of tumour cells that the cluster represents (sometimes referred to as cancer cell fraction)
 n_snvs : Estimated number of SNVs belonging to this cluster
 n_indels : Estimated number of indels belonging to this cluster
 n_svs : Estimated number of SVs belonging to this cluster
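
The two fraction columns are linked through tumour purity: under the usual definitions, cellular prevalence equals cancer cell fraction multiplied by purity. The sketch below illustrates that relationship in Python. It assumes the file is tab-delimited with a header row, which is not stated here, and it uses invented example values rather than real pipeline output. The function names are hypothetical.

```python
import csv
import io

# Invented example content; real files come from the pipeline.
EXAMPLE = (
    "cluster\tfraction_total_cells\tfraction_cancer_cells\tn_snvs\tn_indels\tn_svs\n"
    "1\t0.80\t1.00\t1200\t90\t15\n"
    "2\t0.32\t0.40\t300\t20\t4\n"
)


def read_subclonal_structure(handle):
    """Parse the table into a list of dicts with numeric fractions."""
    rows = []
    for rec in csv.DictReader(handle, delimiter="\t"):  # assumed tab-delimited
        rows.append({
            "cluster": rec["cluster"],
            "fraction_total_cells": float(rec["fraction_total_cells"]),
            "fraction_cancer_cells": float(rec["fraction_cancer_cells"]),
        })
    return rows


def implied_purity(row):
    """Purity implied by one cluster: cellular prevalence / cancer cell fraction."""
    return row["fraction_total_cells"] / row["fraction_cancer_cells"]


clusters = read_subclonal_structure(io.StringIO(EXAMPLE))
for row in clusters:
    print(row["cluster"], round(implied_purity(row), 3))
```

If the implied purity differs noticeably between clusters of the same sample, that suggests the assumed relationship or the assumed file layout does not hold for that file.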

[samplename]_cluster_assignments.txt

 Column : Description
 chromosome : Chromosome on which the mutation is found
 position : Position at which the mutation is found
 mut_type : Type of mutation (SNV, Indel, SV)
 cluster_* : Probability of belonging to each cluster. Each column header contains a cluster id that refers to the subclonal structure file described above
 chromosome2 : Chromosome of the second end-point (SV only)
 position2 : Position of the second end-point (SV only)
 svid : SV id (SV only)
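
Because each mutation carries one probability per cluster, a hard assignment can be derived by picking the cluster with the highest probability. The following Python sketch shows one way to do that. It assumes a tab-delimited file with a header row and `NA` in the SV-only columns of non-SV rows, neither of which is stated here. The example values are invented and `most_likely_clusters` is a hypothetical helper.

```python
import csv
import io

# Invented example content; real files come from the pipeline.
EXAMPLE = (
    "chromosome\tposition\tmut_type\tcluster_1\tcluster_2\t"
    "chromosome2\tposition2\tsvid\n"
    "1\t12345\tSNV\t0.90\t0.10\tNA\tNA\tNA\n"
    "2\t67890\tIndel\t0.25\t0.75\tNA\tNA\tNA\n"
    "3\t11111\tSV\t0.60\t0.40\t5\t22222\tSV_1\n"
)


def most_likely_clusters(handle):
    """Return (chromosome, position, mut_type, cluster id, probability) per mutation."""
    reader = csv.DictReader(handle, delimiter="\t")  # assumed tab-delimited
    # Every column named cluster_<id> holds the probability for cluster <id>.
    cluster_cols = [c for c in reader.fieldnames if c.startswith("cluster_")]
    result = []
    for rec in reader:
        probs = {c[len("cluster_"):]: float(rec[c]) for c in cluster_cols}
        best = max(probs, key=probs.get)
        result.append(
            (rec["chromosome"], int(rec["position"]), rec["mut_type"], best, probs[best])
        )
    return result


for assignment in most_likely_clusters(io.StringIO(EXAMPLE)):
    print(assignment)
```

The returned cluster id can be matched against the `cluster` column of the subclonal structure file.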

[samplename]_mutation_timing.txt

 Column : Description
 chromosome : Chromosome on which the mutation is found
 position : Position at which the mutation is found
 mut_type : Type of mutation (SNV, Indel, SV)
 timing : Hard-assignment timing of the mutation. This is the best guess when a single label is required; using the probabilities instead is advised
 chromosome2 : Chromosome of the second end-point (SV only)
 position2 : Position of the second end-point (SV only)
 svid : SV id (SV only)
 prob_clonal_early : Probability of the mutation being clonal and early
 prob_clonal_late : Probability of the mutation being clonal and late
 prob_subclonal : Probability of the mutation being subclonal
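
One way to use the probabilities rather than the hard `timing` label is to sum each probability column over all mutations, which gives an expected number of mutations per timing category. The Python sketch below illustrates this. It assumes a tab-delimited file with a header row and `NA` for missing values. The example values, including the labels in the `timing` column, are invented for illustration, and `expected_counts` is a hypothetical helper.

```python
import csv
import io

PROB_COLUMNS = ("prob_clonal_early", "prob_clonal_late", "prob_subclonal")

# Invented example content; the timing labels shown are assumed, not documented here.
EXAMPLE = (
    "chromosome\tposition\tmut_type\ttiming\tchromosome2\tposition2\tsvid\t"
    "prob_clonal_early\tprob_clonal_late\tprob_subclonal\n"
    "1\t12345\tSNV\tclonal [early]\tNA\tNA\tNA\t0.7\t0.2\t0.1\n"
    "2\t67890\tIndel\tsubclonal\tNA\tNA\tNA\t0.1\t0.1\t0.8\n"
    "3\t11111\tSV\tclonal [NA]\t5\t22222\tSV_1\t0.5\t0.5\t0.0\n"
)


def expected_counts(handle):
    """Sum each probability column to get expected mutation counts per category."""
    totals = dict.fromkeys(PROB_COLUMNS, 0.0)
    for rec in csv.DictReader(handle, delimiter="\t"):  # assumed tab-delimited
        for col in PROB_COLUMNS:
            value = rec[col]
            if value not in ("", "NA"):  # skip missing probabilities
                totals[col] += float(value)
    return totals


totals = expected_counts(io.StringIO(EXAMPLE))
for name, value in totals.items():
    print(name, round(value, 3))
```

Unlike counting hard labels, these sums keep the uncertainty of mutations whose probabilities are split between categories.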