This repo contains code used to assign mutations (SNV, indel and SV) to consensus mutation clusters from subclonal reconstruction methods.
The software packages below were used to develop the code and run the pipeline on the PCAWG dataset. Installing them should normally take a few minutes via Bioconductor.

- R (version 3.1.0)

Internally, the pipeline calls MutationTimeR:

- MutationTimeR

R libraries (all installed via Bioconductor):

- Bioconductor (version 3.4)
- BiocInstaller (version 1.24.0)
- ggplot2
- gridExtra
- grid

```
R --no-save --no-restore --vanilla -f run.R --args \
-l [path to where the repository is downloaded] \
--sam [samplename] \
-o [output directory] \
--snv [PCAWG consensus SNV VCF file] \
--cna [PCAWG consensus copy number profile] \
--struct [sample subclonal architecture] \
--pur [tumour purity] \
--ploi [tumour ploidy] \
--sex [donor sex] \
--summ ${summ_tab} \
--ind [PCAWG consensus indel VCF file] \
--sv [PCAWG consensus SV VCF file] \
--sv_vaf [output file from SVclone with VAF values and copy number mapping for each SV] \
--iswgd [provide this option when the sample has undergone a whole genome doubling]
```

Once the pipeline has completed, raw cancer cell fraction estimates for each mutation can be generated by running the following command:

```
R --no-save --no-restore --vanilla -f run.R --args \
-l [path to where the repository is downloaded] \
-i [assignment pipeline output directory] \
-o [ccf output directory]
```

Probabilities of mutations being gained can be obtained by running the following command:

```
R --no-save --no-restore --vanilla -f run_prob_gained.R --args \
-l [path to where the repository is downloaded] \
-i [assignment pipeline output directory] \
-o [ccf output directory]
```

This script runs the pipeline and produces the following output files:

- [samplename]_subclonal_structure.txt : The consensus subclonal structure after assignment of mutations
- [samplename]_cluster_assignments.txt : Mutation to cluster assignment probabilities as established by the pipeline
- [samplename]_mutation_timing.txt : Classification of each mutation as clonal (early, late or not known) or subclonal, based on the cluster assignment with the highest probability
- [samplename]_assignment.RData : All output stored in an RData archive (requires loading of dependencies)
- [samplename]_pcawg11_output.RData : All PCAWG-11 output (no dependencies required)
- [samplename]_summary_table_entry.txt : A PCAWG-11 summary table entry for this sample, containing the columns provided in the input table plus summary output of this pipeline
- [samplename]_final_assignment.png : A figure showing the data and assignment, with comparison to regular binomial assignment

The text files contain the following data:
[samplename]_subclonal_structure.txt
Column | Description |
---|---|
cluster | Cluster id |
fraction_total_cells | Fraction of total cells that the cluster represents (sometimes referred to as cellular prevalence) |
fraction_cancer_cells | Fraction of tumour cells that the cluster represents (sometimes referred to as cancer cell fraction) |
n_snvs | Estimated number of SNVs belonging to this cluster |
n_indels | Estimated number of indels belonging to this cluster |
n_svs | Estimated number of SVs belonging to this cluster |
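
The two fraction columns are linked through tumour purity. The sketch below assumes the usual definition, fraction of total cells = fraction of cancer cells × purity. This README does not state that relation, so check it against your own output before relying on it. The function name and the toy values are illustrative and not part of the pipeline.

```python
def cancer_cell_fraction(fraction_total_cells, purity):
    """Convert a cellular prevalence into a cancer cell fraction.

    Assumption (not stated in this README):
    fraction_total_cells = fraction_cancer_cells * purity.
    """
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in the interval (0, 1]")
    return fraction_total_cells / purity


# Toy values, not pipeline output: two clusters in a tumour with purity 0.8.
for prevalence in (0.8, 0.4):
    print(prevalence, "->", round(cancer_cell_fraction(prevalence, 0.8), 3))
```

Under this assumption, a cluster whose fraction_total_cells equals the purity is fully clonal (fraction_cancer_cells of 1).
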
[samplename]_cluster_assignments.txt
Column | Description |
---|---|
chromosome | Chromosome on which the mutation is found |
position | Position at which the mutation is found |
mut_type | Type of mutation (SNV, Indel, SV) |
cluster_* | Probability of the mutation belonging to each cluster, one column per cluster. The column header contains the cluster id used in the subclonal structure file above |
chromosome2 | Chromosome of second end-point (SV only) |
position2 | Position of second end-point (SV only) |
svid | SV id (SV only) |
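
The cluster_* columns can be reduced to a single most likely cluster per mutation. The sketch below assumes the file is tab-separated with a header row and that missing values are written as NA. Neither is confirmed by this README. The function name and the toy table are hypothetical.

```python
import csv
import io
import math


def most_likely_clusters(handle):
    """Yield (chromosome, position, mut_type, cluster_id, probability) per row.

    cluster_id is the suffix of the cluster_* column with the highest
    probability; ties go to the first such column. Rows without any usable
    probability yield None for both cluster_id and probability.
    """
    reader = csv.DictReader(handle, delimiter="\t")  # assumed tab-separated
    cluster_cols = [c for c in (reader.fieldnames or []) if c.startswith("cluster_")]
    for row in reader:
        probs = {}
        for col in cluster_cols:
            try:
                value = float(row[col])
            except (TypeError, ValueError):
                continue  # e.g. "NA" or an empty field
            if not math.isnan(value):
                probs[col] = value
        key = (row["chromosome"], row["position"], row["mut_type"])
        if probs:
            best = max(probs, key=probs.get)
            yield key + (best[len("cluster_"):], probs[best])
        else:
            yield key + (None, None)


# Toy table for illustration only, not real pipeline output.
TOY_TABLE = (
    "chromosome\tposition\tmut_type\tcluster_1\tcluster_2\tchromosome2\tposition2\tsvid\n"
    "1\t12345\tSNV\t0.9\t0.1\tNA\tNA\tNA\n"
    "2\t67890\tSV\t0.2\t0.8\t5\t13579\tSV_1\n"
)

for record in most_likely_clusters(io.StringIO(TOY_TABLE)):
    print(record)
```

To use it on real output, pass an open file handle for [samplename]_cluster_assignments.txt instead of the toy table.
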
[samplename]_mutation_timing.txt
Column | Description |
---|---|
chromosome | Chromosome on which the mutation is found |
position | Position at which the mutation is found |
mut_type | Type of mutation (SNV, Indel, SV) |
timing | Hard-assigned timing of the mutation. This is a best guess for cases where a single call is required; using the probabilities is advised instead |
chromosome2 | Chromosome of second end-point (SV only) |
position2 | Position of second end-point (SV only) |
svid | SV id (SV only) |
prob_clonal_early | Probability of mutation being clonal and early |
prob_clonal_late | Probability of mutation being clonal and late |
prob_subclonal | Probability of mutation being subclonal |
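
One way to work with the probabilities rather than the hard timing calls is to sum each probability column, which gives the expected number of mutations per timing class. The sketch below assumes a tab-separated file with a header row and NA for missing values. Neither is confirmed by this README. The function name, the toy table and the timing labels in it are placeholders.

```python
import csv
import io
import math

PROB_COLS = ("prob_clonal_early", "prob_clonal_late", "prob_subclonal")


def expected_counts(handle):
    """Sum each probability column to get expected mutation counts per class."""
    reader = csv.DictReader(handle, delimiter="\t")  # assumed tab-separated
    totals = dict.fromkeys(PROB_COLS, 0.0)
    for row in reader:
        for col in PROB_COLS:
            try:
                value = float(row[col])
            except (KeyError, TypeError, ValueError):
                continue  # missing column, "NA" or an empty field
            if not math.isnan(value):
                totals[col] += value
    return totals


# Toy table for illustration only; the timing labels are placeholders.
TOY_TIMING = (
    "chromosome\tposition\tmut_type\ttiming\tchromosome2\tposition2\tsvid\t"
    "prob_clonal_early\tprob_clonal_late\tprob_subclonal\n"
    "1\t12345\tSNV\tearly\tNA\tNA\tNA\t0.9\t0.05\t0.05\n"
    "2\t67890\tSNV\tsubclonal\tNA\tNA\tNA\t0.1\t0.1\t0.8\n"
    "3\t24680\tIndel\tNA\tNA\tNA\tNA\tNA\tNA\tNA\n"
)

for name, total in expected_counts(io.StringIO(TOY_TIMING)).items():
    print(name, round(total, 3))
```

Unlike counting hard assignments, these sums also carry the contribution of mutations whose assignment is uncertain.
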