
Decrypting somatic mutation patterns to reveal the evolution of cancer

Treeomics: Reconstructing metastatic seeding patterns of human cancers

Developed by: JG Reiter, AP Makohon-Moore, JM Gerold, I Bozic, K Chatterjee, C Iacobuzio-Donahue, B Vogelstein, MA Nowak.


What is Treeomics?

Treeomics is a computational tool to reconstruct the phylogeny of metastases with commonly available sequencing technologies. The tool detects putative artifacts in noisy sequencing data and infers robust evolutionary trees across a variety of evaluated scenarios. For more details, see our publication Reconstructing metastatic seeding patterns of human cancers (Nature Communications, 8, 14114, http://dx.doi.org/10.1038/ncomms14114).

  • Treeomics 1.5.2 2016-10-18: Initial release with acceptance of the manuscript.
  • Treeomics 1.6.0 2016-12-09: Improves visualization of generated evolutionary trees by integrating ETE3. ILP solver explores a pool of the best solutions to more efficiently assess the support of the inferred branches.
  • Treeomics 1.7.0 2017-02-09: Uses Bayesian inference model for similarity and artifact analyses.
  • Treeomics 1.7.1 2017-02-23: Integrated python packages pyensembl and varcode to infer the gene names where variants occurred as well as their mutation effect.
  • Treeomics 1.7.2 2017-03-02: Improved visualization of predicted driver genes in HTML report and the mutation table.
  • Treeomics 1.7.3 2017-03-13: Visualize the 5 most likely evolutionary trees. Improve solution pool usage to better estimate confidence values.
  • Treeomics 1.7.4 2017-03-15: Make mutation effect prediction by VarCode optional to reduce dependencies for users.
  • Treeomics 1.7.5 2017-04-11: Improved putative driver gene analysis and HTML report. Allow multiple normal samples. Implemented optional filter of common normal variants.
  • Treeomics 1.7.6 2017-05-12: Generate new output file <subject>_variants.csv with information about the individual variants and how they were classified in the inferred phylogeny. Solved issues with subclone detection and solution pool.
  • Treeomics 1.7.7 2017-06-21: Made Treeomics ready for ultra deep targeted sequencing data. Fixed bug in calculation of branch confidence values in partial solution space. Use wkhtmltopdf to create a PDF from the HTML report.
  1. Open a terminal and clone the repository from GitHub with git clone https://github.com/johannesreiter/treeomics.git
  2. Install required packages:
  3. Install optional packages:
  1. Input files: The input to __main__.py is either
  • two tab-delimited text files -- one for variant read data and one for coverage data. Please see the files input/Makohon2017/Pam03_mutant_reads.txt and input/Makohon2017/Pam03_phredcoverage.txt included in this repository for examples.
  • VCF-files of all samples
  2. Go into the new folder with cd treeomics/src
  3. Run Treeomics with the following command: python treeomics -r <mut-reads table> -s <coverage table> -O, where <mut-reads table> is the path to a tab-separated-value file with the number of reads reporting a variant (row) in each sample (column), and <coverage table> is the path to a tab-separated-value file with the sequencing depth at the position of this variant in each sample. Instead of the two tables, a VCF file (-v) or a directory of VCF files (-d) can be given:
$ python treeomics -r <mut-reads table> -s <coverage table> | -v <vcf file> | -d <vcf file directory> -O
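The two tables described above share the same shape: one row per variant and one column per sample. The following Python sketch builds a matching pair of tab-separated tables to illustrate that shape. The sample names, variant labels, and the "Variant" header are hypothetical, and the real input files may require additional annotation columns, so please check the example files under input/Makohon2017 for the exact layout before using it.

```python
import csv
import io

# Hypothetical sample names, for illustration only.
SAMPLES = ["Primary_1", "LiverMet_1", "LungMet_1"]

# Hypothetical variants: label -> (mutant reads per sample, coverage per sample)
VARIANTS = {
    "variant_A": ([40, 35, 38], [200, 180, 190]),
    "variant_B": ([0, 22, 1], [150, 160, 170]),
}


def to_tsv(column: int) -> str:
    """Render one table (0 = mutant reads, 1 = coverage) as tab-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["Variant"] + SAMPLES)  # "Variant" is an assumed header name
    for label, counts in VARIANTS.items():
        writer.writerow([label] + counts[column])
    return buffer.getvalue()


# Sanity check: mutant reads can never exceed the sequencing depth.
for label, (mutant, depth) in VARIANTS.items():
    assert all(m <= d for m, d in zip(mutant, depth)), label

mut_reads_tsv = to_tsv(0)
coverage_tsv = to_tsv(1)
print(mut_reads_tsv)
print(coverage_tsv)
```

Both tables must list the same variants in the same order and the same samples in the same columns, since each coverage entry is the depth at the position of the corresponding variant.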
Optional parameters:
  • -e : Sequencing error rate e in the Bayesian inference model (default 1.0%)
  • -a : Maximum VAF for an absent variant f_absent before considering the estimated purity (default 5%)
  • -z : Prior probability for a variant being absent c_0 (default 0.5).
  • -o : Provide different output directory (default src/output)
  • -n : If a normal sample is provided, variants significantly present in the normal are removed. Additional normal samples can be provided via a space-separated enumeration. E.g. -n FIRSTNORMALSAMPLE SECONDNORMALSAMPLE
  • -x : Space-separated enumeration of sample names to exclude from the analysis. E.g. -x FIRSTEXCLUDEDSAMPLE SECONDEXCLUDEDSAMPLE
  • --pool_size : Number of best solutions explored by ILP solver to assess the support of the inferred branches (default 1000)
  • -b : Number of bootstrapping samples (default 0); Generally using the solution pool instead of bootstrapping seems to be the more efficient way to assess confidence.
  • -u : Enables subclone detection (default False)
  • -c : Minimum median coverage of a sample to be considered (default 0)

  • -f : Minimum median mutant allele frequency of a sample to be considered (default 0)

  • -p : False-positive rate of conventional binary classification (only relevant for artifact comparison)

  • -i : Targeted false-discovery rate of conventional binary classification (only relevant for artifact comparison)

  • -y : Minimum coverage for a powered absent variant (only relevant for artifact comparison)

  • -t : Maximum running time for CPLEX to solve the MILP (in seconds, default None). If not None, the obtained solution is no longer guaranteed to be optimal

  • --threads=<N> Maximal number of parallel threads that will be invoked by CPLEX (0: default, let CPLEX decide; 1: single threaded; N: uses up to N threads)

  • -l : Maximum number of considered mutation patterns per variant (default None). If not None, the obtained solution is no longer guaranteed to be optimal

  • --driver_genes=<path to file> Path to CSV file with names of putative driver genes highlighted in inferred phylogeny (default --driver_genes=../input/Tokheim_drivers_union.csv)

  • --wes_filtering Removes intronic and intergenic variants in WES data (default False)

  • --common_vars_file Path to a file with variants that are common in normal samples; these variants are removed from the analysis (default None)

  • --no_plots Disables generation of X11-dependent plots (useful for benchmarking; by default, plots are generated)

  • --no_tikztrees Disables generation of LaTeX trees, which do not depend on X11 (by default, LaTeX trees are generated)

  • --benchmarking Generates mutation matrix and mutation pattern files that can be used for automatic benchmarking of in silico data (default False)
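
To give some intuition for how a sequencing error rate (-e) and a prior probability of absence (-z) can interact, here is a textbook two-hypothesis Bayes calculation in Python. It is only an illustrative sketch and not the inference model implemented in Treeomics: it assumes that, if the variant is absent, each read shows it with probability equal to the error rate, and that, if it is present, its VAF is uniform on [0, 1]. The function names are hypothetical; only the default values are taken from the parameter list above.

```python
import math


def log_binomial_pmf(k: int, n: int, p: float) -> float:
    """Log-probability of k successes in n trials with success probability 0 < p < 1."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log1p(-p)


def posterior_absent(mutant_reads: int, coverage: int,
                     error_rate: float = 0.01, prior_absent: float = 0.5) -> float:
    """Posterior probability that a variant is absent in a sample (toy model)."""
    # Absent hypothesis: every mutant read is a sequencing error.
    log_like_absent = log_binomial_pmf(mutant_reads, coverage, error_rate)
    # Present hypothesis (assumption): VAF uniform on [0, 1]; the binomial
    # likelihood then integrates to 1 / (coverage + 1) for any read count.
    log_like_present = -math.log(coverage + 1)
    log_absent = math.log(prior_absent) + log_like_absent
    log_present = math.log(1.0 - prior_absent) + log_like_present
    # Normalize in log space to avoid underflow at very high coverage.
    top = max(log_absent, log_present)
    w_absent = math.exp(log_absent - top)
    w_present = math.exp(log_present - top)
    return w_absent / (w_absent + w_present)


# Hypothetical read counts at a coverage of 100.
for reads in (0, 3, 20):
    print(reads, posterior_absent(reads, 100))
```

In this toy model, a handful of mutant reads at moderate coverage is still compatible with sequencing error, while a larger number makes absence very unlikely; lowering the error rate or the prior shifts that boundary.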

Default parameter values as well as the output directory can be changed in treeomics/src/treeomics/settings.py. Moreover, settings.py provides further options, such as the annotation of driver genes and the configuration of plot output names. All plots, analysis and logging files, and the HTML report are written to the output directory.
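
When Treeomics is run repeatedly with different parameters, it can be convenient to assemble the command line in a script. The Python sketch below builds an argument list from the flags documented above; the helper name build_command and its keyword arguments are hypothetical, and the -O flag is simply copied from the documented commands. The resulting list could be passed to subprocess.run from within treeomics/src.

```python
import shlex


def build_command(mut_reads, coverage, normals=(), error_rate=None,
                  prior_absent=None, output_dir=None, threads=None):
    """Assemble a Treeomics command line from documented flags (hypothetical helper)."""
    cmd = ["python", "treeomics", "-r", mut_reads, "-s", coverage]
    if normals:
        cmd += ["-n", *normals]            # one or more normal sample names
    if error_rate is not None:
        cmd += ["-e", str(error_rate)]     # sequencing error rate
    if prior_absent is not None:
        cmd += ["-z", str(prior_absent)]   # prior probability of absence
    if output_dir is not None:
        cmd += ["-o", output_dir]          # alternative output directory
    if threads is not None:
        cmd.append(f"--threads={threads}")
    cmd.append("-O")                       # as in the documented commands
    return cmd


# Same flags as Example 1 in this README.
command = build_command(
    "input/Makohon2017/Pam03_1-10_mutant_reads.txt",
    "input/Makohon2017/Pam03_1-10_phredcoverage.txt",
    normals=["Pam03N3"],
    error_rate=0.005,
)
print(shlex.join(command))
```

Keeping the arguments in a list rather than a single string avoids quoting problems when paths contain spaces.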

Optional input:
  • Driver gene annotation: Treeomics highlights any non-synonymous or splice-site variants (if VarCode is available, otherwise all variants) in putative driver genes given in a CSV file under DRIVER_PATH in treeomics/src/treeomics/settings.py. By default, the union of the driver genes reported by 20/20+, TUSON, and MutsigCV from Tokheim et al. (PNAS, 2016) is used (see treeomics/src/input/Tokheim_drivers_union.csv). Any CSV file can be used as long as it has a column named 'Gene_Symbol'. Variants in the provided genes will be highlighted in the HTML report as well as in the inferred phylogeny.
  • Cancer Gene Census (CGC) annotation: Variants that have been identified as likely drivers in the provided genes (under DRIVER_PATH) will be checked for whether they occurred in the region reported in the given CSV file (default treeomics/src/input/cancer_gene_census_grch37_v80.csv; CGC version 80, reference genome hg19).
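
A custom driver gene list only needs the 'Gene_Symbol' column mentioned above. The following Python sketch writes such a CSV and checks that the required column is present before the file is handed to Treeomics via --driver_genes. The helper names are hypothetical and the gene symbols are merely illustrative; any additional columns that the default list may contain are not reproduced here.

```python
import csv
import io

REQUIRED_COLUMN = "Gene_Symbol"  # column name required by the README


def driver_csv(genes) -> str:
    """Render a minimal driver gene CSV with a single Gene_Symbol column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([REQUIRED_COLUMN])
    for gene in genes:
        writer.writerow([gene])
    return buffer.getvalue()


def read_gene_symbols(text: str) -> list:
    """Return the gene symbols from CSV text, failing early if the column is missing."""
    reader = csv.DictReader(io.StringIO(text))
    if REQUIRED_COLUMN not in (reader.fieldnames or []):
        raise ValueError(f"CSV needs a column named {REQUIRED_COLUMN!r}")
    return [row[REQUIRED_COLUMN] for row in reader]


# Illustrative gene symbols only; replace with your own list.
csv_text = driver_csv(["KRAS", "TP53", "SMAD4"])
print(read_gene_symbols(csv_text))
```

Writing csv_text to a file and passing its path with --driver_genes=<path to file> would then replace the default list for that run.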

Example 1:

$ python treeomics -r input/Makohon2017/Pam03_1-10_mutant_reads.txt -s input/Makohon2017/Pam03_1-10_phredcoverage.txt -n Pam03N3 -e 0.005 -O

Reconstructs the phylogeny of pancreatic cancer patient Pam03 based on targeted sequencing data of 5 distinct liver metastases, 3 distinct lung metastases, and 2 samples of the primary tumor.

Example 2:

$ python treeomics -r input/Bashashati2013/Case5_mutant_reads.txt -s input/Bashashati2013/Case5_coverage.txt -e 0.005 -O

Reconstructs the phylogeny of the high-grade serous ovarian cancer of Case 5 in Bashashati et al. (2013).



If you have any questions, you can contact us (https://github.com/johannesreiter) and we will try to help.


Copyright (C) 2017 Johannes Reiter

Treeomics is licensed under the GNU General Public License, Version 3. This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 3 of the License. There is no warranty for this free software.


Author: Johannes Reiter, Harvard University, http://www.people.fas.harvard.edu/~reiter