Model written in Python for deconvolution and phylogeny inference of bulk tumor genomic data using structural variants and copy number segments. This model requires you to obtain a Gurobi license (free for academics).
The pre-print of our paper can be found on bioRXiv here https://www.biorxiv.org/content/early/2018/01/30/257014.
The BibTeX citation for this paper is below.
@article {Eaton257014,
author = {Eaton, Jesse and Wang, Jingyi and Schwartz, Russell},
title = {Deconvolution and phylogeny inference of structural variations in tumor genomic samples},
year = {2018},
doi = {10.1101/257014},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2018/01/30/257014},
eprint = {https://www.biorxiv.org/content/early/2018/01/30/257014.full.pdf},
journal = {bioRxiv}
}
Navigate to a directory where you would like to keep the tusv project. Run the command
git clone https://github.com/jaebird123/tusv.git
If you do not have the git
command, download and install it from here https://git-scm.com/book/en/v2/Getting-Started-Installing-Git.
The packages you will need to install are listed below.
numpy
graphviz
ete2
biopython
gurobipy
To install these, you will need the pip
command which comes pre-installed with Python. If you do not have the pip
command, download and install it from here https://pip.pypa.io/en/stable/installing/. Then run the command pip install <package_name>
for each of the above packages.
To obtain a Gurobi license, sign up as an academic user here http://www.gurobi.com/registration/academic-license-reg and follow the instructions for downloading a license. If you are outside of academia, I sincerely apologize as the license is very expensive.
To check that you are able to use Gurobi, after obtaining a license, navigate to the model/
directory with
cd tusv/model/
and run the command
python test_solver.py
If you have successfully installed gurobipy
and obtained a Gurobi license, the script will output the standard Gurobi solver text and finish executing after approximately 30 seconds.
Two python scripts can be run.
- tusv.py
- multi_tusv.py
The script tusv.py
takes as input a single directory containing one or multiple .vcf
files. Go here https://samtools.github.io/hts-specs/VCFv4.2.pdf for specifications on the .vcf
format. Sample input can be found in tusv/data/example/patient1/
. Each .vcf
file corresponds to a patient sample and all records inside each .vcf
file contain the structural variants and copy number segments for that sample. The script multi_tusv.py
is a wrapper script of tusv.py
and takes as input a directory where each subdirectory contains one or multiple .vcf
files. Both scripts take required inputs
-n
number of leaves to infer in phylogenetic tree-c
maximum copy number allowed for any breakpoint or segment on any node-t
maximum number of coordinate-descent iterations (program can finish sooner if convergence is reached)-r
number of random initializations of the coordinate-descent algorithm
and optional inputs
-b
(recommended) binary flag to automatically set hyper-parameters lambda 1 and lambda 2-l
lambda 1 hyper-parameter. controls phylogenetic cost-a
lambda 2 hyper-parameter. controls breakpoint to segment consistancy-m
maximum time (in seconds) for a single cordinate-descent iteration-s
number of segments (in addition to those containing breakpoints) that are randomly kept for unmixing. default keeps all segments-p
(not recommended) number of processors to use. uses all available processors by default-d
(not recommended) file containing metadata information for output .vcf files
tusv requires mixed copy numbers of segments and breakpoints for a bulk tumor. We recommend using Weaver which can be found here https://github.com/ma-compbio/Weaver before running tusv.
Authors: Jesse Eaton and Jingyi Wang
License: MIT https://github.com/jaebird123/tusv/blob/master/LICENSE
Acknowledgements: We thank Ashok Rajaraman and Jian Ma for helpful discussions and assistance with Weaver. Portions of this work have been funded by the U.S. National Institutes of Health via award R21CA216452 and Pennsylvania Dept. of Health Grant GBMF4554 4100070287. The Pennsylvania Department of Health specifically disclaims responsibility for any analyses, interpretations or conclusions. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number OCI-1053575. Specifically, it used the Bridges system, which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center (PSC). The results published here are in whole or part based upon data generated by The Cancer Genome Atlas managed by the NCI and NHGRI. Information about TCGA can be found at http://cancergenome.nih.gov.