See the corresponding papers for full details: Reiter et al., An analysis of genetic heterogeneity in untreated cancers (Nature Reviews Cancer, 2019), and Reiter, Makohon-Moore, Gerold et al., Minimal functional driver gene heterogeneity among untreated metastases (Science, 2018).
========
LiFD is a two-phase algorithm to predict likely functional driver (LiFD) mutations that integrates information from multiple databases and bioinformatic methods.
During the first phase of LiFD, variants are annotated as functional if they are present in OncoKB, in the CGI (Cancer Genome Interpreter) catalog of validated oncogenic mutations (driver prediction “known”), among known cancer hotspots, or at least 4 times in COSMIC (Catalogue of Somatic Mutations in Cancer).
If a variant is not annotated as functional in the first phase, LiFD uses CHASMplus, FATHMM, CanDrA+, CGI, and VEP to predict the functional consequences of individual mutations in the second phase.
By default, LiFD requires a q-value of at most 0.1 for CHASMplus.
For FATHMM, LiFD uses the recommended threshold of at most -0.75.
For CanDrA, LiFD uses a significance threshold of 0.05 for driver predictions.
For CGI, LiFD requires “predicted” in its driver classification column and for VEP LiFD requires “HIGH” in its predicted impact column.
If the majority (>50%) of the methods that produce a valid result predict functionality, LiFD annotates the mutation as likely functional and otherwise as unlikely functional.
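The two phases described above can be sketched in a few lines of Python. This is a minimal illustration of the decision rule as stated in the text, not the LiFD implementation: the function names are hypothetical, the dictionary keys reuse the column names from this README, and the string `"Driver"` for CanDrA driver calls is an assumption. A variant with no valid second-phase result is treated as unlikely functional here.

```python
# Default thresholds as stated in the text above.
COSMIC_MIN_COUNT = 4
CHASMPLUS_MAX_QVALUE = 0.1
FATHMM_MAX_SCORE = -0.75
CANDRA_MAX_SIGNIFICANCE = 0.05


def phase_one(variant):
    """Return True if a database lookup already marks the variant as functional."""
    return bool(
        variant.get("In_OncoKB")
        or variant.get("In_CGI_Catalog")
        or variant.get("In_Hotspots")
        or (variant.get("In_COSMIC") or 0) >= COSMIC_MIN_COUNT
    )


def phase_two_votes(variant):
    """Collect one True/False vote per method that produced a valid result."""
    votes = []
    q_value = variant.get("CHASMplus_CT_Pvalue_corr")
    if q_value is not None:
        votes.append(q_value <= CHASMPLUS_MAX_QVALUE)
    score = variant.get("FATHMM_score")
    if score is not None:
        votes.append(score <= FATHMM_MAX_SCORE)
    clf, sig = variant.get("CanDrA_clf"), variant.get("CanDrA_sig")
    if clf is not None and sig is not None:
        # Assumption: CanDrA labels driver calls with the string "Driver".
        votes.append(clf == "Driver" and sig <= CANDRA_MAX_SIGNIFICANCE)
    cgi = variant.get("CGI_driver")
    if cgi is not None:
        votes.append("predicted" in str(cgi))
    vep = variant.get("VEP_impact")
    if vep is not None:
        votes.append(vep == "HIGH")
    return votes


def is_likely_functional(variant):
    """Phase 1 short-circuits; otherwise require a strict majority in phase 2."""
    if phase_one(variant):
        return True
    votes = phase_two_votes(variant)
    return len(votes) > 0 and sum(votes) > len(votes) / 2
```

Methods without a result are skipped rather than counted as negative votes, so a variant scored by only two tools needs both to agree to exceed 50%.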
Default reference genome is hg19. Only some of the pooled tools support hg20.
- LiFD 0.1.0 2019-08-27: Initial release.
- Install Python 3.6 (https://www.python.org/downloads). Check the installation with `which python3.6`. Load Python 3.6 with `ml python/3.6` if installing LiFD onto a remote server/cluster.
- Install the required packages with `pip install numpy scipy pandas statsmodels xlrd openpyxl xlsxwriter`; see NumPy (http://www.numpy.org), SciPy (https://www.scipy.org), and pandas (http://pandas.pydata.org/).
- Install PyEnsembl (https://github.com/openvax/pyensembl) and Varcode (https://github.com/openvax/varcode), used for mutation effect annotation, with `pip install pyensembl varcode`. Run `pyensembl install --release 75 --species human` to get the latest genome database for hg19/GRCh37.
- Open a terminal, clone the repository from GitHub with `git clone https://github.com/johannesreiter/LiFD_dev.git`, and install LiFD into your Python environment by running `pip3.6 install -e <LiFD_dev_directory>`.
- Test the installation by opening a Python shell with `python3.6` and executing the commands `import lifd` and `lifd.__version__`. The output should be your current LiFD version. Exit the shell with `exit()`.
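As a quick complement to the manual check, the following sketch reports which of the packages named in the installation steps cannot be found in the active environment. The helper name `missing_packages` is hypothetical and not part of LiFD; it only uses the standard library.

```python
import importlib.util

# Packages named in the installation steps above, plus LiFD itself.
REQUIRED = ["numpy", "scipy", "pandas", "statsmodels", "xlrd",
            "openpyxl", "xlsxwriter", "pyensembl", "varcode", "lifd"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    print("All packages found." if not missing else "Missing: " + ", ".join(missing))
```

The check only confirms that each package is locatable; it does not import it or verify its version.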
LiFD takes as input either a CSV or an Excel file where the following columns are required for the different methods.
See directory `examples` and files `SupplementaryTable1.xlsx` or `SupplementaryTable3.xlsx`.
If boolean column `MaybeFunctional` (denoting whether variant is a nonsynonymous or splice-site mutation) is not provided, then Varcode will be invoked to assess mutation effects.
Either a boolean column `DriGeneClf` (denoting whether variant occurred in putative driver gene) needs to be provided or a CSV file with a list of putative driver genes needs to be given (e.g., `examples/BaileyDing2018_driverconsensus.csv`).
Default settings can be configured in `lifd/settings.py`.
- CGI (known): Requires boolean column `In_CGI_Catalog`
- COSMIC: Requires numerical column `In_COSMIC`
- Hotspots: Requires boolean column `In_Hotspots`
- OncoKB: Requires boolean column `In_OncoKB`
- CGI (prediction): Requires column `CGI_driver`
- CanDrA: Requires `CanDrA_clf` and `CanDrA_sig`
- CHASMplus: Requires `CHASMplus_CT_Pvalue_corr`
- FATHMM: Requires column `FATHMM_score`
- VEP: Requires column `VEP_impact`
For a full example see the IPython notebook `examples/intraprimary_heterogeneity_analysis.ipynb` or `examples/intermetastatic_heterogeneity_analysis.ipynb`.
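Before running LiFD on a pre-annotated file, it can help to see which methods the file actually supports. The sketch below maps the per-method column requirements listed above onto a CSV header; the function `usable_methods` is a hypothetical helper, not part of the LiFD API.

```python
import csv
import io

# Columns each method needs, as listed above.
METHOD_COLUMNS = {
    "CGI (known)": ["In_CGI_Catalog"],
    "COSMIC": ["In_COSMIC"],
    "Hotspots": ["In_Hotspots"],
    "OncoKB": ["In_OncoKB"],
    "CGI (prediction)": ["CGI_driver"],
    "CanDrA": ["CanDrA_clf", "CanDrA_sig"],
    "CHASMplus": ["CHASMplus_CT_Pvalue_corr"],
    "FATHMM": ["FATHMM_score"],
    "VEP": ["VEP_impact"],
}


def usable_methods(csv_text):
    """Return the methods whose required columns all appear in the CSV header."""
    header = next(csv.reader(io.StringIO(csv_text)))
    present = {name.strip() for name in header}
    return [method for method, columns in METHOD_COLUMNS.items()
            if all(column in present for column in columns)]
```

For example, a file that provides `CanDrA_clf` but not `CanDrA_sig` would not be reported as usable for CanDrA.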
To invoke the various methods automatically, several dependencies need to be configured, which can be cumbersome.
Download the following databases into a directory and set variable `DB_DIR` in `lifd/settings.py` to the path of that directory (e.g., `/src/lifd/databases`):
- OncoKB v1.21 (https://github.com/oncokb/oncokb-public/blob/master/data/v1.21/allAnnotatedVariants.txt); find all versions here (https://github.com/oncokb/oncokb-public/tree/master/data). Set variable `ONCOKB_ALLVARS_FP` in `lifd/settings.py` to the path of the downloaded file.
- Cancer Hotspots V2 (https://www.cancerhotspots.org/#/download). Set variable `HOTSPOTS_FP` in `lifd/settings.py` to the path of the downloaded file. Note that the original study used version 1 of the hotspots.
- COSMIC Mutation Data Genome Screens v89 (https://cancer.sanger.ac.uk/cosmic/download?genome=37). Set variable `COSMIC_VARS_FP` in `lifd/settings.py` to the path of the downloaded file.
- CGI Catalog of Validated Oncogenic Mutations (https://www.cancergenomeinterpreter.org/mutations). Set variable `ONCOGENIC_VARS_FP` in `lifd/settings.py` to the path of the downloaded file.
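The resulting entries in `lifd/settings.py` might look like the sketch below. Only the variable names and `allAnnotatedVariants.txt` come from this README; the other file names are assumptions about what the downloads are called, and `missing_database_files` is a hypothetical helper for checking the paths.

```python
import os

# Directory holding the downloaded databases (example path from the text above).
DB_DIR = "/src/lifd/databases"

ONCOKB_ALLVARS_FP = os.path.join(DB_DIR, "allAnnotatedVariants.txt")
# Assumed file names: adjust them to match the files you actually downloaded.
HOTSPOTS_FP = os.path.join(DB_DIR, "hotspots_v2.xls")
COSMIC_VARS_FP = os.path.join(DB_DIR, "CosmicGenomeScreensMutantExport.tsv.gz")
ONCOGENIC_VARS_FP = os.path.join(DB_DIR, "catalog_of_validated_oncogenic_mutations.tsv")


def missing_database_files(paths):
    """Return the configured paths that do not point to an existing file."""
    return [path for path in paths if not os.path.isfile(path)]


if __name__ == "__main__":
    configured = [ONCOKB_ALLVARS_FP, HOTSPOTS_FP, COSMIC_VARS_FP, ONCOGENIC_VARS_FP]
    for path in missing_database_files(configured):
        print("Not found:", path)
```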
Install Pysam (https://github.com/pysam-developers/pysam), used to find reference alleles for indels, with `pip3.6 install pysam`. Download a FASTA file for hg19 (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz) into the database directory.
Set up the predictive tools (and their cancer-specific models) to create annotations for LiFD automatically:
- CanDrA.v+ (https://bioinformatics.mdanderson.org/main/CanDrA)
- CRAVAT and CHASMplus (https://chasmplus.readthedocs.io/en/latest/)
- CGI is a web tool and no installation is required. However, a login needs to be created at https://www.cancergenomeinterpreter.org. A file `cgi_settings.py` containing the credentials is expected at `src/lifd/cgi_settings.py` in the following format:

      CGI_USER_ID = '<YOUR USERNAME>'
      CGI_TOKEN = '<YOUR CGI TOKEN>'
- FATHMM (http://fathmm.biocompute.org.uk/downloads.html)
  - The files for FATHMM (`fathmm.py` and `config.ini`) should be located under `DB_DIR` in the folder `fathmm/cgi-bin`. The following edits should be made to `fathmm.py` to make the file compatible with Python 3, if desired:
    - Change all lines "except Exception, e:" to "except Exception as e:"
    - Change line "import ConfigParser" to "import configparser as ConfigParser"
    - FATHMM requires MySQL (https://www.mysql.com/downloads/) or MariaDB. If using MariaDB, which is common for remote servers/clusters, replace the line that initializes dbCursor with the following:

          mariadb_connection = mariadb.connect(
              host = str(Config.get("DATABASE", "HOST")),
              port = int(Config.get("DATABASE", "PORT")),
              unix_socket = str(Config.get("DATABASE", "UNIX_SOCKET")),
              user = str(Config.get("DATABASE", "USER")),
              password = str(Config.get("DATABASE", "PASSWD")),
              database = str(Config.get("DATABASE", "DB")))
          dbCursor = mariadb_connection.cursor(dictionary = True)

      Load MariaDB with `ml system mariadb`. If using MySQL, start the server with `mysql.server start`.
  - Check the line `COMMAND = 'python3 fathmm.py -w Cancer {} {}'` in `lifd/databases/fathmm.py` and correct it to reflect the version of Python. Note that a separate installation of Python 2 will be necessary to run `fathmm.py` if the above corrections are not made.
  - As written on the FATHMM installation page, place information about the user, database, etc. in the `config.ini` file as follows:

        [DATABASE]
        HOST = <MySQL Host>
        PORT = <MySQL Port>
        USER = <MySQL Username>
        PASSWD = <MySQL Password>
        DB = fathmm

    It may be necessary to add `UNIX_SOCKET = <MySQL/MariaDB Socket>` to the FATHMM config file, especially for a remote server implementation. On a remote server, run `mysqld_safe` on the command line to start a mysqld server; the process blocks the terminal while it serves MySQL commands, so terminal multiplexers such as tmux are especially useful for this task.
  - Download MySQLdb with `pip install MySQL-python` for Python 2 and `pip install mysqlclient` for Python 3 if using MySQL. For MariaDB, download mysql.connector with `pip install mysql-connector-python`. Delete all `import MySQLdb` statements and insert `import mysql.connector as mariadb` in the downloaded `fathmm.py` file.
- VEP (https://uswest.ensembl.org/info/docs/tools/vep/script/vep_download.html)
  - VEP requires Perl (https://www.perl.org/get.html) and MySQL (https://www.mysql.com/downloads/). On a remote server, load Perl and MariaDB with `ml perl` and `ml system mariadb`, respectively. Install the required Perl packages (Archive::Zip, DBD::mysql, DBI) with `cpanm [package]` before running the VEP setup file with `perl INSTALL.pl`. The directory of the cache can be assigned with an optional argument (`--CACHEDIR [dir]`) during setup. Set variable `VEP_CACHE` to the path of the VEP cache (initially set to the default location of the cache).
  - Note that some dependencies of various Perl packages may fail during either installation of the requirements or the setup, such as `Test::Pod::Coverage` or `XML::DOM::XPath`; keep track of these packages and install them separately, possibly using the `--force` argument to bypass outdated tests. Dependencies of dependencies may also fail, in which case this process should be repeated.
  - Note that many packages may already exist on a remote server. Load some of the packages with `ml libgd` and `ml biology htslib`.
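The two Python 3 edits named in the FATHMM item above ("except Exception, e:" and "import ConfigParser") can also be applied programmatically. The sketch below is a hypothetical helper that covers only those two textual changes; it makes no claim that they are sufficient for every version of `fathmm.py`, so review the result by hand.

```python
import re


def port_fathmm_to_python3(source):
    """Apply the two textual edits described above to Python 2 source text."""
    # "except Exception, e:" -> "except Exception as e:"
    source = re.sub(r"except\s+Exception\s*,\s*e\s*:",
                    "except Exception as e:", source)
    # "import ConfigParser" -> "import configparser as ConfigParser"
    source = re.sub(r"^([ \t]*)import ConfigParser[ \t]*$",
                    r"\1import configparser as ConfigParser",
                    source, flags=re.MULTILINE)
    return source
```

Running the function a second time leaves the text unchanged, so it is safe to apply to a file that has already been edited.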
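To check a FATHMM `config.ini` before starting the database client, the `[DATABASE]` section can be read with the standard library. This is a sketch under assumptions: the example values are placeholders, and `connection_kwargs` is a hypothetical helper whose output uses the keyword names that `mysql.connector.connect` accepts (`host`, `port`, `user`, `password`, `database`, `unix_socket`).

```python
import configparser

# Placeholder values for illustration only.
EXAMPLE_CONFIG = """
[DATABASE]
HOST = localhost
PORT = 3306
UNIX_SOCKET = /tmp/mysql.sock
USER = fathmm_user
PASSWD = secret
DB = fathmm
"""


def connection_kwargs(config_text):
    """Translate the [DATABASE] section into keyword arguments for a connector."""
    config = configparser.ConfigParser()
    config.read_string(config_text)
    db = config["DATABASE"]
    kwargs = {
        "host": db.get("HOST"),
        "port": db.getint("PORT"),
        "user": db.get("USER"),
        "password": db.get("PASSWD"),
        "database": db.get("DB"),
    }
    # The socket is a filesystem path, so it stays a string; it is optional.
    if db.get("UNIX_SOCKET"):
        kwargs["unix_socket"] = db.get("UNIX_SOCKET")
    return kwargs
```

Only the port is converted to an integer; omitting `UNIX_SOCKET` from the file simply leaves that argument out.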
LiFD takes as input either a CSV or an Excel file with the following required columns: `Chromosome`, `StartPosition`, `EndPosition`, `ReferenceAllele`, and `AlternateAllele`.
See `examples/example_variants.xlsx` for format.
Some predictors also require a `CancerType` column according to the TCGA abbreviations (https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/tcga-study-abbreviations).
If boolean column `MaybeFunctional` (denoting whether variant is a nonsynonymous or splice-site mutation) is not provided, then Varcode will be invoked.
For a full example see the IPython notebook `examples/lifd_examples.ipynb`.
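A variant table can be sanity-checked before it is handed to LiFD. The sketch below verifies that the required columns are present and that the positions are integers; `check_variant_table` is a hypothetical helper, and the rule that `StartPosition` must not exceed `EndPosition` is an assumption of this example rather than a documented LiFD requirement.

```python
import csv
import io

REQUIRED_COLUMNS = ["Chromosome", "StartPosition", "EndPosition",
                    "ReferenceAllele", "AlternateAllele"]


def check_variant_table(csv_text):
    """Return a list of human-readable problems found in a variant CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames or []
    missing = [column for column in REQUIRED_COLUMNS if column not in fieldnames]
    if missing:
        return ["missing column: " + column for column in missing]
    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        try:
            start, end = int(row["StartPosition"]), int(row["EndPosition"])
        except (TypeError, ValueError):
            problems.append("line %d: positions must be integers" % line_no)
            continue
        if start > end:  # assumption of this sketch, see the text above
            problems.append("line %d: StartPosition exceeds EndPosition" % line_no)
    return problems
```

An empty list means the table passed these basic checks; it does not guarantee that every predictor will accept the file.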
========
If you have any questions, you can contact us (https://github.com/johannesreiter) and we will try to help.
Copyright (C) 2019 Johannes Reiter
LiFD is licensed under the GNU General Public License, Version 3. This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 3 of the License. There is no warranty for this free software.
========
Author: Johannes G. Reiter, Stanford University, https://reiterlab.stanford.edu