Clinical interpretation of Cancer somatic Variants (CancerVar) and Oncogenic Prioritization by Artificial Intelligence (OPAI)
CancerVar takes either pre-annotated files, or unannotated input files in VCF format or ANNOVAR input format, where each line corresponds to one genetic variant; CancerVar will call ANNOVAR to generate necessary annotations. In the output, based on all 12 pieces of evidence, each variant will be assigned as "Tier_I_strong", "Tier_II_potential", "Tier_IV_benign" and "Tier_III_Uncertain" by rules specified in the AMP/ASCO/CAP 2017 guidelines.
OPAI takes 12 clinical evidence scores from CancerVar and 23 pre-computed in silico scores predicted by other computational tools from ANNOVAR as input, and predicts oncogenicity by a semi-supervised deep-learning model.
CanverVar and OPAI are Python based scripts. The user need to run CancerVar firstly as step 1 to get clinical evidence-based interpretation results and then run OPAI as step 2 if they want to get the deep-learning model-based oncogenicity prediction.
CancerVar.py [options]
CanverVar is a python script for cancer variant interpretation of clinical significance.
- You need install Python >=3.6
- You need install [ANNOVAR](http://annovar.openbioinformatics.org/en/latest/) version >= 2016-02-01.
- Most of the datases can be downloaded automatically.
- Some updated datasets(cosmic and icgc) for Annovar: https://cancervar.wglab.org/databases/ (download and gunzip, put in the Annovar db folder)
- Please use the updated files, outdated files will bring some problems of running CancerVar.
-
-h, --help show this help message and exit
-
--version show program''s version number and exit
-
--config=config.ini Load your config file. The config file contains all options.
if you use this options,you can ignore all the other options bellow.
-
-i INPUTFILE, --input=INPUTFILE input file of variants for analysis
-
--input_type=AVinput The input file type, it can be AVinput(Annovar''sformat),VCF
-
--cancer_type=CANCER The cancer type, please check the help for the details of cancer type: Adrenal_Gland Bile_Duct Bladder Blood Bone Bone_Marrow Brain Breast Cancer_all Cervix Colorectal Esophagus Eye Head_and_Neck Inflammatory Intrahepatic Kidney Liver Lung Lymph_Nodes Nervous_System Other Ovary Pancreas Pleura Prostate Skin Soft_Tissue Stomach Testis Thymus Thyroid Uterus,if you are using avinput file, you can can specify the cancer type in the 6th column
-
-o OUTPUTFILE, --output=OUTPUTFILE prefix the output file (default:output)
-
-b BUILDVER, --buildver=BUILDVER version of reference genome: hg38, hg19(default)
CancerVar Other Options:
-
-t cancervardb, --database_intervar=cancervardb The database location/dir for the CancerVar dataset files
-
-s your_evidence_file, --evidence_file=your_evidence_file User specified Evidence file for each variant
Annovar Options( check these options from manual of Annovar):
-
--table_annovar=./table_annovar.pl The Annovar perl script of table_annovar.pl
-
--convert2annovar=./convert2annovar.pl The Annovar perl script of convert2annovar.pl
-
--annotate_variation=./annotate_variation.pl The Annovar perl script of annotate_variation.pl
-
-d humandb, --database_locat=humandb The database location/dir for the Annovar annotation datasets
python3.6 ./CancerVar.py -c config.ini # Run the examples in config.ini
python3.6 ./CancerVar.py -b hg19 -i your_input --input_type=VCF -o your_output
python3.6 ./CancerVar.py -b hg19 -i example/FDA_hg19.av -o example/FDA
The clinical interpretation results are in the ouput file of "*.cancervar", the column of "CancerVar: CancerVar and Evidence" is the evidence and final interpretation.
After running CancerVar correctly and getting the output files of "*.cancervar" and "*.grl_p",we are ready to run Oncogenic Prioritization by Artificial Intelligence.
OPAI is a python script for Oncogenic Prioritization by Artificial Intelligence after CancerVar. OPAI firstly call feature_preprocess.py to process the features coding from CancerVar and Annovar output, then call opai_predictor.py to predict the oncogenicity.
The OPAI scripts are in the scripts folder of “OPAI”:
- feature_preprocess.py:
- preprocessing the ANNOVAR data and CancerVar output to generate OPAI input;
- opai_predictor.py:
- predicting the oncogenicity of a variant.
OPAI has currently only been tested with Python 3.6+, and requires four Python modules to be installed and in path. These are numpy https://numpy.org, pandas https://pandas.pydata.org , scikit-learn https://scikit-learn.org and pytorch https://pytorch.org.
There are two ways to install these modules:
- Using CONDA and manage the environment.
conda create -n opai python=3.6
conda activate opai
conda install -c anaconda numpy pandas scikit-learn
conda install -c pytorch pytorch=1.9
- Using pip
python3.6 -m pip install numpy --user
python3.6 -m pip install pandas --user
python3.6 -m pip install scikit-learn --user
python3.6 -m pip install torch --user
There are two trained models for prediction in OPAI, located in the folder of "saves":
- Ensemble-based model:
- both clinical evidence score and 23 pre-computed in silico scores are taken as input of the model;
- model file:
ensemble.pt
- Evidence-based model:
- only clinical evidence score are taken as input of the model, this is useful for case of a lot or even all the missing values in 23 pre-computed in silico scores.
- model file:
evs.pt
Users can specify the model by using the -m ensemble
or -m evs
option and then following the -d model_file_location
option.
After running of python3.6 ./CancerVar.py -b hg19 -i example/FDA_hg19.av -o example/FDA
, check files of example/FDA.hg19_multianno.txt.grl_p
and example/FDA.hg19_multianno.txt.cancervar
, see if they are generated correctly.
Then,
- using Ensemble-based model
python3.6 OPAI/scripts/feature_preprocess.py -a example/FDA.hg19_multianno.txt.grl_p -c example/FDA.hg19_multianno.txt.cancervar -m ensemble -n 5 -d OPAI/saves/nonmissing_db.npy -o example/FDA.hg19_multianno.txt.cancervar.ensemble.csv
python3.6 OPAI/scripts/opai_predictor.py -i example/FDA.hg19_multianno.txt.cancervar.ensemble.csv -m ensemble -c OPAI/saves/ensemble.pt -d cpu -v example/FDA.hg19_multianno.txt.cancervar -o example/FDA.hg19_multianno.txt.cancervar.ensemble.pred
The predicted oncogenicity are in the (last)column of "ensemble_score" in file example/FDA.hg19_multianno.txt.cancervar.ensemble.pred
.
- using Evidence-based model
python3.6 OPAI/scripts/feature_preprocess.py -a example/FDA.hg19_multianno.txt.grl_p -c example/FDA.hg19_multianno.txt.cancervar -m evs -n 5 -d OPAI/saves/nonmissing_db.npy -o example/FDA.hg19_multianno.txt.cancervar.evs.csv
python3.6 OPAI/scripts/opai_predictor.py -i example/FDA.hg19_multianno.txt.cancervar.evs.csv -m evs -c OPAI/saves/evs.pt -d cpu -v example/FDA.hg19_multianno.txt.cancervar -o example/FDA.hg19_multianno.txt.cancervar.evs.pred
The predicted oncogenicity are in the (last)column of "evs_score" in file example/FDA.hg19_multianno.txt.cancervar.evs.pred
.
- Feature process using
feature_preprocess.py
python3.6 OPAI/scripts/feature_preprocess.py -h
usage: feature_preprocess.py [-h] -a ANNOVAR_PATH -c CANCERVAR_PATH [-m METHOD] [-n MISSING_COUNT] -d DATABASE -o OUTPUT
feature creator from cancervar output
optional arguments:
-h, --help show this help message and exit
-a ANNOVAR_PATH, --annovar_path ANNOVAR_PATH
the path to annovar file
-c CANCERVAR_PATH, --cancervar_path CANCERVAR_PATH
the path to cancervar file
-m METHOD, --method METHOD
output evs features or ensemble features (option: evs, ensemble)
-n MISSING_COUNT, --missing_count MISSING_COUNT
variant with more than N missing features will be discarded, (default: 5)
-d DATABASE, --database DATABASE
database for feature normalization
-o OUTPUT, --output OUTPUT
the path to output
- Prediction using
opai_predictor.py
python3.6 OPAI/scripts/opai_predictor.py -h
usage: opai_predictor.py [-h] -i INPUT -v CANCERVAR_PATH [-m METHOD] [-d DEVICE] -c CONFIG -o OUTPUT
optional arguments:
-h, --help show this help message and exit
-i INPUT, --input INPUT
the path to input feature
-v CANCERVAR_PATH, --cancervar_path CANCERVAR_PATH
the path to cancervar file
-m METHOD, --method METHOD
use evs features or ensemble features (option: evs, ensemble)
-d DEVICE, --device DEVICE
device used for dl-based predicting (option: cpu, cuda)
-c CONFIG, --config CONFIG
the path to trained model file
-o OUTPUT, --output OUTPUT
the path to output
We also developed a web server http://cancervar.wglab.org, which offers a graphical user interface for CancerVar and OPAI scores.
This web server provided pre-compiled 13M mutations annotation results and OPAI scores. Users can directly search their exonic variants by chromosomal position, by dbSNP identifier, or by gene name with the nucleic acid/amino acid change. The web server will provide full details on the variants, including all automatically generated criteria, most of the supportive evidence and also OPAI scores.
CancerVar and OPAI is free for non-commercial use without warranty. Users need to obtain licenses such as ANNOVAR by themselves. Please contact the authors for commercial use.
Quan Li, Zilin Ren, Kajia Cao, Marilyn M. Li, Yunyun Zhou and Kai Wang. CancerVar: an Artificial Intelligence empowered platform for clinical interpretation of somatic mutations in cancer ( Science Advances, 2022, https://www.science.org/doi/10.1126/sciadv.abj1624 )
Quan Li and Kai Wang. InterVar: Clinical interpretation of genetic variants by ACMG-AMP 2015 guideline. The American Journal of Human Genetics 100, 1-14, February 2, 2017,http://dx.doi.org/10.1016/j.ajhg.2017.01.004
The AMP/ASCO/CAP 2017 guidelines Li MM, Datto M, Duncavage EJ, Kulkarni S, Lindeman NI, Roy S, Tsimberidou AM, Vnencak-Jones CL, Wolff DJ, Younes A, Nikiforova MN. Standards and Guidelines for the Interpretation and Reporting of Sequence Variants in Cancer: A Joint Consensus Recommendation of the Association for Molecular Pathology, American Society of Clinical Oncology, and College of American Pathologists.
The ACMG/CGC 2019 guidelines Mikhail FM, et al. Technical laboratory standards for interpretation and reporting of acquired copy-number abnormalities and copy-neutral loss of heterozygosity in neoplastic disorders: a joint consensus recommendation from the American College of Medical Genetics and Genomics (ACMG) and the Cancer Genomics Consortium (CGC). Genet Med. 2019 Sep;21(9):1903-1916. doi: 10.1038/s41436-019-0545-7.
Thanks to all who provided bug reports.