ViralHostPredictor

Primary language: R. License: GNU General Public License v3.0 (GPL-3.0).

Predicting Reservoir Hosts and Arthropod Vectors from Evolutionary Signatures in RNA Virus Genomes

Simon A. Babayan, Richard J. Orton and Daniel G. Streicker

Background

This repository contains the scripts and datasets described in Babayan et al. (2018) Science, doi: 10.1126/science.aap9072. They use gradient boosting machines to predict the reservoir hosts of RNA viruses, whether a virus is transmitted by an arthropod vector, and the identity of that vector.

File descriptions

Datasets:

BabayanEtAl_sequences.fasta contains coding sequences for all viruses used in the analyses

EbolaTimeSeriesData.csv contains epidemiological data and genomic features for Zaire ebolaviruses sampled during the 2014-2016 West African outbreak

BabayanEtAl_VirusData.csv contains reservoir host, arthropod-borne transmission status and vector taxa for all ssRNA viruses analyzed and features extracted from the genome of each virus
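As a rough illustration of how features can be extracted from coding sequences, the sketch below parses FASTA-formatted text and computes dinucleotide frequencies for each record. It is not the feature-extraction code used in the study: the choice of dinucleotide frequencies, the function names `read_fasta` and `dinucleotide_frequencies`, and the toy records are all assumptions made for the example, and sequences are assumed to use the A/C/G/T alphabet.

```python
from collections import Counter
from itertools import product


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def dinucleotide_frequencies(seq):
    """Return the relative frequency of each of the 16 dinucleotides in seq.

    Overlapping pairs containing anything other than A, C, G, or T are skipped.
    """
    pairs = ["".join(p) for p in product("ACGT", repeat=2)]
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(counts[p] for p in pairs)
    return {p: (counts[p] / total if total else 0.0) for p in pairs}


# Toy records for illustration only; not taken from BabayanEtAl_sequences.fasta.
example = [">toy_virus_1", "ATGGCC", "ATTG", ">toy_virus_2", "ATGNNA"]
features = {name: dinucleotide_frequencies(seq) for name, seq in read_fasta(example)}
```

Because `read_fasta` accepts any iterable of lines, an open file handle for BabayanEtAl_sequences.fasta could be passed in place of the toy list.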

R scripts:

arthropodBorne_featureSelection.R Uses gradient boosting machines in h2o to estimate average feature importances for predicting arthropod-borne transmission across different training sets

arthropodBorne_PN+selGen.R Uses gradient boosting machines in h2o to train and generate study-wide (bagged) predictions of the arthropod-borne transmission status of each virus using phylogenetic neighborhoods and genomic features selected by arthropodBorne_featureSelection.R

arthropodBorne_PN.R Uses gradient boosting machines in h2o to train and generate study-wide (bagged) predictions of the arthropod-borne transmission status of each virus using phylogenetic neighborhoods

arthropodBorne_selGen.R Uses gradient boosting machines in h2o to train and generate study-wide (bagged) predictions of the arthropod-borne transmission status of each virus using genomic features selected by arthropodBorne_featureSelection.R

reservoir_featureSelection.R Uses gradient boosting machines in h2o to estimate average feature importances for predicting reservoir hosts across different training sets

reservoirPredict_PN+selGen.R Uses gradient boosting machines in h2o to train and generate study-wide (bagged) predictions of the reservoir host of each virus using phylogenetic neighborhoods and genomic features selected by reservoir_featureSelection.R

reservoirPredict_PN.R Uses gradient boosting machines in h2o to train and generate study-wide (bagged) predictions of the reservoir host of each virus using phylogenetic neighborhoods

reservoirPredict_selGen.R Uses gradient boosting machines in h2o to train and generate study-wide (bagged) predictions of the reservoir host of each virus using genomic features selected by reservoir_featureSelection.R

vectorPredict_featureSelection.R Uses gradient boosting machines in h2o to estimate average feature importances for predicting arthropod vectors across different training sets

vectorPredict_PN+selGen.R Uses gradient boosting machines in h2o to train and generate study-wide (bagged) predictions of the vector of each virus using phylogenetic neighborhoods and genomic features selected by vectorPredict_featureSelection.R

vectorPredict_PN.R Uses gradient boosting machines in h2o to train and generate study-wide (bagged) predictions of the vector of each virus using phylogenetic neighborhoods

vectorPredict_selGen.R Uses gradient boosting machines in h2o to train and generate study-wide (bagged) predictions of the vector of each virus using genomic features selected by vectorPredict_featureSelection.R
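The two ideas that recur in the script descriptions above, averaging feature importances across training sets and bagging predictions across models, can be sketched in a few lines of Python. This is only a conceptual outline, not a port of the R scripts: it assumes that importances are averaged arithmetically and that bagging means averaging predicted class probabilities, neither of which is stated here. The function names, feature names, host labels, and numbers are hypothetical.

```python
from collections import defaultdict


def average_importances(runs):
    """Average feature importances over several training runs.

    runs: list of dicts mapping feature name -> importance in one run.
    A feature absent from a run counts as importance 0 for that run.
    """
    totals = defaultdict(float)
    for run in runs:
        for feature, value in run.items():
            totals[feature] += value
    return {feature: total / len(runs) for feature, total in totals.items()}


def select_top_features(mean_importances, k):
    """Return the k features with the highest mean importance."""
    ranked = sorted(mean_importances, key=mean_importances.get, reverse=True)
    return ranked[:k]


def bagged_prediction(probabilities):
    """Average class probabilities from several models and pick the top class.

    probabilities: list of dicts mapping class label -> predicted probability,
    one dict per model trained on a different training set.
    """
    classes = probabilities[0].keys()
    mean = {c: sum(p[c] for p in probabilities) / len(probabilities) for c in classes}
    return max(mean, key=mean.get), mean


# Hypothetical importances from two training runs (assumption, not real output).
runs = [
    {"feature_a": 0.6, "feature_b": 0.3, "feature_c": 0.1},
    {"feature_a": 0.4, "feature_b": 0.5, "feature_c": 0.1},
]
selected = select_top_features(average_importances(runs), k=2)

# Hypothetical class probabilities for one virus from three models.
model_outputs = [
    {"host_x": 0.7, "host_y": 0.3},
    {"host_x": 0.4, "host_y": 0.6},
    {"host_x": 0.6, "host_y": 0.4},
]
label, mean_probs = bagged_prediction(model_outputs)
```

Averaging probabilities rather than counting votes keeps information about how confident each model was, which is one common reason to bag this way; the R scripts may aggregate differently.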

Python script:

algo_comparison.py Compares the predictive power of a variety of competing machine learning algorithms to predict reservoir hosts, arthropod-borne transmission and vector taxa from all possible genomic features
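The general shape of such a comparison is to score every candidate algorithm on the same held-out folds. The sketch below shows that shape using only the standard library. It does not reproduce algo_comparison.py: the two toy classifiers (`majority_class` and `nearest_centroid`) merely stand in for the algorithms the script compares, which are not listed here, and the data, fold scheme, and accuracy metric are assumptions made for the example.

```python
from collections import Counter


def majority_class(train_x, train_y):
    """Baseline: always predict the most common training label."""
    label = Counter(train_y).most_common(1)[0][0]
    return lambda x: label


def nearest_centroid(train_x, train_y):
    """Predict the label whose mean feature vector is closest to x."""
    groups = {}
    for x, y in zip(train_x, train_y):
        groups.setdefault(y, []).append(x)
    centroids = {
        y: [sum(col) / len(rows) for col in zip(*rows)] for y, rows in groups.items()
    }

    def predict(x):
        def squared_distance(c):
            return sum((a - b) ** 2 for a, b in zip(x, c))
        return min(centroids, key=lambda y: squared_distance(centroids[y]))

    return predict


def cross_validated_accuracy(fit, xs, ys, folds):
    """Mean held-out accuracy over `folds` interleaved folds."""
    scores = []
    for k in range(folds):
        test_idx = [i for i in range(len(xs)) if i % folds == k]
        train_idx = [i for i in range(len(xs)) if i % folds != k]
        predict = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        hits = sum(predict(xs[i]) == ys[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / folds


# Toy two-feature data set with two classes (illustration only).
xs = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [0.8, 0.9], [0.15, 0.25], [0.85, 0.75]]
ys = ["a", "b", "a", "b", "a", "b"]

candidates = {"majority_class": majority_class, "nearest_centroid": nearest_centroid}
results = {
    name: cross_validated_accuracy(fit, xs, ys, folds=3)
    for name, fit in candidates.items()
}
```

Each candidate is a function that takes training data and returns a prediction function, so further algorithms can be added to `candidates` without touching the evaluation loop.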