Simon A. Babayan, Richard J. Orton and Daniel G. Streicker
Scripts and datasets described in Babayan et al. (2018) Science, doi: 10.1126/science.aap9072, which use gradient boosting machines to predict the reservoir hosts of viruses, the existence of arthropod vectors and the identity of those vectors.
BabayanEtAl_sequences.fasta contains coding sequences for all viruses used in the analyses
EbolaTimeSeriesData.csv contains epidemiological data and genomic features for Zaire ebolaviruses sampled during the 2014-2016 West African outbreak
BabayanEtAl_VirusData.csv contains reservoir host, arthropod-borne transmission status and vector taxa for all ssRNA viruses analyzed and features extracted from the genome of each virus
arthropodBorne_featureSelection.R Uses gradient boosting machines in h2o to estimate average feature importances for predicting arthropod-borne transmission across different training sets
arthropodBorne_PN+selGen.R Uses gradient boosting machines in h2o to train and generate study-wide (bagged) predictions of the arthropod-borne transmission status of each virus using phylogenetic neighborhoods and genomic features selected by arthropodBorne_featureSelection.R
arthropodBorne_PN.R Uses gradient boosting machines in h2o to train and generate study-wide (bagged) predictions of the arthropod-borne transmission status of each virus using phylogenetic neighborhoods
arthropodBorne_selGen.R Uses gradient boosting machines in h2o to train and generate study-wide (bagged) predictions of the arthropod-borne transmission status of each virus using genomic features selected by arthropodBorne_featureSelection.R
reservoir_featureSelection.R Uses gradient boosting machines in h2o to estimate average feature importances for predicting reservoir hosts across different training sets
reservoirPredict_PN+selGen.R Uses gradient boosting machines in h2o to train and generate study-wide (bagged) predictions of the reservoir host of each virus using phylogenetic neighborhoods and genomic features selected by reservoir_featureSelection.R
reservoirPredict_PN.R Uses gradient boosting machines in h2o to train and generate study-wide (bagged) predictions of the reservoir host of each virus using phylogenetic neighborhoods
reservoirPredict_selGen.R Uses gradient boosting machines in h2o to train and generate study-wide (bagged) predictions of the reservoir host of each virus using genomic features selected by reservoir_featureSelection.R
vectorPredict_featureSelection.R Uses gradient boosting machines in h2o to estimate average feature importances for predicting arthropod vectors across different training sets
vectorPredict_PN+selGen.R Uses gradient boosting machines in h2o to train and generate study-wide (bagged) predictions of the vector of each virus using phylogenetic neighborhoods and genomic features selected by vectorPredict_featureSelection.R
vectorPredict_PN.R Uses gradient boosting machines in h2o to train and generate study-wide (bagged) predictions of the vector of each virus using phylogenetic neighborhoods
vectorPredict_selGen.R Uses gradient boosting machines in h2o to train and generate study-wide (bagged) predictions of the vector of each virus using genomic features selected by vectorPredict_featureSelection.R
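
The R scripts above rest on two ideas: averaging feature importances over models fitted to different training sets, and bagging predictions across those models. The Python sketch below illustrates both under stated assumptions; it is not the h2o code from the scripts. The function names (mean_scores, top_features, bagged_prediction), the choice to count a feature missing from a run as zero importance, the use of averaged class probabilities as the bagging rule, and the example labels are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def mean_scores(runs: List[Dict[str, float]]) -> Dict[str, float]:
    """Average a score per key over several runs.

    Assumption: a key that is absent from a run contributes 0 to the mean.
    """
    if not runs:
        raise ValueError("at least one run is required")
    totals: Dict[str, float] = defaultdict(float)
    for run in runs:
        for key, value in run.items():
            totals[key] += value
    return {key: total / len(runs) for key, total in totals.items()}


def top_features(importance_runs: List[Dict[str, float]], k: int) -> List[str]:
    """Return the k features with the highest mean importance.

    Ties are broken alphabetically so the result is deterministic.
    """
    avg = mean_scores(importance_runs)
    return sorted(avg, key=lambda name: (-avg[name], name))[:k]


def bagged_prediction(probability_runs: List[Dict[str, float]]) -> str:
    """Bag class probabilities from several models and pick the best class.

    Assumption: bagging here means taking the unweighted mean of the class
    probabilities predicted by each model, then choosing the top class.
    """
    avg = mean_scores(probability_runs)
    return max(sorted(avg), key=lambda label: avg[label])


if __name__ == "__main__":
    # Hypothetical importances from two training sets.
    importances = [
        {"feature_a": 0.5, "feature_b": 0.3, "feature_c": 0.2},
        {"feature_a": 0.1, "feature_b": 0.6, "feature_c": 0.3},
    ]
    print(top_features(importances, k=2))

    # Hypothetical class probabilities for one virus from three models.
    probabilities = [
        {"class_x": 0.6, "class_y": 0.4},
        {"class_x": 0.3, "class_y": 0.7},
        {"class_x": 0.2, "class_y": 0.8},
    ]
    print(bagged_prediction(probabilities))
```

Averaging probabilities rather than counting votes keeps information about how confident each model was; whether the scripts use this rule is not stated here.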
algo_comparison.py Compares the predictive power of a variety of competing machine learning algorithms to predict reservoir hosts, arthropod-borne transmission and vector taxa from all possible genomic features
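
To make the idea of comparing competing algorithms concrete, here is a minimal, standard-library-only sketch that scores two toy classifiers by k-fold cross-validated accuracy. It is an illustration, not the contents of algo_comparison.py: the algorithms, metric and fold scheme used by that script are not listed here, and every name below (train_majority, train_nearest_neighbour, cross_val_accuracy, compare) as well as the toy data are assumptions.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence

Row = Sequence[float]
Predictor = Callable[[Row], str]
# A trainer takes training rows and labels and returns a predict function.
Trainer = Callable[[List[Row], List[str]], Predictor]


def train_majority(X: List[Row], y: List[str]) -> Predictor:
    """Baseline: always predict the most frequent training label."""
    most_common = Counter(y).most_common(1)[0][0]
    return lambda row: most_common


def train_nearest_neighbour(X: List[Row], y: List[str]) -> Predictor:
    """1-nearest-neighbour using squared Euclidean distance."""
    def predict(row: Row) -> str:
        dists = [sum((a - b) ** 2 for a, b in zip(row, other)) for other in X]
        return y[dists.index(min(dists))]
    return predict


def cross_val_accuracy(trainer: Trainer, X: List[Row], y: List[str], k: int = 3) -> float:
    """Accuracy over k folds; sample i is held out in fold i % k."""
    correct = 0
    for fold in range(k):
        test_idx = set(range(fold, len(X), k))
        train_X = [X[i] for i in range(len(X)) if i not in test_idx]
        train_y = [y[i] for i in range(len(X)) if i not in test_idx]
        predict = trainer(train_X, train_y)
        correct += sum(predict(X[i]) == y[i] for i in test_idx)
    return correct / len(X)


def compare(trainers: Dict[str, Trainer], X: List[Row], y: List[str], k: int = 3) -> Dict[str, float]:
    """Score every trainer on the same folds so results are comparable."""
    return {name: cross_val_accuracy(t, X, y, k) for name, t in trainers.items()}


if __name__ == "__main__":
    # Hypothetical one-feature data with two well-separated classes.
    X = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
    y = ["A", "A", "A", "B", "B", "B"]
    scores = compare(
        {"majority": train_majority, "nearest_neighbour": train_nearest_neighbour},
        X, y, k=3,
    )
    for name, acc in sorted(scores.items()):
        print(f"{name}: {acc:.2f}")
```

Evaluating every candidate on identical folds is the key design choice: differences in score then reflect the algorithms rather than the luck of the split.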