k-taxatree

k-taxatree is a classification workflow written in R that predicts the labels of the first four taxonomic levels (kingdom, phylum, class, order) of metagenomic data, using a multi-label Random Forest as the underlying model. The model takes 6-mer count vectors as input, so a method for determining the appropriate k-length was also implemented.
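
To illustrate what a 6-mer count vector is, the following base R sketch counts overlapping k-mers in a few sequences. It is not the project's own implementation: `count_kmers` is a hypothetical helper, and skipping windows with ambiguous bases is an assumption made here. The sketch enumerates all 4^6 = 4096 possible 6-mers, so its column count may differ from the project's own matrix.

```r
# Count overlapping k-mers in DNA sequences using base R only.
# `count_kmers` is a hypothetical helper, not a function from the project.
count_kmers <- function(seqs, k = 6) {
  bases <- c("A", "C", "G", "T")
  # Enumerate all 4^k possible k-mers in a fixed (sorted) order.
  grid <- expand.grid(rep(list(bases), k), stringsAsFactors = FALSE)
  all_kmers <- sort(do.call(paste0, grid))
  counts <- matrix(0L, nrow = length(seqs), ncol = length(all_kmers),
                   dimnames = list(NULL, all_kmers))
  for (i in seq_along(seqs)) {
    s <- toupper(seqs[i])
    n <- nchar(s)
    if (n < k) next
    starts <- seq_len(n - k + 1)
    kmers <- substring(s, starts, starts + k - 1)
    # Assumption: windows containing ambiguous bases (e.g. N) are skipped.
    tab <- table(kmers[kmers %in% all_kmers])
    if (length(tab) > 0) {
      counts[i, names(tab)] <- as.integer(tab)
    }
  }
  counts
}

m <- count_kmers(c("ACGTACGTAC", "AAAAAAAA"), k = 6)
dim(m)
```

Each row of `m` is the count vector of one sequence, and each column corresponds to one 6-mer.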

Dataset

The project uses data from the Earth Microbiome Project, retrieved with the R package empdata. The dataset contains 91k sequences of 150 bp length targeting the V4 region of the 16S rRNA gene. It was split into two subsets, train-test and validation, consisting of 30% and 70% of the initial dataset, respectively. The subsets are available in the emp-data folder.
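
A split of this kind can be sketched in base R as follows. This is only an illustration under assumptions: `split_dataset`, the seed and the toy data frame are hypothetical, and a plain random split is used here because the text does not say whether the project's split was stratified.

```r
# Hypothetical helper: randomly split a data frame into a train-test part
# and a validation part. Not the project's own splitting code.
split_dataset <- function(data, train_test_fraction = 0.3, seed = 1) {
  set.seed(seed)
  n <- nrow(data)
  idx <- sample.int(n, size = round(train_test_fraction * n))
  rest <- setdiff(seq_len(n), idx)
  list(train_test = data[idx, , drop = FALSE],
       validation = data[rest, , drop = FALSE])
}

# Toy stand-in for the sequence table (columns are assumptions).
toy <- data.frame(id = 1:10, sequence = rep("ACGT", 10))
parts <- split_dataset(toy)
nrow(parts$train_test)
nrow(parts$validation)
```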

All datasets were collected from the FTP site of the Earth Microbiome Project.

Sample processing, sequencing, and core amplicon data analysis were performed by the Earth Microbiome Project (www.earthmicrobiome.org), and all amplicon sequence data and metadata have been made public through the EMP data portal (qiita.microbio.me/emp).

Please cite the following publication if you use any of them:

Thompson, L., Sanders, J., McDonald, D. et al. A communal catalogue reveals Earth’s multiscale microbial diversity. Nature 551, 457–463 (2017). https://doi.org/10.1038/nature24621

Getting started

Prerequisites

The following packages must be installed to run the project. The workflow also uses parallel and stats, which ship with R and do not need to be installed separately.

from CRAN

install.packages(c("data.table", "cluster", "Rfast", "plyr", "caret", "UBL", "splitstackshape", "mlr", "mldr", "dplyr", "hash", "stringr", "randomForestSRC"))
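
To avoid reinstalling packages that are already present, one option is to check first which ones are missing. The sketch below uses base R only; `missing_packages` is a hypothetical helper, and the install call is left commented out so that nothing is installed without intent.

```r
# Hypothetical helper: return the packages that are not yet installed.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# CRAN packages from the list above (parallel and stats ship with R).
required <- c("data.table", "cluster", "Rfast", "plyr", "caret", "UBL",
              "splitstackshape", "mlr", "mldr", "dplyr", "hash", "stringr",
              "randomForestSRC")

to_install <- missing_packages(required)
print(to_install)
# Uncomment to install only the missing packages:
# if (length(to_install) > 0) install.packages(to_install)
```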

Installing

The project can be downloaded using git:

git clone https://github.com/BiodataAnalysisGroup/k-taxatree

Running the project

The project consists of 9 main scripts in the R-scripts folder, covering every step from selecting the appropriate k-length and constructing the k-mer matrix to predicting the labels of the validation subset. In detail:

01_k_selection_tool.R: selection of the optimal k-length
02_kmer_matrix_creation.R: creation of the whole 6-mer matrix (4095 features)
03_feature_selection.R: steps to select the most informative features (340 features)
04_model_hp_optimization.R: stratified holdout framework, repeated 10 times, to determine the mtry, ntree and predict.threshold values that achieve the highest macro F1-score
05_final_model.R: creation of the final model
06_unassigned_predictions.R: utilizing the final model to predict labels for the yet unassigned sequences of the input dataset
07_validation_predictions.R: utilizing the final model to make predictions for the validation subset
count_kmers_functions.R: helper functions for creating the k-mer matrix
multilabel_functions.R: helper functions for the machine learning workflow.
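
The macro F1-score used as the selection criterion in the hyperparameter optimization step averages the F1-scores of the individual classes, so rare classes weigh as much as frequent ones. The base R sketch below shows the computation for a single taxonomic level. `macro_f1` is a hypothetical helper rather than the project's function, and scoring a class with no true positives as 0 is one common convention assumed here.

```r
# Hypothetical helper: macro-averaged F1 over all classes seen in
# either the true or the predicted labels.
macro_f1 <- function(truth, predicted) {
  truth <- as.character(truth)
  predicted <- as.character(predicted)
  classes <- union(unique(truth), unique(predicted))
  f1 <- vapply(classes, function(cl) {
    tp <- sum(predicted == cl & truth == cl)
    fp <- sum(predicted == cl & truth != cl)
    fn <- sum(predicted != cl & truth == cl)
    # Assumption: a class without true positives gets an F1 of 0.
    if (tp == 0) return(0)
    2 * tp / (2 * tp + fp + fn)
  }, numeric(1))
  mean(f1)
}

truth     <- c("Bacteria", "Bacteria", "Archaea", "Archaea")
predicted <- c("Bacteria", "Bacteria", "Bacteria", "Archaea")
score <- macro_f1(truth, predicted)
```

In this toy example the per-class F1-scores are 4/5 for Bacteria and 2/3 for Archaea, and `score` is their mean.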

The folder extra-scripts contains a few extra scripts used to:

0_data_for_local_usage.R: create a subset of the train-test set for testing the rest of the workflow on a local machine
creating_train_test_validation.R: perform the split on the initial dataset
blast_results.R: assign taxonomies on the yet unassigned sequences of the dataset using BLAST
compare_unassigned_blast.R: compare the k-taxatree predictions with the BLAST results on the unassigned sequences
compare_unassigned_rdp.R: compare the RDP predictions with the BLAST results on the unassigned sequences.

The project provides the input datasets and the outputs generated in every step of the workflow. The folder emp-data includes the datasets retrieved from the Earth Microbiome Project Repository. The Output folder contains several subfolders with the outputs generated by the different scripts.

The workflow was run on an Ubuntu server with 141 GB of RAM and 32 cores and took approximately 20 days in total.

Given the optimal hyperparameters (mtry = 17, ntree = 300, predict.threshold = 0.2), the user can recreate the final model using the 05_final_model.R script. Alternatively, the model is provided in the model.rds file in the final-model folder and can be used directly to make predictions, as implemented in the scripts 06_unassigned_predictions.R and 07_validation_predictions.R.
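
The role of predict.threshold can be illustrated with a small base R sketch. It assumes that the model yields one probability per label and that a label is assigned when its probability reaches the threshold. The probability matrix, `apply_threshold` and the label names are hypothetical. How probabilities are actually obtained from the stored model (which would typically be read with `readRDS()`) depends on the model class and is not shown here.

```r
# Hypothetical helper: turn per-label probabilities into TRUE/FALSE
# label assignments using a fixed threshold.
apply_threshold <- function(prob, threshold = 0.2) {
  prob >= threshold
}

# Made-up probabilities for two sequences and three kingdom labels.
prob <- matrix(c(0.85, 0.10, 0.05,
                 0.15, 0.19, 0.66),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("seq1", "seq2"),
                               c("Bacteria", "Archaea", "Eukaryota")))

assigned <- apply_threshold(prob, threshold = 0.2)

# Collect the assigned label names for each sequence.
labels <- setNames(
  lapply(seq_len(nrow(assigned)),
         function(i) colnames(assigned)[assigned[i, ]]),
  rownames(assigned)
)
print(labels)
```

A lower threshold assigns more labels per sequence, and a higher one assigns fewer, which is why the threshold was tuned together with mtry and ntree.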

For more details on the individual scripts, please refer to the wiki.

License

This project is licensed under the MIT License - see the LICENSE file for details.