k-taxatree

k-taxatree is a classification workflow written in R that predicts the labels of the first four taxonomic levels (kingdom, phylum, class, order) of metagenomic sequences, using a multi-label Random Forest as the underlying model. The model takes 6-mer count vectors as input, so the workflow also implements a method for determining the appropriate k-mer length.
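To make the notion of a k-mer count vector concrete, here is a minimal base-R sketch. It is not the repository's implementation (that lives in count_kmers_functions.R); the function name `count_kmers` and its arguments are hypothetical, and it assumes an alphabet of A, C, G and T with overlapping windows.

```r
# Count overlapping k-mers in a single DNA sequence (base R only).
# `count_kmers` is a hypothetical helper, not the repository's code.
count_kmers <- function(sequence, k = 6) {
  alphabet <- c("A", "C", "G", "T")
  # Enumerate every possible k-mer over the alphabet in a fixed order.
  grid <- expand.grid(rep(list(alphabet), k), stringsAsFactors = FALSE)
  all_kmers <- sort(do.call(paste0, grid))

  n <- nchar(sequence)
  if (n < k) {
    observed <- character(0)
  } else {
    # Slide a window of width k over the sequence, one base at a time.
    starts <- seq_len(n - k + 1)
    observed <- substring(toupper(sequence), starts, starts + k - 1)
  }

  # A factor with fixed levels keeps zero counts and ignores windows
  # that contain other characters, such as the ambiguous base N.
  counts <- table(factor(observed, levels = all_kmers))
  setNames(as.integer(counts), names(counts))
}
```

For example, `count_kmers("ACGTAC", k = 2)[["AC"]]` returns 2, because the 2-mer AC occurs at positions 1 and 5. Stacking one such vector per sequence gives a matrix with one row per sequence and one column per k-mer.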

Dataset

The project uses data from the Earth Microbiome Project, retrieved with the R package empdata. The dataset comprises 91k sequences of 150 bp that target the V4 region of the 16S rRNA gene. It was split into two subsets, train-test and validation, consisting of 30% and 70% of the initial dataset, respectively. The subsets are available in the emp-data folder.
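A 30%/70% split of this kind can be sketched in base R as below. This is an illustration only: the actual split is performed by creating_train_test_validation.R, the text above does not say whether it was stratified, and `split_indices`, `train_fraction` and `seed` are hypothetical names. The sketch uses a plain random split.

```r
# Randomly split row indices into a train-test part and a validation part.
# `split_indices` is a hypothetical helper; the split is not stratified.
split_indices <- function(n, train_fraction = 0.3, seed = 1) {
  set.seed(seed)                       # make the split reproducible
  shuffled <- sample.int(n)            # random permutation of 1..n
  n_train <- round(n * train_fraction)
  train_test <- sort(shuffled[seq_len(n_train)])
  # Everything not drawn for train-test goes to validation.
  validation <- setdiff(seq_len(n), train_test)
  list(train_test = train_test, validation = validation)
}

# Usage with a hypothetical data frame `sequences`:
# idx <- split_indices(nrow(sequences))
# train_test_set <- sequences[idx$train_test, ]
# validation_set <- sequences[idx$validation, ]
```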

All datasets were collected from the FTP site of the Earth Microbiome Project.

Sample processing, sequencing, and core amplicon data analysis were performed by the Earth Microbiome Project (www.earthmicrobiome.org), and all amplicon sequence data and metadata have been made public through the EMP data portal (qiita.microbio.me/emp).

Please cite the following publication if you use any of them:

Thompson, L., Sanders, J., McDonald, D. et al. A communal catalogue reveals Earth’s multiscale microbial diversity. Nature 551, 457–463 (2017). https://doi.org/10.1038/nature24621

Getting started

Prerequisites

The following packages must be installed to run the project:

  • from CRAN
install.packages(c("data.table", "cluster", "Rfast", "plyr", "caret", "UBL", "splitstackshape", "mlr", "mldr", "dplyr", "hash", "stringr", "randomForestSRC"))

The parallel and stats packages are also used; they ship with base R and do not need to be installed from CRAN.
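To avoid reinstalling packages that are already present, the installation can be restricted to the missing ones. The sketch below is one way to do this; `missing_packages` is a hypothetical helper, and the package list omits parallel and stats because they are part of every R installation.

```r
# Return the packages from `pkgs` that are not yet installed.
# `missing_packages` is a hypothetical helper name.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

required <- c("data.table", "cluster", "Rfast", "plyr", "caret", "UBL",
              "splitstackshape", "mlr", "mldr", "dplyr", "hash", "stringr",
              "randomForestSRC")

to_install <- missing_packages(required)

# Install only in an interactive session, so that sourcing this file
# from a batch job never triggers downloads.
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install)
}
```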

Installing

The project can be downloaded using git:

git clone https://github.com/BiodataAnalysisGroup/k-taxatree

Running the project

The R-scripts folder contains 9 main scripts that cover every step of the workflow, from selecting the appropriate k-mer length and constructing the k-mer matrix to predicting the labels of the validation subset. In detail:

  • 01_k_selection_tool.R: selection of the optimal k-length
  • 02_kmer_matrix_creation.R: creation of the whole 6-mer matrix (4095 features)
  • 03_feature_selection.R: steps to select the most informative features (340 features)
  • 04_model_hp_optimization.R: stratified holdout, repeated 10 times, to determine the mtry, ntree and predict.threshold values that achieve the highest macro F1-score
  • 05_final_model.R: creation of the final model
  • 06_unassigned_predictions.R: utilizing the final model to predict labels for the yet unassigned sequences of the input dataset
  • 07_validation_predictions.R: utilizing the final model to make predictions for the validation subset
  • count_kmers_functions.R: helper functions for creating the k-mer matrix
  • multilabel_functions.R: helper functions for the machine learning workflow.
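The macro F1-score used as the selection criterion in 04_model_hp_optimization.R is the unweighted mean of the per-class F1-scores. The sketch below computes it for a single taxonomic level in base R. It is not taken from multilabel_functions.R: `macro_f1` is a hypothetical helper, and how the repository aggregates the score across labels or levels is not stated above.

```r
# Macro F1: unweighted mean of the per-class F1 scores.
# `macro_f1` is a hypothetical helper, not the repository's code.
macro_f1 <- function(truth, predicted) {
  truth <- as.character(truth)
  predicted <- as.character(predicted)
  classes <- sort(unique(c(truth, predicted)))

  f1_per_class <- vapply(classes, function(cl) {
    tp <- sum(predicted == cl & truth == cl)  # true positives
    fp <- sum(predicted == cl & truth != cl)  # false positives
    fn <- sum(predicted != cl & truth == cl)  # false negatives
    if (tp == 0) return(0)                    # no correct hit: F1 is 0
    precision <- tp / (tp + fp)
    recall <- tp / (tp + fn)
    2 * precision * recall / (precision + recall)
  }, numeric(1))

  mean(f1_per_class)
}

# Usage with hypothetical phylum labels:
# macro_f1(truth = c("a", "a", "b", "b"), predicted = c("a", "b", "b", "b"))
```

Because every class contributes equally to the mean, rare taxa weigh as much as abundant ones, which is why macro averaging is a common choice for imbalanced label sets.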

The folder extra-scripts contains a few extra scripts used to:

  • 0_data_for_local_usage.R: create a subset of the train-test set for local use, so that the rest of the workflow can be tested on a local machine
  • creating_train_test_validation.R: perform the split on the initial dataset
  • blast_results.R: assign taxonomies to the yet unassigned sequences of the dataset using BLAST
  • compare_unassigned_blast.R: compare the k-taxatree predictions with the BLAST results on the unassigned sequences
  • compare_unassigned_rdp.R: compare the RDP predictions with the BLAST results on the unassigned sequences.

The project provides the input datasets and the outputs generated in every step of the workflow. The folder emp-data includes the datasets retrieved from the Earth Microbiome Project Repository. The Output folder contains several subfolders with the outputs generated by the different scripts.

The workflow was run on an Ubuntu server with 141 GB of RAM and 32 cores and took approximately 20 days in total.

Given the optimal hyperparameters (mtry = 17, ntree = 300, predict.threshold = 0.2), the user can recreate the final model using the 05_final_model.R script. Alternatively, the model is provided in the model.rds file in the final-model folder and can be used directly to make predictions. Prediction with the saved model is fully implemented in the scripts 06_unassigned_predictions.R and 07_validation_predictions.R.
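Loading the provided model might look like the sketch below. Several points are assumptions: the path assumes the final-model folder sits in the current working directory, `load_final_model` and `new_kmers` are hypothetical names, and the class of the stored object is not stated above, so it is worth checking before calling `predict()`. The scripts 06_unassigned_predictions.R and 07_validation_predictions.R remain the reference for the actual procedure.

```r
# Read the serialized model and fail early if the path is wrong.
# `load_final_model` is a hypothetical helper; the default path is an
# assumption about where the final-model folder is located.
load_final_model <- function(path = file.path("final-model", "model.rds")) {
  if (!file.exists(path)) {
    stop("Model file not found: ", path)
  }
  readRDS(path)
}

# Usage, assuming `new_kmers` (hypothetical) is a data frame holding the
# selected 6-mer count columns for the sequences to classify:
# library(mlr)              # load the packages the model was built with
# library(randomForestSRC)
# model <- load_final_model()
# class(model)              # shows which predict() method will be used
# predictions <- predict(model, newdata = new_kmers)
```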

For more details on the individual scripts, please refer to the wiki.

License

This project is licensed under the MIT License - see the LICENSE file for details.