DeepMicrobes-data

Supporting data for the DeepMicrobes paper.
The code of DeepMicrobes is available at https://github.com/MicrobeLab/DeepMicrobes

Sequences of benchmark datasets

  • Reads simulated from 3,269 gut-derived MAGs
  • Mock communities generated from gastrointestinal bacterial isolates
  • Reads simulated from species absent from reference databases

Abundance profiles of different taxonomic classifiers on the mock communities

  • Genus-level results
  • Species-level results

Results of the IBD gut metagenome dataset

  • Species profiles generated with DeepMicrobes
  • Example fastq files from the dataset

Supplementary tables (.csv)

  • Table S1 Metadata of the 2,505 genomes used as reference to train the species classification model
  • Table S2 Metadata of the 3,269 high-quality MAGs used to benchmark the read-level precision and recall of different species classification models (see the metric sketch after this list)
  • Table S3 Number of reads sampled from each whole-genome-sequenced bacterial isolate collected from human fecal samples
  • Table S4 True genus profiles of the ten mock communities
  • Table S5 True relative abundance of 14 species members of the ten mock communities shared by the database of nine species classification tools
  • Table S6 False-positive species classifications of different tools on datasets generated by simulating 1x coverage reads from 121 genomes of species not included in the RefSeq database or in training (query aligned = 10-60%, ANI < 95%)
  • Table S7 The effect of k-mer length on vocabulary size
  • Table S8 The search space of hyperparameters and the hyperparameters chosen for each model (see https://github.com/MicrobeLab/DeepMicrobes for details on how the hyperparameters are applied to each model and layer)
  • Table S9 Read-level precision of different model architectures on variable-length reads simulated from 3,269 MAGs excluded from training
  • Table S10 Read-level recall of different model architectures on variable-length reads simulated from 3,269 MAGs excluded from training
  • Table S11 Read-level precision of the best model (Embed + LSTM + Attention) on fixed-length reads simulated based on error profiles of different sequencing platforms using 3,269 MAGs excluded from training
  • Table S12 Read-level recall of the best model (Embed + LSTM + Attention) on fixed-length reads simulated based on error profiles of different sequencing platforms using 3,269 MAGs excluded from training
  • Table S13 Read-level precision and recall of genus classification measured on the ten mock communities under different confidence scores
  • Table S14 LEfSe analysis on the species profiles of 106 subjects from iHMP
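
The read-level metrics reported in Tables S2 and S9-S13 can be recomputed from per-read predictions. The sketch below assumes the common convention that precision is computed over classified reads only and recall over all reads; consult the paper for the exact definitions used. The function name `read_level_metrics` is ours, not part of the repository.

```python
# Minimal sketch of read-level precision/recall, assuming the common convention:
#   precision = correct predictions / reads that received a prediction
#   recall    = correct predictions / all reads
# (Check the DeepMicrobes paper for the exact definitions used in the tables.)

def read_level_metrics(true_labels, predicted_labels, unclassified=None):
    """true_labels and predicted_labels are parallel lists of taxon labels;
    a prediction equal to `unclassified` means the read was left unassigned."""
    total = len(true_labels)
    classified = 0
    correct = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if pred == unclassified:
            continue
        classified += 1
        if pred == truth:
            correct += 1
    precision = correct / classified if classified else 0.0
    recall = correct / total if total else 0.0
    return precision, recall

# Example: three reads, one left unclassified, one misassigned.
print(read_level_metrics(["s1", "s2", "s3"], ["s1", None, "s2"]))  # (0.5, 0.333...)
```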

Labeled genome sequences used to create the training set

  • Genome collections for the species/genus model
  • Training labels are encoded in the sequence IDs (an illustrative parsing sketch follows below)
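
The exact header layout is defined by the DeepMicrobes conversion scripts and is not reproduced here. Purely as an illustration, the sketch below assumes a hypothetical `>label|<integer>|<original_id>` header and shows how label/sequence pairs could be recovered from such labeled FASTA records.

```python
# Illustration only: recover training labels from FASTA headers, assuming a
# hypothetical ">label|<integer>|<original_id>" layout. The real layout is
# defined by the DeepMicrobes conversion scripts; adjust the parsing accordingly.

def iter_labeled_records(fasta_path):
    """Yield (label, sequence) pairs from a labeled FASTA file."""
    label, chunks = None, []
    with open(fasta_path) as handle:
        for line in handle:
            line = line.rstrip()
            if line.startswith(">"):
                if label is not None:
                    yield label, "".join(chunks)
                # Hypothetical header: >label|42|original_sequence_id
                label = int(line[1:].split("|")[1])
                chunks = []
            else:
                chunks.append(line)
    if label is not None:
        yield label, "".join(chunks)
```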

Vocabulary

  • The vocabulary of k-mers used in TFRecord conversion. The pre-trained DeepMicrobes models use 12-mers.
  • A k-mer and its reverse complement are represented by the same embedding vector. This representation requires merging the two forms when converting fasta/fastq sequences to TFRecord (a sketch follows below).
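
A minimal sketch of this merging, assuming the vocabulary file lists one canonical k-mer per line in token-ID order (the file layout is an assumption; the actual TFRecord conversion scripts live in the DeepMicrobes repository): each k-mer is replaced by the lexicographically smaller of itself and its reverse complement before lookup.

```python
# Sketch of canonical 12-mer tokenization, assuming the vocabulary file lists one
# canonical k-mer per line in token-ID order (an assumption; see the DeepMicrobes
# repository for the actual conversion scripts).
# For even k there are (4**k + 4**(k//2)) // 2 canonical k-mers,
# e.g. 8,390,656 for k = 12, plus any special tokens.

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, rc)

def load_vocab(path):
    """Map each canonical k-mer to its line index (token ID)."""
    with open(path) as handle:
        return {line.strip(): i for i, line in enumerate(handle)}

def tokenize(read, vocab, k=12):
    """Convert a read into token IDs; k-mers missing from the vocabulary
    (e.g. containing N) fall back to an unknown-token ID."""
    unk_id = len(vocab)
    return [
        vocab.get(canonical(read[i:i + k].upper()), unk_id)
        for i in range(len(read) - k + 1)
    ]
```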

Custom Kaiju database

  • Predicted protein sequences from the genomes of each species in the complete bacterial repertoire of the human gut
  • Custom names.dmp, nodes.dmp, and Kaiju index
  • Benchmark results on this custom database
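
As an illustration of how results on this database can be summarized, the sketch below turns Kaiju's default three-column output (classification status, read name, NCBI taxon ID) into a relative-abundance profile, resolving taxon IDs to names via the custom names.dmp. File paths and function names are placeholders.

```python
# Sketch: summarize Kaiju's default output (status, read name, taxon ID) into a
# relative-abundance profile, using the custom names.dmp for scientific names.
# File paths and function names are placeholders.
from collections import Counter

def load_scientific_names(names_dmp):
    """Map taxon ID -> scientific name from an NCBI-style names.dmp."""
    names = {}
    with open(names_dmp) as handle:
        for line in handle:
            fields = [field.strip() for field in line.split("|")]
            if len(fields) > 3 and fields[3] == "scientific name":
                names[fields[0]] = fields[1]
    return names

def kaiju_profile(kaiju_output, names_dmp):
    """Relative abundance of classified reads per taxon (scientific name)."""
    names = load_scientific_names(names_dmp)
    counts = Counter()
    with open(kaiju_output) as handle:
        for line in handle:
            parts = line.rstrip("\n").split("\t")
            if parts and parts[0] == "C":  # classified reads only
                taxid = parts[2]
                counts[names.get(taxid, taxid)] += 1
    total = sum(counts.values())
    return {taxon: count / total for taxon, count in counts.items()} if total else {}
```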

Scripts

  • Command lines used to run all the taxonomic classification tools benchmarked in the paper
  • R scripts used to generate figures