Supporting data for the DeepMicrobes paper.
The DeepMicrobes source code is available at https://github.com/MicrobeLab/DeepMicrobes
- Reads simulated from 3,269 gut-derived MAGs
- Mock communities generated from gastrointestinal bacterial isolates
- Reads simulated from species absent from reference databases
- Genus-level results
- Species-level results
- Species profiles generated with DeepMicrobes
- Example fastq files from the dataset
Table S1: Metadata of the 2,505 genomes used as reference to train the species classification model
Table S2: Metadata of the 3,269 high-quality MAGs used to benchmark the read-level precision and recall of different species classification models
Table S3: Number of reads sampled from each whole-genome-sequenced bacterial isolate collected from human fecal samples
Table S4: True genus profiles of the ten mock communities
Table S5: True relative abundance of the 14 species members of the ten mock communities shared by the databases of nine species classification tools
Table S6: False positive species classifications of different tools on data sets generated by simulating 1x coverage reads from 121 genomes of species not included in the RefSeq database or in training (query aligned = 10-60%, ANI < 95%)
Table S7: The effect of k-mer length on vocabulary size
Table S8: The search space of hyperparameters and the hyperparameters chosen for each model (see https://github.com/MicrobeLab/DeepMicrobes for details of how the hyperparameters are adopted for each model and each layer)
Table S9: Read-level precision of different model architectures on variable-length reads simulated from 3,269 MAGs excluded from training
Table S10: Read-level recall of different model architectures on variable-length reads simulated from 3,269 MAGs excluded from training
Table S11: Read-level precision of the best model (Embed + LSTM + Attention) on fixed-length reads simulated based on error profiles of different sequencing platforms using 3,269 MAGs excluded from training
Table S12: Read-level recall of the best model (Embed + LSTM + Attention) on fixed-length reads simulated based on error profiles of different sequencing platforms using 3,269 MAGs excluded from training
Table S13: Read-level precision and recall of genus classification measured on the ten mock communities under different confidence scores
Table S14: LEfSe analysis on the species profiles of 106 subjects from iHMP
- Genome collections for the species/genus model
- Training labels are provided in the sequence IDs
- The vocabulary of k-mers used in TFRecord conversion. The pre-trained DeepMicrobes models use 12-mers.
- Complementary k-mers are represented as the same embedding vector. Such representation requires merged k-mers when converting fasta/fastq sequences to TFRecord.
- Predicted protein sequences from the genomes of each species in the complete bacterial repertoire of human gut
- Custom names.dmp, nodes.dmp, and Kaiju index
- Benchmark results on this custom database
- Command lines used to run all the taxonomic classification tools benchmarked in the paper
- R scripts used to generate figures
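
The merged k-mer representation mentioned in the vocabulary entry above can be illustrated with a short Python sketch. It is not taken from the DeepMicrobes code base: the function names are hypothetical, the choice of the lexicographically smaller string as the representative of a k-mer and its reverse complement is an assumption, and only the bases A, C, G and T are handled. The sketch also counts how many distinct merged k-mers exist for a given k, which is the kind of relationship between k-mer length and vocabulary size that Table S7 reports.

```python
from itertools import product

# Translation table mapping each base to its complement (A<->T, C<->G).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Merge a k-mer with its reverse complement.

    Assumption: the lexicographically smaller of the two strings is used
    as the shared representative.
    """
    return min(kmer, reverse_complement(kmer))


def merged_vocab_size(k: int) -> int:
    """Count distinct merged k-mers of length k.

    Of the 4**k k-mers, those equal to their own reverse complement
    (possible only for even k, 4**(k/2) of them) form a class of one;
    all others pair up into classes of two.
    """
    total = 4 ** k
    self_complementary = 4 ** (k // 2) if k % 2 == 0 else 0
    return (total + self_complementary) // 2


def read_to_kmers(read: str, k: int) -> list[str]:
    """Split a read into overlapping merged k-mers (stride 1)."""
    return [canonical(read[i:i + k]) for i in range(len(read) - k + 1)]


if __name__ == "__main__":
    print(read_to_kmers("ACGTT", 3))
    print(merged_vocab_size(12))
```

Under these assumptions, a read and its reverse complement yield the same set of merged k-mers, so both strands map to the same embedding vectors. Any special tokens a real vocabulary file may contain (for example, for unknown k-mers) are not counted here.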