DeepMicrobes-data

Supporting data for the DeepMicrobes paper.
The code of DeepMicrobes is available at https://github.com/MicrobeLab/DeepMicrobes

Sequences of benchmark datasets

  • Reads simulated from 3,269 gut-derived MAGs
  • Mock communities generated from gastrointestinal bacterial isolates
  • Reads simulated from species absent from reference databases

Abundance profiles of different taxonomic classifiers on the mock communities

  • Genus-level results
  • Species-level results

Results of the IBD gut metagenome dataset

  • Species profiles generated with DeepMicrobes
  • Example fastq files from the dataset

Supplementary tables (.csv)

  • Table S1 Metadata of the 2,505 genomes used as reference to train the species classification model
  • Table S2 Metadata of the 3,269 high-quality MAGs used to benchmark the read-level precision and recall of different species classification models (see the metric sketch after this list)
  • Table S3 Number of reads sampled from each whole-genome-sequenced bacterial isolate collected from human fecal samples
  • Table S4 True genus profiles of the ten mock communities
  • Table S5 True relative abundance of 14 species members of the ten mock communities shared by the database of nine species classification tools
  • Table S6 False-positive species classifications of different tools on datasets generated by simulating 1x coverage reads from 121 genomes of species not included in the RefSeq database or in training (query aligned = 10-60%, ANI < 95%)
  • Table S7 The effect of k-mer length on vocabulary size
  • Table S8 The search space of hyperparameters and the hyperparameters chosen for each model (see https://github.com/MicrobeLab/DeepMicrobes for details on how the hyperparameters are applied to each model and layer)
  • Table S9 Read-level precision of different model architectures on variable-length reads simulated from 3,269 MAGs excluded from training
  • Table S10 Read-level recall of different model architectures on variable-length reads simulated from 3,269 MAGs excluded from training
  • Table S11 Read-level precision of the best model (Embed + LSTM + Attention) on fixed-length reads simulated based on error profiles of different sequencing platforms using 3,269 MAGs excluded from training
  • Table S12 Read-level recall of the best model (Embed + LSTM + Attention) on fixed-length reads simulated based on error profiles of different sequencing platforms using 3,269 MAGs excluded from training
  • Table S13 Read-level precision and recall of genus classification measured on the ten mock communities under different confidence scores
  • Table S14 LEfSe analysis on the species profiles of 106 subjects from iHMP
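
The read-level metrics reported in Tables S2 and S9-S13 can be recomputed from per-read predictions. The sketch below assumes the common convention that precision is computed over classified reads only and recall over all reads; consult the paper for the exact definitions used. The function name `read_level_metrics` is ours, not part of the repository.

```python
# Minimal sketch of read-level precision/recall, assuming the common convention:
#   precision = correct predictions / reads that received a prediction
#   recall    = correct predictions / all reads
# (Check the DeepMicrobes paper for the exact definitions used in the tables.)

def read_level_metrics(true_labels, predicted_labels, unclassified=None):
    """true_labels and predicted_labels are parallel lists of taxon labels;
    a prediction equal to `unclassified` means the read was left unassigned."""
    total = len(true_labels)
    classified = 0
    correct = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if pred == unclassified:
            continue
        classified += 1
        if pred == truth:
            correct += 1
    precision = correct / classified if classified else 0.0
    recall = correct / total if total else 0.0
    return precision, recall

# Example: three reads, one left unclassified, one misassigned.
print(read_level_metrics(["s1", "s2", "s3"], ["s1", None, "s2"]))  # (0.5, 0.333...)
```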

Labeled genome sequences used to create the training set

  • Genome collections for the species/genus model
  • Training labels are encoded in the sequence IDs (an illustrative parsing sketch follows below)
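
The exact header layout is defined by the DeepMicrobes conversion scripts and is not reproduced here. Purely as an illustration, the sketch below assumes a hypothetical `>label|<integer>|<original_id>` header and shows how label/sequence pairs could be recovered from such labeled FASTA records.

```python
# Illustration only: recover training labels from FASTA headers, assuming a
# hypothetical ">label|<integer>|<original_id>" layout. The real layout is
# defined by the DeepMicrobes conversion scripts; adjust the parsing accordingly.

def iter_labeled_records(fasta_path):
    """Yield (label, sequence) pairs from a labeled FASTA file."""
    label, chunks = None, []
    with open(fasta_path) as handle:
        for line in handle:
            line = line.rstrip()
            if line.startswith(">"):
                if label is not None:
                    yield label, "".join(chunks)
                # Hypothetical header: >label|42|original_sequence_id
                label = int(line[1:].split("|")[1])
                chunks = []
            else:
                chunks.append(line)
    if label is not None:
        yield label, "".join(chunks)
```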

Vocabulary

  • The vocabulary of k-mers used in TFRecord conversion. The pre-trained DeepMicrobes models use 12-mers.
  • A k-mer and its reverse complement are represented by the same embedding vector. This representation requires merging the two forms when converting fasta/fastq sequences to TFRecord (a sketch follows below).
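
A minimal sketch of this merging, assuming the vocabulary file lists one canonical k-mer per line in token-ID order (the file layout is an assumption; the actual TFRecord conversion scripts live in the DeepMicrobes repository): each k-mer is replaced by the lexicographically smaller of itself and its reverse complement before lookup.

```python
# Sketch of canonical 12-mer tokenization, assuming the vocabulary file lists one
# canonical k-mer per line in token-ID order (an assumption; see the DeepMicrobes
# repository for the actual conversion scripts).
# For even k there are (4**k + 4**(k//2)) // 2 canonical k-mers,
# e.g. 8,390,656 for k = 12, plus any special tokens.

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, rc)

def load_vocab(path):
    """Map each canonical k-mer to its line index (token ID)."""
    with open(path) as handle:
        return {line.strip(): i for i, line in enumerate(handle)}

def tokenize(read, vocab, k=12):
    """Convert a read into token IDs; k-mers missing from the vocabulary
    (e.g. containing N) fall back to an unknown-token ID."""
    unk_id = len(vocab)
    return [
        vocab.get(canonical(read[i:i + k].upper()), unk_id)
        for i in range(len(read) - k + 1)
    ]
```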

Custom Kaiju database

  • Predicted protein sequences from the genomes of each species in the complete bacterial repertoire of the human gut
  • Custom names.dmp, nodes.dmp, and Kaiju index
  • Benchmark results on this custom database
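
As an illustration of how results on this database can be summarized, the sketch below turns Kaiju's default three-column output (classification status, read name, NCBI taxon ID) into a relative-abundance profile, resolving taxon IDs to names via the custom names.dmp. File paths and function names are placeholders.

```python
# Sketch: summarize Kaiju's default output (status, read name, taxon ID) into a
# relative-abundance profile, using the custom names.dmp for scientific names.
# File paths and function names are placeholders.
from collections import Counter

def load_scientific_names(names_dmp):
    """Map taxon ID -> scientific name from an NCBI-style names.dmp."""
    names = {}
    with open(names_dmp) as handle:
        for line in handle:
            fields = [field.strip() for field in line.split("|")]
            if len(fields) > 3 and fields[3] == "scientific name":
                names[fields[0]] = fields[1]
    return names

def kaiju_profile(kaiju_output, names_dmp):
    """Relative abundance of classified reads per taxon (scientific name)."""
    names = load_scientific_names(names_dmp)
    counts = Counter()
    with open(kaiju_output) as handle:
        for line in handle:
            parts = line.rstrip("\n").split("\t")
            if parts and parts[0] == "C":  # classified reads only
                taxid = parts[2]
                counts[names.get(taxid, taxid)] += 1
    total = sum(counts.values())
    return {taxon: count / total for taxon, count in counts.items()} if total else {}
```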

Scripts

  • Command lines used to run all the taxonomic classification tools benchmarked in the paper
  • R scripts used to generate figures