
Manuscript outline




  • ...


  • Record counts in each category (16S genes, whole genomes, taxcheck pass vs. fail, RefSeq, reference sequences)
  • Outlier detection and taxcheck outcomes for each subset
  • Discrepancies between taxcheck and outlier detection
  • Maybe: are there any predictors of outliers (e.g., year, source, etc.)?
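
As a starting point for the taxcheck vs. outlier comparison listed above, here is a minimal sketch of how the two outcomes could be cross-tabulated. It is not taken from the repo: the record layout, the ids, and the helper names `cross_tabulate` and `discrepancies` are all hypothetical. It also assumes that a "discrepancy" means a record that passes taxcheck but is flagged as an outlier, or fails taxcheck but is not flagged.

```python
from collections import Counter

# Hypothetical records: (sequence id, passed taxcheck?, flagged as outlier?).
# The ids and values are made up purely to illustrate the tabulation.
records = [
    ("seq1", True, False),
    ("seq2", True, True),
    ("seq3", False, True),
    ("seq4", False, False),
    ("seq5", True, False),
]


def cross_tabulate(rows):
    """Count records for each (taxcheck_pass, is_outlier) combination."""
    return Counter((passed, outlier) for _, passed, outlier in rows)


def discrepancies(rows):
    """Return ids where the two checks disagree.

    Assumption: the checks "agree" when a record passes taxcheck and is not
    an outlier, or fails taxcheck and is an outlier. Every other combination
    (pass + outlier, fail + not outlier) is reported as a discrepancy.
    """
    return [seq_id for seq_id, passed, outlier in rows if passed == outlier]


table = cross_tabulate(records)
disagree = discrepancies(records)

for (passed, outlier), count in sorted(table.items()):
    print(f"taxcheck_pass={passed} outlier={outlier}: {count}")
print("discrepant ids:", disagree)
```

The same counts could be computed per subset (16S genes, whole genomes, RefSeq, and so on) by filtering the records before calling the helpers.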


  • Start a group Zotero library (YM)
  • Gather literature (group)
  • Chris: begin methods in README or elsewhere in repo
  • Create OneDrive doc for MS (NH)
  • Start authoring problem statement (NH)

@yeemey - Here are some more sources we have collected over the years for this project:

  1. Cole, James R., et al. “Ribosomal Database Project: Data and Tools for High Throughput rRNA Analysis.” Nucleic Acids Research, vol. 42, no. Database issue, Jan. 2014, pp. D633–42, https://doi.org/10.1093/nar/gkt1244.
  2. Entrez Programming Utilities Help. National Center for Biotechnology Information (US), 2010.
  3. Federhen, Scott. “Type Material in the NCBI Taxonomy Database.” Nucleic Acids Research, vol. 43, no. D1, Jan. 2015, pp. D1086–98, https://doi.org/10.1093/nar/gku1127.
  4. Hoffman, Noah, et al. “GitHub - fhcrc/deenurp: 16S rRNA Gene Sequence Curation and Phylogenetic Reference Set Creation.” GitHub, https://github.com/fhcrc/deenurp. Accessed 23 July 2021.
  5. “GitHub - fhcrc/taxtastic: Create and Maintain Phylogenetic ‘Reference Packages’ of Biological Sequences.” GitHub, https://github.com/fhcrc/taxtastic. Accessed 23 July 2021.
  6. Matsen, Frederick A., et al. “Pplacer: Linear Time Maximum-Likelihood and Bayesian Phylogenetic Placement of Sequences onto a Fixed Reference Tree.” BMC Bioinformatics, vol. 11, no. 1, Oct. 2010, p. 538, https://doi.org/10.1186/1471-2105-11-538.
  7. O’Leary, Nuala A., et al. “Reference Sequence (RefSeq) Database at NCBI: Current Status, Taxonomic Expansion, and Functional Annotation.” Nucleic Acids Research, vol. 44, no. D1, Jan. 2016, pp. D733–45, https://doi.org/10.1093/nar/gkv1189.
  8. Sayers, Eric W., et al. “Database Resources of the National Center for Biotechnology Information.” Nucleic Acids Research, vol. 48, no. D1, Jan. 2020, pp. D9–16, https://doi.org/10.1093/nar/gkz899.