Manuscript outline
nhoffman commented
Intro
- Problem statement:
  - NCBI records vary in quality
  - Records are not available for download as a single data set
  - Annotation is inconsistent or difficult to piece together
- Previous 16S data sets
  - RDP
  - Greengenes
  - NCBI BioProject?
  - SILVA
  - 16S-ITGDB - https://www.frontiersin.org/journals/bioinformatics/articles/10.3389/fbinf.2022.905489/full
  - GSR-DB - https://journals.asm.org/doi/10.1128/msystems.00950-23
- Summarize ya16sdb features
  - annotation
  - outlier detection (includes Plotly website)
  - sequence subsets by confidence
Methods
- ...
Results/Discussion
- Record counts in each category (16S genes, whole genomes, taxcheck pass vs. fail, RefSeq, reference sequences)
- Outlier detection and taxcheck outcomes for each subset
- Discrepancies between taxcheck and outlier detection
- Maybe: are there any predictors of outliers (e.g., by year, source, etc.)?
TODOs
- Start a group Zotero library (YM)
- Gather literature (group)
- Chris: begin methods in README or elsewhere in repo
- Create OneDrive doc for MS (NH)
- Start authoring problem statement (NH)
crosenth commented
@yeemey - Here are some more sources we have collected over the years for this project:
- Cole, James R., et al. “Ribosomal Database Project: Data and Tools for High Throughput rRNA Analysis.” Nucleic Acids Research, vol. 42, no. Database issue, Jan. 2014, pp. D633–42, https://doi.org/10.1093/nar/gkt1244.
- Entrez Programming Utilities Help. National Center for Biotechnology Information (US), 2010.
- Federhen, Scott. “Type Material in the NCBI Taxonomy Database.” Nucleic Acids Research, vol. 43, no. D1, Jan. 2015, pp. D1086–98, https://doi.org/10.1093/nar/gku1127.
- Hoffman, Noah, et al. “GitHub - fhcrc/deenurp: 16S rRNA Gene Sequence Curation and Phylogenetic Reference Set Creation.” GitHub, https://github.com/fhcrc/deenurp. Accessed 23 July 2021.
- “GitHub - fhcrc/taxtastic: Create and Maintain Phylogenetic ‘Reference Packages’ of Biological Sequences.” GitHub, https://github.com/fhcrc/taxtastic. Accessed 23 July 2021.
- Matsen, Frederick A., et al. “Pplacer: Linear Time Maximum-Likelihood and Bayesian Phylogenetic Placement of Sequences onto a Fixed Reference Tree.” BMC Bioinformatics, vol. 11, no. 1, Oct. 2010, p. 538, https://doi.org/10.1186/1471-2105-11-538.
- O’Leary, Nuala A., et al. “Reference Sequence (RefSeq) Database at NCBI: Current Status, Taxonomic Expansion, and Functional Annotation.” Nucleic Acids Research, vol. 44, no. D1, Jan. 2016, pp. D733–45, https://doi.org/10.1093/nar/gkv1189.
- Sayers, Eric W., et al. “Database Resources of the National Center for Biotechnology Information.” Nucleic Acids Research, vol. 48, no. D1, Jan. 2020, pp. D9–16, https://doi.org/10.1093/nar/gkz899.