usearch-based 16S community profiling pipeline for the analysis of ribosomal amplicon sequencing data
- usearch7 from Rob Edgar's Drive5 site, as described here:
  - Edgar, R.C. (2013) UPARSE: Highly accurate OTU sequences from microbial amplicon reads. Nature Methods. PubMed: 23955772, dx.doi.org/10.1038/nmeth.2604
- The naive Bayesian RDP classifier from the RDP Project, available on GitHub, as described here:
  - Wang, Q., G. M. Garrity, J. M. Tiedje, and J. R. Cole. 2007. Naive Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy. Appl Environ Microbiol. 73(16):5261-7.
- A working MySQL server installation (contact me for a SQLite3 version)
- Edit the globals file to set the paths to the tools above and the MySQL host and database name. Leave TRUNCLEN unchanged for now.
- Look at the 0.setup script to make sure the paths to the data are correct and adjust as necessary to find the .fasta, .qual, and mapping.txt files.
- Use the EXECUTE command to run the pipeline and review the results. In particular, pay attention to the data in the 1.quality_filter.stats.log file. Using the rules described on Rob Edgar's site, decide on the TRUNCLEN and possibly the MAXEE parameters.
- The 1.quality_filter.stats.log file reports the % of reads falling into each read length bin, both per bin and cumulatively. The choice of TRUNCLEN is a trade-off between the accumulated % of reads and the avgEE (cumulative error rate average).
- Update globals with the chosen values, then rerun the EXECUTE command and examine the output.
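
To make the TRUNCLEN / MAXEE trade-off more concrete, here is a minimal Python sketch. It is not part of the pipeline and does not read 1.quality_filter.stats.log. It assumes each read is available as a list of integer Phred scores (for example, parsed from a .qual file), and it uses the usual definition of expected errors as the sum of per-base error probabilities, 10^(-Q/10). The function names and the toy reads are made up for illustration.

```python
def expected_errors(phred_scores):
    """Expected number of errors: sum of per-base error probabilities 10^(-Q/10)."""
    return sum(10 ** (-q / 10) for q in phred_scores)


def summarize_trunclen(reads, trunclen, maxee):
    """Return (fraction of reads kept, mean expected errors of the kept reads).

    A read is kept if it is at least `trunclen` bases long and its first
    `trunclen` bases have expected errors <= `maxee`.
    """
    kept = []
    for quals in reads:
        if len(quals) < trunclen:
            continue  # reads shorter than the truncation length are discarded
        ee = expected_errors(quals[:trunclen])
        if ee <= maxee:
            kept.append(ee)
    if not kept:
        return 0.0, None
    return len(kept) / len(reads), sum(kept) / len(kept)


# Made-up Phred scores for three toy reads (illustration only, not real data).
toy_reads = [
    [30] * 10,            # uniformly good quality
    [30] * 6 + [10] * 4,  # quality drops off toward the end
    [20] * 5,             # short read
]

for trunclen in (5, 10):
    fraction, mean_ee = summarize_trunclen(toy_reads, trunclen, maxee=0.1)
    print(trunclen, fraction, mean_ee)
```

In this toy example, the shorter truncation length keeps every read, while the longer one drops both the short read and the read with a low-quality tail. A longer TRUNCLEN keeps more sequence per read but retains fewer reads, and the accumulated error grows with length.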