Gathering barcoding reference sequences for the Kruger National Park
Two scripts gather sequences from GenBank and BOLD. "GB_BOLD_seq_download.R" takes a list of Latin binomials and downloads sequences using the rentrez and bold R packages. The download step can be modified to retrieve reference sequences for other marker genes. "CO1_from_GB_mito_genomes.R" follows the same process, but uses rentrez and modified scripts from the PrimerMiner R package to download whole mitochondrial genomes and extract COI sequences.
After downloading the sequences, "format_GB_BOLD_refLib.sh", "generate_taxonomy.R", and "format_refLib_dada_2.R" are used to clean up the downloaded FASTA files, generate the taxonomy file, and format the library for use with dada2's built-in RDP classifier.
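As a rough illustration of the final formatting step, the sketch below rewrites FASTA headers into the semicolon-delimited lineage that dada2's assignTaxonomy reads (">Kingdom;Phylum;Class;Order;Family;Genus;"). It is not taken from the scripts named above: the file names, the tab-separated taxonomy layout (accession followed by six ranks), and the example accessions are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical inputs: file names, accessions, and sequences are illustrative only.
workdir=$(mktemp -d)

# Taxonomy table: accession, then Kingdom to Genus, tab-separated (assumed layout).
printf 'ACC001\tAnimalia\tChordata\tMammalia\tCarnivora\tFelidae\tPanthera\n' > "$workdir/taxonomy.tsv"
printf 'ACC002\tAnimalia\tChordata\tMammalia\tProboscidea\tElephantidae\tLoxodonta\n' >> "$workdir/taxonomy.tsv"

# FASTA whose headers start with the accession (assumed layout).
printf '>ACC001 Panthera leo COI\nACGTACGT\n>ACC002 Loxodonta africana COI\nTTGACCGA\n>ACC999 unknown\nGGGG\n' > "$workdir/raw.fasta"

# Pass 1 (taxonomy.tsv): store the lineage string for each accession.
# Pass 2 (raw.fasta): replace each header with its lineage;
# records with no lineage in the table are dropped.
awk -F'\t' '
  NR == FNR {
    lineage[$1] = $2 ";" $3 ";" $4 ";" $5 ";" $6 ";" $7 ";"
    next
  }
  /^>/ {
    split(substr($0, 2), parts, " ")
    keep = (parts[1] in lineage)
    if (keep) print ">" lineage[parts[1]]
    next
  }
  keep { print }
' "$workdir/taxonomy.tsv" "$workdir/raw.fasta" > "$workdir/refLib_dada2.fasta"

cat "$workdir/refLib_dada2.fasta"
```

Dropping records that lack a lineage keeps unclassifiable sequences out of the training set. Whether the repository's scripts drop or flag such records is not stated here, so treat that choice as part of the sketch.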
Beyond the custom library, additional scripts are included to format the MIDORI and terrimporter COI reference databases for use with dada2.
Contains downloaded FASTA files, whole mitochondrial genomes, and species lists generated by the Kruger National Park.
The "output" folder contains intermediate files, plus the final dada2-formatted reference sequences:
- Kingdom to Genus: "Kruger_Vertebrates_refLib_dada2.fasta"
- Species: "Kruger_Vertebrates_refLib_dada2_species.fasta"
- Phylum to Species: "Kruger_Vertebrates_refLib_dada2_phy2species.fasta"
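A quick way to check a formatted library before passing it to dada2 is to count its records and confirm that every header carries the expected number of ranks. The sketch below assumes the semicolon-delimited header style read by dada2's assignTaxonomy, with six ranks for a Kingdom to Genus file. The sample file it writes is a stand-in, not the actual contents of the "output" folder.

```shell
#!/bin/sh
# Stand-in for a Kingdom-to-Genus library; contents are illustrative only.
ref=$(mktemp)
printf '>Animalia;Chordata;Mammalia;Carnivora;Felidae;Panthera;\nACGTACGT\n' > "$ref"
# This second record is deliberately missing its genus.
printf '>Animalia;Chordata;Aves;Accipitriformes;Accipitridae;\nTTGACCGA\n' >> "$ref"

expected_ranks=6   # Kingdom, Phylum, Class, Order, Family, Genus

# Number of reference sequences (one header line per record).
n_records=$(grep -c '^>' "$ref")

# Headers whose rank count differs from the expected value.
# Every rank ends with ";", so an empty field follows the last rank.
bad_headers=$(awk -F';' -v n="$expected_ranks" '/^>/ && NF - 1 != n { print }' "$ref")

echo "records: $n_records"
if [ -n "$bad_headers" ]; then
  echo "headers without $expected_ranks ranks:"
  echo "$bad_headers"
fi
```

To run the same check on a real file, point `ref` at it, for example `ref=output/Kruger_Vertebrates_refLib_dada2.fasta` (the relative path is an assumption about where the command is run from), and remove the two `printf` lines.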