KNP_refLib

Gathering barcoding reference sequences for the Kruger National Park

Primary language: R

Scripts

Two scripts gather sequences from GenBank and BOLD. "GB_BOLD_seq_download.R" takes a list of Latin binomials and downloads sequences using the rentrez and bold R packages. "CO1_from_GB_mito_genomes.R" follows the same process, but uses rentrez and modified scripts from the PrimerMiner R package to download whole mitochondrial genomes and extract COI sequences. The scripts can be modified to download reference sequences for other marker genes.
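The download step can be sketched roughly as follows. This is a minimal illustration and not the contents of "GB_BOLD_seq_download.R": the function names, the gene-name synonyms, the "COI-5P" marker code, and the example species are all assumptions made for the example. It relies only on rentrez's entrez_search() and entrez_fetch() and on bold's bold_seq().

```r
# Hypothetical helper: build an Entrez query for one Latin binomial and a
# set of gene-name synonyms (the synonyms are an assumption, adjust as needed).
build_genbank_term <- function(binomial, gene_names = c("COI", "CO1", "COX1")) {
  genes <- paste0(gene_names, "[Gene]", collapse = " OR ")
  sprintf('"%s"[Organism] AND (%s)', binomial, genes)
}

# Hypothetical helper: search GenBank (nuccore) and return FASTA text.
fetch_genbank_fasta <- function(binomial, retmax = 50) {
  hits <- rentrez::entrez_search(db = "nuccore",
                                 term = build_genbank_term(binomial),
                                 retmax = retmax)
  if (length(hits$ids) == 0) return(character(0))  # nothing found
  rentrez::entrez_fetch(db = "nuccore", id = hits$ids, rettype = "fasta")
}

# Hypothetical helper: request sequence records from BOLD for one taxon.
fetch_bold_records <- function(binomial, marker = "COI-5P") {
  bold::bold_seq(taxon = binomial, marker = marker)
}

# Example usage (needs network access and the rentrez and bold packages,
# so it is not run here; the species are illustrative only).
if (FALSE) {
  species <- c("Panthera leo", "Loxodonta africana")
  genbank <- lapply(species, fetch_genbank_fasta)
  bold_records <- lapply(species, fetch_bold_records)
}
```

Changing the gene_names and marker arguments is one way such a sketch could be pointed at a different marker gene.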

After downloading the sequences, "format_GB_BOLD_refLib.sh", "generate_taxonomy.R", and "format_refLib_dada_2.R" are used to clean up the downloaded FASTA files, generate the taxonomy file, and format the library for use with dada2's built-in RDP classifier.
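For orientation, dada2 expects two header styles in its reference FASTA files: a semicolon-delimited lineage for assignTaxonomy(), and an "ID Genus species" header for species-level assignment. The sketch below shows one way to build such headers from a taxonomy table. It is not the code in "format_refLib_dada_2.R"; the function names, column names, and the example record are hypothetical.

```r
# Hypothetical helper: build assignTaxonomy()-style headers, e.g.
# ">Kingdom;Phylum;Class;Order;Family;Genus;", from a taxonomy data frame.
format_dada2_header <- function(tax,
                                ranks = c("Kingdom", "Phylum", "Class",
                                          "Order", "Family", "Genus")) {
  lineage <- do.call(paste, c(unname(as.list(tax[ranks])), sep = ";"))
  paste0(">", lineage, ";")
}

# Hypothetical helper: build species-level headers, ">ID Genus species".
format_species_header <- function(ids, genus, epithet) {
  paste0(">", ids, " ", genus, " ", epithet)
}

# Illustrative record (the accession "ACC0001" is a made-up placeholder).
tax <- data.frame(Kingdom = "Animalia", Phylum = "Chordata",
                  Class = "Mammalia", Order = "Carnivora",
                  Family = "Felidae", Genus = "Panthera",
                  stringsAsFactors = FALSE)

genus_header <- format_dada2_header(tax)
species_header <- format_species_header("ACC0001", "Panthera", "leo")
```

Keeping the rank order fixed across all records matters, because dada2 reads the lineage positionally.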

Beyond the custom library, additional scripts are included to format the MIDORI and terrimporter COI reference databases for use with dada2.

Data

Contains downloaded FASTA files, whole mitochondrial genomes, and species lists generated by the Kruger National Park.

Output

The "output" folder contains intermediate files, plus the final dada2-formatted reference sequences:

  • Kingdom to Genus: "Kruger_Vertebrates_refLib_dada2.fasta"
  • Species: "Kruger_Vertebrates_refLib_dada2_species.fasta"
  • Phylum to Species: "Kruger_Vertebrates_refLib_dada2_phy2species.fasta"
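As a rough sketch of how these files might be used downstream, the Kingdom-to-Genus file can be passed to dada2's assignTaxonomy() and the Species file to addSpecies(). The helper names, the "output" directory default, and the bootstrap threshold are assumptions for illustration; seqtab stands for a dada2 sequence table produced elsewhere.

```r
# Hypothetical helper: paths to the three dada2-formatted reference files.
knp_ref_paths <- function(output_dir = "output") {
  c(genus = file.path(output_dir, "Kruger_Vertebrates_refLib_dada2.fasta"),
    species = file.path(output_dir,
                        "Kruger_Vertebrates_refLib_dada2_species.fasta"),
    phy2species = file.path(output_dir,
                            "Kruger_Vertebrates_refLib_dada2_phy2species.fasta"))
}

# Hypothetical helper: classify to genus with the RDP-style classifier,
# then add species by exact matching against the species file.
classify_asvs <- function(seqtab, output_dir = "output", min_boot = 50) {
  refs <- knp_ref_paths(output_dir)
  taxa <- dada2::assignTaxonomy(seqtab, refs[["genus"]], minBoot = min_boot)
  dada2::addSpecies(taxa, refs[["species"]])
}

# Example usage (requires dada2 and an existing sequence table, so not run):
if (FALSE) {
  taxa <- classify_asvs(seqtab)
}
```

The Phylum-to-Species file would presumably need a matching taxLevels argument in assignTaxonomy(), since its rank columns differ from the default.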