/BOLD-mineR

:shipit: Wrapper for BOLD API (+ barcode auditing)

Primary LanguageRMIT LicenseMIT

DOI

BOLD-mineR

DNA barcodes are not only used by researchers, but also by decision-makers (e.g. to control food fraud or illegal species commercialization). The big-scale demand of both online services and information to identify species contrasts with the limited ways to automatize either species identification or assessment of barcode quality per species, directly from the web interface.

BOLD system is the main database of DNA barcode worldwide. This database has been stepply growing through time since its release (Ratnasingham and Hebert 2007) and its accessibility is pivotal for projects focused on DNA barcodes. Nowdays APIs, to some extant, are offering access to well-know databases such as FishBase, WoRMS or BOLD. Despite BOLD's API mostly involves only public data, this leverages its data retrieving for wider purposes. The API's applicability, however, seems to be wholly held up by its own needs of having either standalone softwares or functions which could wrap up blocks of information. The main objective of these functions (i.e. BOLD-mineR's functions) is justly circumscribe the BOLD's API performance with R-based scripts to get insights about DNA barcodes by using public information.

SpecimenData

This function SpecimenData let us mine associated metadata from any specimen according to following arguments:

  • taxon (e.g. Aves|Elasmobranchii).
  • ids (e.g. ANGBF12704-15).
  • bin (e.g. BOLD:AAA4689).
  • container (e.g. FIPP).
  • institution (e.g. Smithsonian Institution).
  • researchers (including identifiers and collectors).
  • geo (e.g. Peru).

You can find this function here: SpecimenData

If we want to get information, for instance, from all specimens of elasmobranchs distributed in Peru and stored in BOLD, we can use the following line:

specimendata <- SpecimenData(taxon = "Elasmobranchii", geo = "Peru")

Then, we use tibble package to only assess its dimension:

tibble::as.tibble(specimendata)
# A tibble: 10 x 68
       processid sampleid recordID catalognum fieldnum      institution_storing collection_code      bin_uri phylum_taxID phylum_name class_taxID
 *        <fctr>   <fctr>    <int>     <fctr>   <fctr>                   <fctr>           <lgl>       <fctr>        <int>      <fctr>       <int>
 1 ANGBF10913-15 KJ146022  5651960            KJ146022 Mined from GenBank, NCBI              NA BOLD:AAB0473           18    Chordata       34196
 2 ANGBF10914-15 KJ146023  5651961            KJ146023 Mined from GenBank, NCBI              NA BOLD:AAB0473           18    Chordata       34196
 3 ANGBF10915-15 KJ146024  5651962            KJ146024 Mined from GenBank, NCBI              NA BOLD:AAB0473           18    Chordata       34196
 4 ANGBF10916-15 KJ146025  5651963            KJ146025 Mined from GenBank, NCBI              NA BOLD:AAB0473           18    Chordata       34196
 5 ANGBF10917-15 KJ146026  5651964            KJ146026 Mined from GenBank, NCBI              NA BOLD:AAB0473           18    Chordata       34196
 6 ANGBF11015-15 KJ146027  5652062            KJ146027 Mined from GenBank, NCBI              NA BOLD:ACE6212           18    Chordata       34196
 7 ANGBF11043-15 KJ146021  5652090            KJ146021 Mined from GenBank, NCBI              NA BOLD:ABZ0850           18    Chordata       34196
 8 ANGBF11064-15 KJ146044  5652111            KJ146044 Mined from GenBank, NCBI              NA BOLD:AAA7096           18    Chordata       34196
 9 ANGBF11065-15 KJ146043  5652112            KJ146043 Mined from GenBank, NCBI              NA BOLD:AAA7096           18    Chordata       34196
10 ANGBF11066-15 KJ146042  5652113            KJ146042 Mined from GenBank, NCBI              NA BOLD:AAA7096           18    Chordata       34196
# ... with 57 more variables: class_name <fctr>, order_taxID <int>, order_name <fctr>, family_taxID <int>, family_name <fctr>, subfamily_taxID <lgl>,
#   subfamily_name <lgl>, genus_taxID <int>, genus_name <fctr>, species_taxID <int>, species_name <fctr>, subspecies_taxID <lgl>, subspecies_name <lgl>,
#   identification_provided_by <fctr>, identification_method <lgl>, identification_reference <fctr>, tax_note <lgl>, voucher_status <lgl>,
#   tissue_type <lgl>, collection_event_id <lgl>, collectors <fctr>, collectiondate_start <lgl>, collectiondate_end <lgl>, collectiontime <lgl>,
#   collection_note <lgl>, site_code <lgl>, sampling_protocol <fctr>, lifestage <lgl>, sex <lgl>, reproduction <lgl>, habitat <fctr>,
#   associated_specimens <lgl>, associated_taxa <lgl>, extrainfo <fctr>, notes <fctr>, lat <dbl>, lon <dbl>, coord_source <lgl>, coord_accuracy <lgl>,
#   elev <lgl>, depth <int>, elev_accuracy <lgl>, depth_accuracy <lgl>, country <fctr>, province_state <fctr>, region <fctr>, sector <fctr>,
#   exactsite <lgl>, image_ids <lgl>, image_urls <lgl>, media_descriptors <lgl>, captions <lgl>, copyright_holders <lgl>, copyright_years <lgl>,
#   copyright_licenses <lgl>, copyright_institutions <lgl>, photographers <lgl>

We can also add sequences, and its associated information, from each specimen by using the argument seq = "combined":

tibble::as.tibble(SpecimenData(taxon = "Elasmobranchii", geo = "Peru", seq = "combined"))
# A tibble: 47 x 80
       processid sampleid recordID catalognum fieldnum      institution_storing collection_code      bin_uri phylum_taxID phylum_name class_taxID
          <fctr>   <fctr>    <int>     <fctr>   <fctr>                   <fctr>           <lgl>       <fctr>        <int>      <fctr>       <int>
 1 ANGBF10913-15 KJ146022  5651960            KJ146022 Mined from GenBank, NCBI              NA BOLD:AAB0473           18    Chordata       34196
 2 ANGBF10914-15 KJ146023  5651961            KJ146023 Mined from GenBank, NCBI              NA BOLD:AAB0473           18    Chordata       34196
 3 ANGBF10915-15 KJ146024  5651962            KJ146024 Mined from GenBank, NCBI              NA BOLD:AAB0473           18    Chordata       34196
 4 ANGBF10916-15 KJ146025  5651963            KJ146025 Mined from GenBank, NCBI              NA BOLD:AAB0473           18    Chordata       34196
 5 ANGBF10917-15 KJ146026  5651964            KJ146026 Mined from GenBank, NCBI              NA BOLD:AAB0473           18    Chordata       34196
 6 ANGBF11015-15 KJ146027  5652062            KJ146027 Mined from GenBank, NCBI              NA BOLD:ACE6212           18    Chordata       34196
 7 ANGBF11043-15 KJ146021  5652090            KJ146021 Mined from GenBank, NCBI              NA BOLD:ABZ0850           18    Chordata       34196
 8 ANGBF11064-15 KJ146044  5652111            KJ146044 Mined from GenBank, NCBI              NA BOLD:AAA7096           18    Chordata       34196
 9 ANGBF11065-15 KJ146043  5652112            KJ146043 Mined from GenBank, NCBI              NA BOLD:AAA7096           18    Chordata       34196
10 ANGBF11066-15 KJ146042  5652113            KJ146042 Mined from GenBank, NCBI              NA BOLD:AAA7096           18    Chordata       34196
# ... with 37 more rows, and 69 more variables: class_name <fctr>, order_taxID <int>, order_name <fctr>, family_taxID <int>, family_name <fctr>,
#   subfamily_taxID <lgl>, subfamily_name <lgl>, genus_taxID <int>, genus_name <fctr>, species_taxID <int>, species_name <fctr>, subspecies_taxID <lgl>,
#   subspecies_name <lgl>, identification_provided_by <fctr>, identification_method <lgl>, identification_reference <fctr>, tax_note <lgl>,
#   voucher_status <lgl>, tissue_type <lgl>, collection_event_id <lgl>, collectors <fctr>, collectiondate_start <lgl>, collectiondate_end <lgl>,
#   collectiontime <lgl>, collection_note <lgl>, site_code <lgl>, sampling_protocol <fctr>, lifestage <lgl>, sex <lgl>, reproduction <lgl>,
#   habitat <fctr>, associated_specimens <lgl>, associated_taxa <lgl>, extrainfo <fctr>, notes <fctr>, lat <dbl>, lon <dbl>, coord_source <lgl>,
#   coord_accuracy <lgl>, elev <lgl>, depth <int>, elev_accuracy <lgl>, depth_accuracy <lgl>, country <fctr>, province_state <fctr>, region <fctr>,
#   sector <fctr>, exactsite <lgl>, image_ids <lgl>, image_urls <lgl>, media_descriptors <lgl>, captions <lgl>, copyright_holders <lgl>,
#   copyright_years <lgl>, copyright_licenses <lgl>, copyright_institutions <lgl>, photographers <lgl>, sequenceID <int>, markercode <fctr>,
#   genbank_accession <fctr>, nucleotides <fctr>, trace_ids <fctr>, trace_names <fctr>, trace_links <fctr>, run_dates <fctr>, sequencing_centers <fctr>,
#   directions <fctr>, seq_primers <fctr>, marker_codes <fctr>

However, if only sequences are desired, the argument seq = "combined" should change to seq = "only":

SpecimenData(taxon = "Elasmobranchii", geo = "Peru", seq = "only")
47 DNA sequences in binary format stored in a list.

Mean sequence length: 658.851 
   Shortest sequence: 586 
    Longest sequence: 712 

Labels:
ANGBF10913-15|Alopias pelagicus|COI-5P|KJ146022
ANGBF10914-15|Alopias pelagicus|COI-5P|KJ146023
ANGBF10915-15|Alopias pelagicus|COI-5P|KJ146024
ANGBF10916-15|Alopias pelagicus|COI-5P|KJ146025
ANGBF10917-15|Alopias pelagicus|COI-5P|KJ146026
ANGBF11015-15|Carcharhinus brachyurus|COI-5P|KJ146027
...

Base composition:
    a     c     g     t 
0.256 0.264 0.166 0.314 

Above sequence can be also stored by using write.dna() function from the ape package:

seqs <- SpecimenData(taxon = "Elasmobranchii", geo = "Peru", seqs = "only")[1:5] ##we only select five sequences
write.dna(seqs, 'secuencias.txt', format = 'fasta', nbcol = 1, colw = 90)

ID_engine

ID_engine finds best matches between a query sequence and a database of BOLD by using BLASTn-based algorithms. Arguments of this function are query and db. The first one are query sequences and the second one are one of avilable databases in BOLD:

  • COX1
  • COX1_SPECIES
  • COX1_SPECIES_PUBLIC
  • COX1_L640bp

This script also take account for those sequences which are not, by mistake, at the right sense and sends them to perform a BLAST search through its API

You can find this function here: ID_engine

If we want to identify at species level, for instance, all sequence samples stored in secuencias.txt, we should run forthcoming lines:

out <- ID_engine(query = read.FASTA('secuencias.txt'), db = "COX1_SPECIES")

First seven rows and first, fifth and sixth column of each element from the result list out are shown:

lapply(out, function(x){
        x[1:7, c(1,5,6)]
})
$`ANGBF10913-15|Alopias pelagicus|COI-5P|KJ146022`
             ID taxonomicidentification similarity
1   PHANT458-08             Lamniformes          1
2   IRREK872-08             Lamniformes          1
3   IRREK873-08             Lamniformes          1
4   IRREK874-08             Lamniformes          1
5   IRREK876-08             Lamniformes          1
6 ANGBF12625-15       Alopias pelagicus          1
7 ANGBF12624-15       Alopias pelagicus          1

$`ANGBF10914-15|Alopias pelagicus|COI-5P|KJ146023`
             ID taxonomicidentification similarity
1 ANGBF10914-15       Alopias pelagicus          1
2   ESHKB029-07       Alopias pelagicus          1
3 ANGBF10913-15       Alopias pelagicus     0.9985
4   PHANT458-08             Lamniformes     0.9985
5   IRREK873-08             Lamniformes     0.9985
6   IRREK874-08             Lamniformes     0.9985
7   IRREK876-08             Lamniformes     0.9985

$`ANGBF10915-15|Alopias pelagicus|COI-5P|KJ146024`
             ID taxonomicidentification similarity
1 ANGBF12626-15       Alopias pelagicus          1
2 ANGBF12623-15       Alopias pelagicus          1
3 ANGBF11723-15       Alopias pelagicus          1
4 ANGBF10915-15       Alopias pelagicus          1
5   ESHKB036-07       Alopias pelagicus          1
6   ESHKB030-07       Alopias pelagicus          1
7   ESHKB031-07       Alopias pelagicus          1

$`ANGBF10916-15|Alopias pelagicus|COI-5P|KJ146025`
             ID taxonomicidentification similarity
1 ANGBF11731-15       Alopias pelagicus          1
2 ANGBF10916-15       Alopias pelagicus          1
3 ANGBF10915-15       Alopias pelagicus     0.9986
4   ESHKB031-07       Alopias pelagicus     0.9985
5   ESHKB034-07       Alopias pelagicus     0.9985
6 ANGBF11723-15       Alopias pelagicus     0.9984
7   ESHKB030-07       Alopias pelagicus     0.9984

$`ANGBF10917-15|Alopias pelagicus|COI-5P|KJ146026`
             ID taxonomicidentification similarity
1 ANGBF11729-15       Alopias pelagicus          1
2 ANGBF10917-15       Alopias pelagicus          1
3   ESHKB036-07       Alopias pelagicus          1
4 ANGBF10915-15       Alopias pelagicus     0.9986
5   ESHKB031-07       Alopias pelagicus     0.9985
6   ESHKB034-07       Alopias pelagicus     0.9985
7 ANGBF11723-15       Alopias pelagicus     0.9984

addAudition

This function adds an audition step (Oliveira et al. 2016) to each selected specimen by ID_engine() (see above), given a certain threshold. This function, in turn, uses another function called AuditionBarcodes(). This prior function is coupled with addAudition() and can validate species names by taking accepted names from Worms database.

addAudition() function can be found here:

In order to test its efficiency, species-level identification of samples stored in secuencias.txt is conducted by using a threshold = 0.99 (i.e. 99% of similarity):

addAudition(seqs = read.FASTA('secuencias.txt'), threshold = 0.99)
  |**********************************************************************************************************************************************| 100%
                                          Samples  Match           Species Grades                                Observations
1 ANGBF10913-15|Alopias pelagicus|COI-5P|KJ146022 Unique Alopias pelagicus      A There were 44 matches. External congruence.
2 ANGBF10914-15|Alopias pelagicus|COI-5P|KJ146023 Unique Alopias pelagicus      A There were 34 matches. External congruence.
3 ANGBF10915-15|Alopias pelagicus|COI-5P|KJ146024 Unique Alopias pelagicus      A There were 70 matches. External congruence.
4 ANGBF10916-15|Alopias pelagicus|COI-5P|KJ146025 Unique Alopias pelagicus      A There were 34 matches. External congruence.
5 ANGBF10917-15|Alopias pelagicus|COI-5P|KJ146026 Unique Alopias pelagicus      A There were 40 matches. External congruence.

AuditionBarcodes

Despite AuditionBarcodes() function is coupled with addAudition() function, it can also work with just a list of names. Furthermore, there is an argument which enables to chose if sequences from GenBank are considered. It is pending, however, assess whether these sequences used to assess barcode's quality come from either a published article or direct submission:

species <- c("Ceratostoma foliatum" , "Ocinebrina aciculata",
              "Morula margariticola", "Concholepas concholepas",
              "Nucella emarginata" ,"Thais luteostoma", "Thais bufo")
              
AuditionBarcodes(species, include_ncbi = F, validate_name = T)
                  species Grades                                                         Observations                                                                                                               BIN_structure
1    Ceratostoma foliatum      A                                 Matched BIN with external congruence                                                                                  'BOLD:ABV5037':{'Ceratostoma foliatum':32}
2    Ocinebrina aciculata      B                            Matched BIN with internal congruence only                                                                                  'BOLD:ACG0291':{'Ocinebrina aciculata':18}
3    Morula margariticola      C                                                         Splitted BIN                                    'BOLD:AAD8263':{'Drupella margariticola':5}, 'BOLD:AAE7335':{'Drupella margariticola':1}
4 Concholepas concholepas      D Insufficient data. Institution storing: 1. Total specimen records: 1                                                                                                                            
5      Nucella emarginata     E*                                                           Merged BIN                                                                   'BOLD:ABA3756':{'Nucella emarginata':16,'Nucella lima':7}
6        Thais luteostoma    E**                                                         Mixtured BIN 'BOLD:ACB7390':{'Reishia bronni':3,'Reishia luteostoma':16}, 'BOLD:AAW6905':{'Reishia clavigera':11,'Reishia luteostoma':1}
7              Thais bufo      F                           Barcodes mined from GenBank or unvouchered                                                                                                                            

Please notice that grades are obtained with accepted names of species according to WoRMS database by using the worms.py script. Hence, since currently accepted names within species vector has not been figured out, unevenness between the column BIN_structure and species could pop up. For this reason, on above example, Morula margariticola has another name into BIN structure column (i.e. Drupella margariticola) which actually is its currently accepted name.