birddb

Import and query all of eBird locally

The goal of birddb is to provide a relational database interface to a local copy of eBird. birddb works by importing the text-based eBird Basic Dataset into a local parquet file using arrow, which can then be queried as a relational database using the familiar dplyr interface. dplyr translates R-based queries into SQL commands, which are passed to duckdb, which in turn queries the parquet data. Unlike the native arrow interface, duckdb supports the full set of SQL operations, including windowed operations such as group_by() + summarise() as well as table joins.
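To sketch how this translation layer works: dbplyr's show_query() can display the SQL that a dplyr pipeline generates before duckdb executes it. (This assumes the data have already been imported, as described in "Getting Started" below.)

```r
library(birddb)
library(dplyr)

# Open the local parquet-backed observations table
observations <- observations()

# show_query() prints the SQL that dbplyr generates for this pipeline;
# duckdb only executes it when results are actually requested.
observations %>%
  count(scientific_name, sort = TRUE) %>%
  show_query()
```

Because queries are lazy, duckdb scans the parquet files only when results are needed, e.g. when printing or calling collect().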

Installation

You can install the development version of birddb from GitHub with:

# install.packages("devtools")
devtools::install_github("cboettig/birddb")

Getting Started

library(birddb)
library(dplyr)

Before you can use birddb you will need to download the latest version of the eBird Basic Dataset from http://ebird.org/ebird/data/download. Once you have obtained a downloaded copy of the tar file, birddb can import it for you. The one-time import of the full data dump is a little slow (about 1 hr in my benchmark) due to the time required to extract the tar file and convert the text data into parquet format.

For illustration and testing purposes, we will use the small eBird sample dataset included in the package:

observations_tar <- birddb::sample_observation_data()
checklists_tar <- birddb::sample_checklist_data()

Importing will now create the local parquet-based copies in the default directory given by ebird_data_dir(). Users can set an alternative location by setting the environmental variable BIRDDB_HOME to the desired path.
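For example, to store the parquet files on a larger disk, set BIRDDB_HOME before importing (the path below is purely illustrative):

```r
# birddb reads BIRDDB_HOME to decide where to store its parquet files.
# Set this before calling import_ebird(); the path here is a placeholder.
Sys.setenv(BIRDDB_HOME = "/path/to/large/disk/birddb")
```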

import_ebird(observations_tar)
#> Importing observations data from the eBird Basic Dataset: ebd_relAug-2021.tar
#> Extracting from tar archive...
#> Importing to parquet...
import_ebird(checklists_tar)
#> Importing checklists data from the eBird Basic Dataset: ebd_sampling_relAug-2021.tar
#> Extracting from tar archive...
#> Importing to parquet...

Once the data have been downloaded and imported successfully, a user can access the full eBird dataset quite quickly:

observations <- observations()
checklists <- checklists()

To see the available columns in each dataset use:

colnames(observations)
#>  [1] "global_unique_identifier"   "last_edited_date"          
#>  [3] "taxonomic_order"            "category"                  
#>  [5] "common_name"                "scientific_name"           
#>  [7] "subspecies_common_name"     "subspecies_scientific_name"
#>  [9] "observation_count"          "breeding_code"             
#> [11] "breeding_category"          "behavior_code"             
#> [13] "age_sex"                    "country"                   
#> [15] "country_code"               "state"                     
#> [17] "state_code"                 "county"                    
#> [19] "county_code"                "iba_code"                  
#> [21] "bcr_code"                   "usfws_code"                
#> [23] "atlas_block"                "locality"                  
#> [25] "locality_id"                "locality_type"             
#> [27] "latitude"                   "longitude"                 
#> [29] "observation_date"           "time_observations_started" 
#> [31] "observer_id"                "sampling_event_identifier" 
#> [33] "protocol_type"              "protocol_code"             
#> [35] "project_code"               "duration_minutes"          
#> [37] "effort_distance_km"         "effort_area_ha"            
#> [39] "number_observers"           "all_species_reported"      
#> [41] "group_identifier"           "has_media"                 
#> [43] "approved"                   "reviewed"                  
#> [45] "reason"                     "trip_comments"             
#> [47] "species_comments"
colnames(checklists)
#>  [1] "last_edited_date"          "country"                  
#>  [3] "country_code"              "state"                    
#>  [5] "state_code"                "county"                   
#>  [7] "county_code"               "iba_code"                 
#>  [9] "bcr_code"                  "usfws_code"               
#> [11] "atlas_block"               "locality"                 
#> [13] "locality_id"               "locality_type"            
#> [15] "latitude"                  "longitude"                
#> [17] "observation_date"          "time_observations_started"
#> [19] "observer_id"               "sampling_event_identifier"
#> [21] "protocol_type"             "protocol_code"            
#> [23] "project_code"              "duration_minutes"         
#> [25] "effort_distance_km"        "effort_area_ha"           
#> [27] "number_observers"          "all_species_reported"     
#> [29] "group_identifier"          "trip_comments"

Now, we can use dplyr to perform standard queries. For example, to see the number of observations for each species in the sample dataset:

observations %>% count(scientific_name, sort = TRUE)
#> # Source:     lazy query [?? x 2]
#> # Database:   duckdb_connection
#> # Ordered by: desc(n)
#>    scientific_name                n
#>    <chr>                      <dbl>
#>  1 Pycnonotus sinensis          275
#>  2 Pycnonotus jocosus           270
#>  3 Streptopelia chinensis       258
#>  4 Milvus migrans               251
#>  5 Copsychus saularis           228
#>  6 Zosterops simplex            201
#>  7 Acridotheres cristatellus    181
#>  8 Passer montanus              174
#>  9 Pterorhinus perspicillatus   172
#> 10 Motacilla alba               172
#> # … with more rows
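Because duckdb supports grouped summaries and joins, more involved queries are also possible. As a sketch using column names from the listings above, one could summarise checklist effort by county, or join observations to their checklist metadata via the shared sampling_event_identifier column:

```r
# Average checklist duration by county, via group_by() + summarise()
checklists %>%
  group_by(county) %>%
  summarise(mean_minutes = mean(duration_minutes, na.rm = TRUE)) %>%
  arrange(desc(mean_minutes))

# Join each observation to its checklist metadata; both tables share
# the sampling_event_identifier column.
observations %>%
  select(scientific_name, sampling_event_identifier) %>%
  left_join(checklists, by = "sampling_event_identifier")
```

As before, these are lazy queries; call collect() to pull the results into an in-memory tibble.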