The gbif_registrar
is a set of Python workflows designed for uploading Environmental Data Initiative (EDI) DwCA-Event core formatted datasets to the Biodiversity Information Facility (GBIF). This enhances discovery and utilization of EDI data, contributing to improved biodiversity insights.
Note, the term "dataset" used here is synonymous with "data package".
The gbif_registrar
operates through three steps:
- Register: Associates an EDI dataset identifier with a GBIF group identifier and stores this information locally in a registrations file for future reference.
- Validate: Runs a validation check on the registration file to ensure all necessary content is in place.
- Upload: Initiates the upload of the newly registered dataset to GBIF.
For subsequent versions of an EDI dataset, the process repeats. The new version is added to the registrations file under the data package series GBIF group ID, undergoes validation for required content, and replaces the previous instance on GBIF.
- Install: The
gbif_registrar
may be installed from GitHub using pip:
pip install git+https://github.com/EDIorg/gbif_registrar.git#egg=gbif_registrar
- Initialize Configuration: Locally configure
gbif_registrar
to run on test or production environments and add credentials for authentication. - Initialize Registration File: Create a registration file to store information about EDI datasets on GBIF.
- Build the Main Workflow: Develop the main.py workflow, outlining the major steps of the process. Below is an example of such a workflow:
from gbif_registrar.configure import load_configuration, unload_configuration
from gbif_registrar.register import register_dataset
from gbif_registrar.validate import validate_registrations
from gbif_registrar.upload import upload_dataset
def main(local_dataset_id, registration_file, configuration_file):
"""Register a dataset and upload to GBIF.
Parameters
----------
local_dataset_id : str
The identifier of a dataset in the EDI repository.
registration_file : str
Path of the registrations file.
configuration_file : str
Path of the configuration file.
Returns
-------
None
The registrations file written back to itself as a .csv, and containing
the new registration record and synchronization status.
Examples
--------
>>> main("edi.929.2", "registrations.csv", "configuration.json")
"""
load_configuration(configuration_file)
register_dataset(local_dataset_id, registration_file)
validate_registrations(registration_file)
upload_dataset(local_dataset_id, registration_file)
unload_configuration()
- Run the Workflow: Run the workflow from the command line, passing in the required arguments:
main("edi.929.2", "registrations.csv", "configuration.json")
If a registration fails:
- Attempt to fix missing components by running the
complete_registration_records
function. - If the issue persists, manually diagnose the issue (see
gbif_registrar
messages) and edit the registrations file. - Rerun the validation check to ensure completeness.
- To preserve acquired data and prevent duplication issues on GBIF, results are continuously written to the registration file.
- Integration tests that upload staged EDI datasets to the GBIF test server are run manually to save time in the development cycle and to respect GBIF storage space. To run the integration test, uncomment the "skip" marker on test_upload_dataset_real_requests in the test suite.
- The
gbif_registrar
wraps the EDI and GBIF APIs. We therefore encourage maintainers of this package to subscribe to the EDI PASTA GitHub repository and the GBIF API mailing list for timely updates on outages and changes, so that the codebase can be updated accordingly.
Interested in contributing? Check out the contributing guidelines. Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by its terms.
gbif_registrar
was created by Colin Smith and Margaret O'Brien. It is licensed under the terms of the MIT license.