
EasyDataverse

EasyDataverse is a Python library for interfacing with Dataverse installations. It can generate Python code compatible with the metadatablock configuration of a given Dataverse installation. In addition, EasyDataverse allows you to export and import datasets to and from various data formats.

Features

  • Code generation from Dataverse TSV metadata configurations.
  • Export and import of datasets to various formats (JSON, YAML, XML and HDF5).
  • Source code publication from a local or GitHub repository to a Dataverse installation.
  • Fetching of datasets from any Dataverse installation into an object-oriented structure ready for integration.

⚡️ Quick start

Get started with EasyDataverse by running the following command:

# Using PyPI
python -m pip install easyDataverse

Or build from source:

git clone https://github.com/gdcc/easyDataverse.git
cd easyDataverse
python setup.py install

⚙️ Code generation

EasyDataverse can generate code from the metadata configuration TSV files that are typically found in any Dataverse installation. To do so, use the dedicated command line interface:

~ dataverse generate --path ./blocks --out ./my_api --name pyMyAPI

For this, you need to specify the following:

  • --path - Directory where the TSV files are located.
  • --out - Directory to which the generated code will be written.
  • --name - Name of the resulting API.

🐍 Working with generated APIs

Libraries generated by EasyDataverse are an object-oriented implementation of the metadata configuration files. The resulting classes contain all information necessary for upload and download while providing a simple interface, sparing you the steep learning curve of writing the Dataverse JSON required by the native Dataverse REST API. The following demonstrates an example workflow:

Step 1: Import metadatablocks

Metadata configurations, or metadatablocks, are found in the module of the same name. These blocks can be imported directly from the API and used in an object-oriented manner. The following demonstrates this with pyDaRUS, an API generated for DaRUS, the Dataverse installation of the University of Stuttgart.

from pyDaRUS import Citation

# Initialize the metadatablock
citation = Citation()

Step 2: Add metadata

Now the citation metadata configuration can be filled with information via attribute assignment. Objects at the second hierarchy level (compounds) can be set using dedicated add_xyz methods, so the corresponding sub-classes never need to be imported.

citation.title = "My Title"
citation.add_author(name="Jan Range", affiliation="SimTech")

Step 3: Initialize a Dataset object

When all metadata has been assigned, the Dataset object is set up. This container-like structure provides all functionality needed to upload datasets to a Dataverse installation and to update them later. Metadatablocks are attached via the add_metadatablock instance method.

from pyDaRUS import Dataset

dataset = Dataset()
dataset.add_metadatablock(citation)

Step 4: Add files and directories

Optionally, you can add files and directories to the Dataset instance; these will be uploaded along with the metadata. Adding a directory also preserves the local structure of your dataset.

dataset.add_file(dv_path=".", local_path="my.file")
dataset.add_directory(dirpath="./my/dir")

Step 5: Upload your data

Finally, you can upload metadata and files using the upload method of your dataset instance. Here you specify the target Dataverse Collection to which the dataset will be added.

dataset.upload(dataverse="myCollection")

🚨 Important note

EasyDataverse infers the DATAVERSE_URL and DATAVERSE_API_TOKEN from your environment variables to prevent accidental credential uploads. You can set them as follows:

export DATAVERSE_URL="https://my.dataverse.installation"
export DATAVERSE_API_TOKEN="your-token-to-access"
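
Alternatively, you can set both variables from within Python, for instance in a notebook session. A minimal sketch using only the standard library:

import os

# Equivalent to the shell exports above
os.environ["DATAVERSE_URL"] = "https://my.dataverse.installation"
os.environ["DATAVERSE_API_TOKEN"] = "your-token-to-access"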

🛸 Dataset download

Programmatic

To download datasets programmatically from a Dataverse installation, EasyDataverse offers two options: the Dataset methods from_dataverse_doi and from_url, which fetch metadata as well as files from any installation.

from easyDataverse import Dataset

dataset = Dataset.from_url("https://my.dataverse.installation/link/to/dataset")

# or

dataset = Dataset.from_dataverse_doi(
  doi="doi:my_persistent_id",
  dataverse_url="https://my.dataverse.installation"
)

If you'd like to fetch the metadata of a dataset without downloading its files, pass download_files=False to either method.
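
For example, a metadata-only fetch re-using the call from above:

dataset = Dataset.from_dataverse_doi(
  doi="doi:my_persistent_id",
  dataverse_url="https://my.dataverse.installation",
  download_files=False,
)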

EasyDataverse infers the metadata schemas from the installation's REST API and generates the corresponding classes in memory. Thus, you can handle fetched datasets the same way as with any generated API. For instance, you can edit a fetched dataset and upload it to any other installation.
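
A minimal sketch of such a round trip, assuming the fetched dataset exposes its metadatablocks as attributes (the citation attribute below is an assumption) and that the target installation's credentials are set via the environment variables described above:

from easyDataverse import Dataset

dataset = Dataset.from_url("https://my.dataverse.installation/link/to/dataset")

# Edit the in-memory citation block via attribute assignment
# (assumes the block is exposed as dataset.citation)
dataset.citation.title = "My Revised Title"

# Upload the edited dataset to a collection of the target installation
dataset.upload(dataverse="myCollection")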

🚨 Important note

Please note that, due to limitations of Dataverse, fetched metadatablocks will only contain the fields that were actually used in the dataset. For complete blocks, consider using the installation's generated API.

Using the command line interface

You can also use the command line interface to fetch data from a Dataverse installation. For this, you only need to provide the URL of the dataset to the following command:

~ dataverse fetch https://my.dataverse.installation/link/to/dataset

🚀 Source code publication

EasyDataverse allows you to seamlessly push code from a local or remote repository to a Dataverse installation. This can be used in workflows triggered by events such as a release to publish your code automatically. To do so, use the dedicated command line interface:

~ dataverse push --lang Python --dataverse MyDataverse --lib-name pyDaRUS

For this, you need to specify the following:

  • --lang - Programming language used; this helps the parser infer dependencies.
  • --dataverse - Target Dataverse Collection to which the code will be pushed.
  • --lib-name - Name of the API used to access the Dataverse installation. This is necessary to match the metadata configuration.
  • --token - API token used for authorization. Can also be inferred from the environment variables.
  • --url - URL of the Dataverse installation. Can also be inferred from the environment variables.

📖 Documentation and more examples

🚧 Under construction 🚧

✍️ Authors

  • Jan Range (EXC2075 SimTech, University of Stuttgart)

⚠️ License

EasyDataverse is free and open-source software licensed under the MIT License.