/RUCIO-STAC

RUCIO and STAC Integration Sample

Primary language: Jupyter Notebook. License: MIT.

RUCIO and STAC Metadata Integration

This document summarizes the process of generating STAC Collections from datasets provided in the InterTwin DataLake, which is accessible using rucio. The central idea is to provide a STAC JSON that can be used in downstream analytic pipelines for the different thematic use cases in the InterTwin project. This STAC JSON is expected to contain links, exposed through the STAC, to Cloud-Optimized GeoTIFFs hosted on publicly accessible S3 storage, as well as an alternative link to the original datasets available in the InterTwin DataLake. The steps involved in generating these STAC JSON files and interacting with the DataLake are outlined below.

Main steps

  • Rucio Installation for a Debian-based OS
  • Pre-requisites for accessing datasets in the data lake
  • Downloading specified datasets
  • Generating STAC Collections using Raster2STAC
  • Extending STAC JSON to contain a link to the InterTwin DataLake
  • Load datasets with downstream packages

Rucio Installation for a Debian-based OS

Full documentation on how to interact with the data lake using rucio, together with a detailed introduction to important rucio terminology, is provided here. Nevertheless, we highlight some of the specific requirements for using rucio on a Debian-based OS (such as Ubuntu, which is in use at EURAC Research at the time of writing this document).

In your development environment, install rucio with pip: pip install rucio-clients. This provides both the Rucio Client CLI and the Rucio Client Python API. However, rucio uses Gfal to download and upload data, and Gfal is not compatible with Debian, so you need to run your operations in a containerized environment. The recommended Docker image for interacting with the InterTwin datalake is provided here: dvrbanec/rucio-client:latest, which requires you to mount the configuration file to run effectively.

docker run \
  -v /tmp/rucio.cfg:/opt/rucio/etc/rucio.cfg \
  --name=rucio-client \
  -it -d dvrbanec/rucio-client:latest

The details for setting up the configuration are provided in the next section.

Pre-requisites for accessing datasets in the data lake

In order to access the data in the InterTwin datalake, the following pre-requisites should be met.

  1. Register and request access to the interTwin dev VO (dev.intertwin.eu) with your EGI Check-in credentials here.

    Once signed in with your EGI credentials, go to People --> Enroll. Search for "Join dev.intertwin.eu VO", click "Begin", and request access to the VO from there. Please note that approval depends on the availability of the administrator. See this documentation for more details.

  2. Set up your rucio configuration in rucio.cfg. Here is a sample configuration we used:

    [client]
    rucio_host = https://rucio-intertwin-testbed.desy.de
    auth_host = https://rucio-intertwin-testbed-auth.desy.de
    ca_certs = /etc/ssl/certs/ca-bundle.crt
    # your EGI check-in account name
    account = <YOUR_ACCOUNT_NAME>
    auth_type = oidc
    auth_token_file_path = /tmp/rucio_oauth.token
    oidc_scope = openid profile offline_access eduperson_entitlement
    
    [download]
    transfer_timeout = 3600000
    preferred_impl = xrootd, rclone

  3. Install the necessary certificates to validate rucio access to the InterTwin DataLake; see the compatible files in the provided Dockerfile.

  4. Run your container and start using rucio commands. Remember to also mount the path where you would like your datasets to be stored. The container-side path of a Docker volume must be absolute, so /data_path is used on both sides here:

    docker run \
    -v /tmp/rucio.cfg:/opt/rucio/etc/rucio.cfg \
    -v /data_path:/data_path \
    --name=rucio-client \
    -it -d dvrbanec/rucio-client:latest

    Then use rucio commands:

    docker exec -it rucio-client /bin/bash
    $ rucio ping
    
  5. Authenticate with rucio

    Run rucio whoami and Rucio will give you a link to authenticate yourself. Follow the instructions on the link and at the end you will get a code that you should copy back to Rucio in the terminal. Once you've copied the code back to Rucio, you'll be authenticated to Rucio.
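
The configuration from step 2 can also be generated programmatically, which helps when several users or containers need the same settings. The following is a minimal sketch using only the Python standard library; the function name build_rucio_cfg is hypothetical, and the values are copied from the sample configuration above.

```python
import configparser
import io


def build_rucio_cfg(account: str) -> str:
    """Return the text of a rucio.cfg mirroring the sample configuration above.

    build_rucio_cfg is an illustrative helper, not part of rucio.
    """
    cfg = configparser.ConfigParser()
    cfg["client"] = {
        "rucio_host": "https://rucio-intertwin-testbed.desy.de",
        "auth_host": "https://rucio-intertwin-testbed-auth.desy.de",
        "ca_certs": "/etc/ssl/certs/ca-bundle.crt",
        "account": account,  # your EGI check-in account name
        "auth_type": "oidc",
        "auth_token_file_path": "/tmp/rucio_oauth.token",
        "oidc_scope": "openid profile offline_access eduperson_entitlement",
    }
    cfg["download"] = {
        "transfer_timeout": "3600000",
        "preferred_impl": "xrootd, rclone",
    }
    # Serialize to a string so the caller decides where the file is written.
    buffer = io.StringIO()
    cfg.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    print(build_rucio_cfg("<YOUR_ACCOUNT_NAME>"))
```

Writing the returned text to /tmp/rucio.cfg would match the path mounted into the container in the docker run commands above.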

Downloading specified datasets

To download a specific dataset from the InterTwin data lake, you just need its Data IDentifier (DID), which has the form scope:name, and then run rucio get <DID>.
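
When downloads are driven from a script or notebook, it can help to validate the DID before calling the CLI. The sketch below assumes it runs inside the container, where the rucio command is on the PATH; the helper names split_did, build_get_command, and download are hypothetical, and the only CLI call used is the rucio get command mentioned above.

```python
import shutil
import subprocess


def split_did(did: str) -> tuple[str, str]:
    """Split a DID of the form 'scope:name' into (scope, name)."""
    scope, sep, name = did.partition(":")
    if not sep or not scope or not name:
        raise ValueError(f"expected a DID of the form 'scope:name', got {did!r}")
    return scope, name


def build_get_command(did: str) -> list[str]:
    """Build the argument list for 'rucio get <DID>' after validating the DID."""
    split_did(did)  # raises ValueError on a malformed DID
    return ["rucio", "get", did]


def download(did: str) -> None:
    """Run 'rucio get' for one DID; requires the rucio CLI to be installed."""
    if shutil.which("rucio") is None:
        raise RuntimeError("rucio CLI not found; run this inside the rucio container")
    subprocess.run(build_get_command(did), check=True)
```

For example, download("myscope:myfile.tif") would fetch a hypothetical file DID into the current working directory of the container.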

To perform other functions with rucio, such as uploading files and creating datasets, see the full documentation here and here.

Generating STAC collections using Raster2STAC

See the sample notebook RUCIO_STAC.ipynb for details on generating STAC collections using Raster2STAC.

Extending STAC JSON to contain link to the InterTwin Datalake

Coming soon!
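
Until this section is written, one possible approach is sketched here under stated assumptions: a STAC item is treated as a plain JSON dictionary, and the DataLake location is recorded as an extra entry in its links list with rel set to "alternate". The helper name add_datalake_link, the default title, and the example href are hypothetical; the link format that the project eventually adopts may differ.

```python
import copy


def add_datalake_link(
    stac_item: dict,
    datalake_href: str,
    title: str = "Original dataset in the InterTwin DataLake",
) -> dict:
    """Return a copy of a STAC item dict with an 'alternate' link appended.

    The input dict is left unchanged. Calling this twice with the same
    href does not add a duplicate link.
    """
    item = copy.deepcopy(stac_item)
    links = item.setdefault("links", [])
    already_present = any(
        link.get("rel") == "alternate" and link.get("href") == datalake_href
        for link in links
    )
    if not already_present:
        links.append({"rel": "alternate", "href": datalake_href, "title": title})
    return item


if __name__ == "__main__":
    # Hypothetical item and href, for illustration only.
    example = {"type": "Feature", "id": "example-item", "links": []}
    print(add_datalake_link(example, "https://example.org/datalake/myscope/myfile.tif"))
```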

Load dataset with downstream packages

With tools like odc.stac or stackstac, it should be possible to load the datasets from the STAC JSON into an xarray object. See stackstac and odc.stac.