/sagetasks

Python library for building ETL pipelines involving Synapse and data processing workflows

Primary language: Python. License: Apache License 2.0 (Apache-2.0).

Sage Prefect Tasks

⚠️ Warning: This repository is a work in progress. ⚠️

Python package of useful Prefect tasks for common use cases at Sage Bionetworks.

Some design notes are included under Thoughts, after the demo flow and usage instructions.

Inspired by Pocket/data-flows.

Demo Flow

Demo Usage

Getting access

To run this demo, you'll need the following access:

  • Ask Bruno for edit access to the INCLUDE Sandbox Synapse project.
  • Ask Bruno for edit access to the include-sandbox Cavatica project.

Getting set up

# Create a virtual environment with the Python dependencies
pipenv install

# Copy the example `.env` file and update the auth tokens
cp .env.example .env
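
Before running the demo, it can help to confirm that the auth tokens were actually filled in. The following is a minimal sketch, not part of this package: the helper name `missing_env_vars` is hypothetical, and `SB_AUTH_TOKEN` is the only variable name taken from this README (it is read by the snippet in the Thoughts section). Any other names in your `.env` file would need to be added by hand.

```python
import os


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment.

    Hypothetical helper for illustration; `environ` defaults to os.environ.
    """
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


# SB_AUTH_TOKEN appears in the Thoughts section of this README.
# Extend this list with whatever other variables your `.env` file defines.
REQUIRED = ["SB_AUTH_TOKEN"]
```

Calling `missing_env_vars(REQUIRED)` from a script started with `pipenv run` returns an empty list when every listed variable has a non-empty value, and the names of the missing ones otherwise.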

Run the flow at the command line

You'll need to get set up first.

# Run the demo (pipenv will automatically load the `.env` file)
pipenv run python demo.py

Inspect the flow using the Prefect Server UI

You'll need to get set up first.

# Deploy Prefect Server (Orion)
prefect orion start

# Explore the flow runs in Prefect Server
# Usually hosted at http://127.0.0.1:4200/

# Stop the running server with Ctrl-C

Thoughts

  • The CavaticaBaseTask demonstrates a use case for classes (i.e. extending Task) as opposed to functions (i.e. decorated by @task). On the other hand, SynapseBaseTask doesn't really benefit from the class structure.

  • The SevenBridges Python client embeds the client instance into every resource object, which prevents cloudpickle from serializing these objects (TypeError: cannot pickle '_thread.lock' object). The snippet below clears those client references before pickling:

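    import os
    
    import cloudpickle
    import sevenbridges as sbg
    
    api = sbg.Api(
        url="https://cavatica-api.sbgenomics.com/v2", token=os.environ["SB_AUTH_TOKEN"]
    )
    proj = api.projects.query(name="include-sandbox")[0]
    # Clear every reference to the client so that no thread lock gets pickled
    proj._API = None
    proj._api = None
    proj._data.api = None
    # Named `pickled` rather than `pickle` to avoid shadowing the standard library module
    pickled = cloudpickle.dumps(proj)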
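
The same failure and one possible remedy can be reproduced with the standard library alone. The sketch below is an illustration under stated assumptions, not how the SevenBridges client is implemented: `FakeClient`, `NaiveResource`, `PicklableResource`, and `can_pickle` are all hypothetical names, and the standard `pickle` module stands in for cloudpickle. It shows that an object holding a thread lock cannot be pickled, and that a `__getstate__` hook which drops the client reference avoids the error without mutating the live object.

```python
import pickle
import threading


class FakeClient:
    """Hypothetical stand-in for an API client that holds a thread lock."""

    def __init__(self):
        self._lock = threading.Lock()


class NaiveResource:
    """Hypothetical resource that embeds its client, like the objects above."""

    def __init__(self, name, api):
        self.name = name
        self._api = api


class PicklableResource(NaiveResource):
    """Same resource, but it drops the client reference when pickled."""

    def __getstate__(self):
        # Copy the instance state so the live object keeps its client.
        state = self.__dict__.copy()
        state["_api"] = None
        return state


def can_pickle(obj):
    """Return True if pickle.dumps succeeds, False if it raises TypeError."""
    try:
        pickle.dumps(obj)
        return True
    except TypeError:
        return False
```

With these definitions, `can_pickle(NaiveResource("include-sandbox", FakeClient()))` is `False`, while the `PicklableResource` equivalent is `True`. The restored copy has `_api` set to `None`, so a client would have to be reattached before the copy could make API calls.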

Note

This project has been set up using PyScaffold 4.3. For details and usage information on PyScaffold see https://pyscaffold.org/.