malariagen/pipelines

Repo file and folder structure

Closed this issue · 3 comments

We anticipate that this repository will store all code artefacts necessary for defining data pipelines, e.g., WDL files, Dockerfiles, test datasets, pipeline documentation, etc. What file and folder structure should we use for organising these? Are there any other conventions we need to agree on?

A related question is how we deal with the platform-agnostic versus platform-specific parts of a pipeline.

E.g., @kbergin and @gbggrant mentioned that the approach taken at Broad is to split each pipeline in two: a core WDL providing the platform-agnostic part, and a wrapper WDL which provides the necessary config and setup for running in a particular compute environment, and which calls the inner WDL.
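To make the core/wrapper split concrete, here is a minimal sketch in WDL 1.0. Everything in it is hypothetical and for illustration only: the file names (`core.wdl`, `wrapper.wdl`), the workflow and task names, the line-counting step, and the `ubuntu:20.04` image are assumptions, not anything taken from Broad's or MalariaGEN's actual pipelines. The core workflow takes its container image as an input, so it hard-codes nothing about the environment:

```wdl
version 1.0

# core.wdl (hypothetical): the platform-agnostic part.
# It makes no assumptions about where it runs; the container
# image is supplied by the caller.
workflow CorePipeline {
  input {
    File input_file
    String docker_image
  }

  call CountLines {
    input:
      input_file = input_file,
      docker_image = docker_image
  }

  output {
    Int n_lines = CountLines.n_lines
  }
}

# Placeholder task standing in for real pipeline steps.
task CountLines {
  input {
    File input_file
    String docker_image
  }

  command <<<
    wc -l < ~{input_file} | tr -d ' '
  >>>

  output {
    Int n_lines = read_int(stdout())
  }

  runtime {
    docker: docker_image
  }
}
```

The wrapper then imports the core workflow and fills in the environment-specific values:

```wdl
version 1.0

# wrapper.wdl (hypothetical): the platform-specific part.
# The import path assumes core.wdl sits next to this file.
import "core.wdl" as core

workflow PlatformWrapper {
  input {
    File input_file
  }

  # Site-specific configuration lives here, not in the core workflow.
  call core.CorePipeline {
    input:
      input_file = input_file,
      docker_image = "ubuntu:20.04"
  }

  output {
    Int n_lines = CorePipeline.n_lines
  }
}
```

With this shape, another team could reuse `core.wdl` unchanged and write only their own wrapper.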

I think the idea for this repo is that we would put the core platform-agnostic parts of pipelines here, with the goal that other teams could re-use them to run malariagen pipelines on their own compute infrastructure. But then where do we put the platform-specific wrappers? Here, or in a different repo?

Straw man proposal:

We keep this repo only for platform-agnostic pipeline code. We should be able to submit pipelines written here to Dockstore.

We create a separate GitHub repo for the Sanger platform-specific pipeline wrapper code, maybe called something like malariagen/pipeline-wrappers. That could be a private repo.

To be figured out: how to conveniently connect the two repos during development. I.e., whoever is doing pipeline development may need to work with both repos at the same time, because a change may need to be made at both levels.

I like this approach. Having platform-agnostic pipeline code would allow us to test the pipeline in other environments, which could speed up development and flush out any unexpected platform dependencies.