mrpaulandrew/procfwk

Allow multiple CurrentExecution to be managed by framework?

htwashere opened this issue · 10 comments

Hi, currently the framework can only handle one collection of pipelines (procfwk.Pipelines) and one execution (CurrentExecution) at any single time. It would be ideal if a higher-level grouping were allowed, so that we could define multiple sets of pipelines and dependencies, submit them as separate jobs, and have the framework track multiple jobs via procfwk.CurrentExecution. Thank you.

That would be super useful for my current client too

Hey, thanks for your feedback.

@htwashere @NJLangley

Could you please help me in exploring this idea further...

Let's say:

  • A new execution is triggered. It starts processing and running worker pipelines A, B and C.
  • Then a second new execution is triggered which is going to run concurrently.

How would the second execution know which workers it is responsible for?...

  1. Using some custom logic in a precursor to enable/disable workers?
  2. Have the second execution detect what the first execution is already running and ignore those workers?

Or are we saying we want the same workers to be called a second time (even if they haven't completed as part of the first run) because we want to pass different parameters to the pipelines?

I'd be really keen to collaborate on the design here and better understand the use cases.

Many thanks

Hi @mrpaulandrew

In our situation we want to be able to run different executions, usually of separate, unrelated pipelines, simultaneously. Some of the reasons for this approach are:

  • A significant number of modules designed to run independently. We don't want a framework ADF instance and database per module, to avoid infrastructure sprawl.
  • Lots of data sources with different delivery times/cadences. Some sources have tight timings to get loads done, so we need things to run as soon as the data is available.
  • Changing the metadata for the load process, even if done via an automated process before an execution, changes the ETL definition from what is in source control. Having multiple defined batches solves this issue.

I have just implemented this as a PoC with my current client and will get the code on my fork in the next few days.

The way we have implemented it is by adding a new concept of a batch, sitting over the top of the stages/pipelines. We have a new table that maps the stages and pipelines onto batches to allow reuse of pipelines, e.g. for subtly different daily/weekly/monthly loads. This allows running two (or more) batches at the same time. We have added an extra check to ensure that a new batch cannot be started if it shares a common pipeline with an already executing batch.
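
Roughly, the shape of it is something like the sketch below. All object names here are illustrative only (a hypothetical [procfwk].[Batches] table, a [procfwk].[BatchPipelineLink] mapping table, and a BatchId column on [procfwk].[CurrentExecution]), not the actual code going onto my fork:

```sql
-- Hypothetical sketch: a batch concept sitting above the stages/pipelines.
-- All object names are illustrative, not the framework's actual schema.
CREATE TABLE [procfwk].[Batches]
(
    [BatchId] INT IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    [BatchName] NVARCHAR(100) NOT NULL UNIQUE, -- e.g. 'Daily', 'Weekly'
    [Enabled] BIT NOT NULL DEFAULT 1
);

-- Many-to-many link so the same pipeline can be reused by several batches.
CREATE TABLE [procfwk].[BatchPipelineLink]
(
    [BatchId] INT NOT NULL REFERENCES [procfwk].[Batches] ([BatchId]),
    [PipelineId] INT NOT NULL REFERENCES [procfwk].[Pipelines] ([PipelineId]),
    PRIMARY KEY ([BatchId], [PipelineId])
);

-- Guard clause: refuse to start a batch that shares a pipeline with a
-- batch that is already executing. Assumes a hypothetical [BatchId]
-- column added to [procfwk].[CurrentExecution].
DECLARE @NewBatchId INT = 1; -- the batch the parent pipeline has requested

IF EXISTS
(
    SELECT 1
    FROM [procfwk].[BatchPipelineLink] requested
    INNER JOIN [procfwk].[BatchPipelineLink] running
        ON requested.[PipelineId] = running.[PipelineId]
    INNER JOIN [procfwk].[CurrentExecution] ce
        ON running.[BatchId] = ce.[BatchId]
    WHERE requested.[BatchId] = @NewBatchId
        AND running.[BatchId] <> @NewBatchId
)
    RAISERROR('Batch shares a pipeline with an already executing batch.', 16, 1);
```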

Hi @mrpaulandrew @NJLangley: In our case, whenever possible, we want to create a set of re-usable/generic level-100 "worker pipelines". These workers perform the detailed work, but we make use of ADF features such as dynamic linked services so that we can inject the detailed connectivity, Databricks notebook name, etc. Because they are shareable, they are called by various higher-level jobs. We set up a concept of a "Container" that allows us to specify a different mix of the worker pipelines on an as-needed basis, and the Container becomes the job to be started by your 01-Grandparent or 02-Parent pipelines. In effect, we have created a many-to-many relationship between Containers and Pipelines. I'll be happy to share this solution with you if you're interested. Cheers. Henry
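
Roughly sketched in SQL (all names illustrative only, not our exact implementation), resolving a Container into its worker pipelines plus the values to inject looks something like:

```sql
-- Hypothetical many-to-many resolution: list the worker pipelines a
-- container should run, along with per-container values injected at
-- runtime (connection details, notebook name, etc.). Names illustrative.
DECLARE @ContainerName NVARCHAR(100) = N'DailySales'; -- made-up example

SELECT
    c.[ContainerName],
    p.[PipelineName],
    cpl.[WorkerProperties] -- e.g. JSON holding notebook name, connection, etc.
FROM [procfwk].[Containers] c
INNER JOIN [procfwk].[ContainerPipelineLink] cpl
    ON c.[ContainerId] = cpl.[ContainerId]
INNER JOIN [procfwk].[Pipelines] p
    ON cpl.[PipelineId] = p.[PipelineId]
WHERE c.[ContainerName] = @ContainerName;
```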

@htwashere @NJLangley
Thanks for your comments.

It sounds like you've both addressed this in a similar way (Batch/Containers) as a level above stages. Very cool.

I guess in each case my final question would be: how does the parent pipeline know which Batch/Container it should start an execution run for? Is this simply handled with a parameter passed and set on the parent pipeline trigger?

  • Weekly trigger > Parent > Param = Weekly Batch
  • Daily trigger > Parent > Param = Daily Batch

I'd certainly be interested in seeing both your code implementations if there is an easy way to share them.

Cheers
Paul

@mrpaulandrew I'll try and get our implementation on the fork tomorrow.

Another situation this can be useful for is splitting the processing of large files out from smaller ones when the processing window is limited. For example, if you have a bunch of dimensions and small fact tables, plus one or two large fact tables that take 80% of the total time, you may want the larger ones in their own batch so that the transform, load and serve stages can continue for the smaller files while the extract for the larger files finishes. The later stages for the larger files can check that the data they need for the transforms has loaded before continuing.

In terms of the parent knowing what to do, in our situation we have parameterised the parent to take an identifier for the batch. This is actually a string that is unique across batches (e.g. a web-address-style slug), and we pass it through to the child, infant and worker pipelines too. Passing it along is not strictly required, as it is not used downstream, but it does let us see the batch ID easily from the ADF monitoring GUI.
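
As a very rough sketch (the procedure and column names are hypothetical, not the code going on the fork), the parent's first lookup could be something like:

```sql
-- Hypothetical lookup called at the start of the parent pipeline:
-- resolve the unique batch slug passed in as a pipeline parameter into
-- the batch id recorded against the execution. Names are illustrative.
CREATE PROCEDURE [procfwk].[GetBatchIdFromName]
    @BatchName NVARCHAR(100) -- e.g. the slug set on the trigger
AS
BEGIN
    SET NOCOUNT ON;

    SELECT [BatchId]
    FROM [procfwk].[Batches]
    WHERE [BatchName] = @BatchName
        AND [Enabled] = 1;
END;
```

Passing the same slug down to the child, infant and worker pipelines is then just a matter of mapping the parent's parameter through each Execute Pipeline activity.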

@NJLangley sure, sounds good.

I'm also thinking that, to accompany the batch ID, we should have the concept of worker pipeline parameter sets. This way worker pipelines can be reused across different batches, with the batch ID linking to a parameter set and, in turn, to a set of pipeline parameters.
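
As a rough sketch of the metadata (all object names hypothetical at this stage, building on the batch link table idea above): a batch/pipeline pairing would point at a parameter set, and the set holds the name/value pairs handed to the worker.

```sql
-- Hypothetical parameter-set metadata. A batch/pipeline pairing points
-- at a parameter set; the set holds the worker's name/value pairs.
CREATE TABLE [procfwk].[ParameterSets]
(
    [ParameterSetId] INT IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    [ParameterSetName] NVARCHAR(100) NOT NULL UNIQUE
);

CREATE TABLE [procfwk].[ParameterSetValues]
(
    [ParameterSetId] INT NOT NULL REFERENCES [procfwk].[ParameterSets] ([ParameterSetId]),
    [ParameterName] NVARCHAR(100) NOT NULL,
    [ParameterValue] NVARCHAR(MAX) NULL,
    PRIMARY KEY ([ParameterSetId], [ParameterName])
);

-- Link a worker pipeline to a parameter set per batch, so the same
-- pipeline can run with different parameters in different batches.
ALTER TABLE [procfwk].[BatchPipelineLink]
    ADD [ParameterSetId] INT NULL
        REFERENCES [procfwk].[ParameterSets] ([ParameterSetId]);
```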

Hi @mrpaulandrew: yes, you're correct, we simply added a parameter called "ContainerName" to be passed at the beginning of 02-Parent. We also plan to pass an array of ContainerNames to 01-Grandparent if necessary in the future. Sounds like our approach is very similar to @NJLangley's implementation.

I am less familiar with GitHub forking, but I'll eventually figure out how to send you our implementation. Regards, H.

@htwashere @NJLangley

Thanks again for your feedback, I'm going to take this forward with:
#72

... and extend the framework with the higher-level concept suggested. I like the name Batch.

I'll start working on this maybe tonight.

Cheers

@htwashere @NJLangley just to follow up. The batch execution dev is done. I'm currently testing and will probably release it next week, once the GitHub Pages and documentation are up to date.
Cheers