illuin-tech/data-pipeline

API: testability concerns


An important aspect of this library is to promote small functional bricks assembled into horizontal pipeline designs. Testing those bricks and pipelines should therefore happen at two levels:

  • testing of the components (initializer, step, sink, etc.)
  • testing of pipeline configurations (i.e. a scenario: a specific combination of components)

This issue is expected to centralize concerns pertaining to the testing model of data-pipeline implementations.

Initializer

TODO

Step

Technically, a Step is expected to be a simple, "pure" function: it takes an input and produces an output derived from that input.

The problem is that in order to simplify the pipeline geometry, we had to diversify Step signatures to account for a variety of possible scenarios. A Step may need..:

  • the pipeline input
  • the pipeline payload (normalized/derived data from the input)
  • an indexed pipeline object (whole and/or part of the pipeline payload, which is a proxy for more complex pipeline geometry)
  • the pipeline Results view (for accessing current / latest results)
  • the context (for accessing metadata, parent payload)
  • ..several or all of the above

As a result, current Steps have rather convoluted signatures. The library provides for those at runtime, but in a testing scenario developers are left to their own devices: they may have a sample input or payload at hand, but what about the rest?

Maybe we need a test harness component?
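
As a strawman, such a harness could fabricate coherent defaults for the ambient arguments (results view, context) around a single sample input, so that a test only supplies the pieces it actually cares about. Everything below is a hypothetical sketch, not part of the current API; the interfaces are simplified stand-ins for the real ones:

import java.util.Optional;

// Simplified stand-ins for the library interfaces (hypothetical)
interface ResultView
{
    <R> Optional<R> current(Class<R> type);
}

interface Context<P>
{
    Optional<P> parentPayload();
}

@FunctionalInterface
interface Step<I, P, R>
{
    R execute(I input, P payload, ResultView results, Context<P> context);
}

// Hypothetical harness: the test provides an input (and optionally a
// payload), the harness supplies empty-but-coherent defaults for the
// remaining arguments a Step signature may request.
final class StepTestHarness<I, P>
{
    private final I input;
    private P payload;

    private StepTestHarness(I input) { this.input = input; }

    static <I, P> StepTestHarness<I, P> forInput(I input)
    {
        return new StepTestHarness<>(input);
    }

    StepTestHarness<I, P> withPayload(P payload)
    {
        this.payload = payload;
        return this;
    }

    <R> R run(Step<I, P, R> step)
    {
        ResultView emptyResults = new ResultView()
        {
            @Override
            public <T> Optional<T> current(Class<T> type) { return Optional.empty(); }
        };
        Context<P> emptyContext = Optional::empty;
        return step.execute(input, payload, emptyResults, emptyContext);
    }
}

A test would then read StepTestHarness.forInput(myInput).withPayload(myPayload).run(myStep), without ever touching the results or context machinery.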

Notes:

  • When a solution is found, we should remove Output-based signatures from the Step interface; they are a hack for enabling simpler interactions from the outside world, but they expose critical details of how the Output is used to satisfy the implementation (and they are overridable by nature, which is not intended)

Sink

TODO

Pipeline

TODO


Previous comments:


This concern has been greatly impacted by [old_pr], which introduced more free-form step definitions. As a result, what was once something like:

public Result execute(MyInput input, MyPayload payload, MyObject object, ResultView results, Context<MyPayload> context) { /**/ }

..which required complex test harnesses to produce cohesive arguments (the object being part of the payload, the results being a ResultContainer with likely several registered results for the object's thread, etc.)
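
To make that burden concrete, here is roughly what a test had to set up beforehand. Every construction call below is a hypothetical stand-in, since this wiring is normally performed by the pipeline itself:

MyInput input = new MyInput("sample");
MyPayload payload = MyPayload.from(input);               // payload derived from the input
MyObject object = payload.objects().get(0);              // object is a part of the payload
ResultContainer container = new ResultContainer();       // hypothetical constructor
container.register(object, new MyResult("previous"));    // results for the object thread
ResultView results = container.view(object);             // hypothetical view accessor
Context<MyPayload> context = new TestContext<>(payload); // hypothetical context stub

Result result = step.execute(input, payload, object, results, context);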

Instead, the annotation-based API makes it possible to request the bare minimum, e.g.:

public Result execute(MyInput input, @Current MyResult result) { /**/ }

..which is a lot easier to reason about when writing unit tests.
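
For instance, with JUnit 5 and the placeholder types from the snippets above (MyStep being a placeholder class implementing that method), a unit test reduces to a plain method call:

import static org.junit.jupiter.api.Assertions.assertNotNull;

import org.junit.jupiter.api.Test;

class MyStepTest
{
    @Test
    void producesResultFromInputAndCurrentResult()
    {
        MyStep step = new MyStep();
        MyInput input = new MyInput("sample");
        MyResult current = new MyResult("previous");

        // No pipeline machinery required: the annotated method is a plain
        // function of its declared arguments
        Result result = step.execute(input, current);

        assertNotNull(result); // assert on whatever the step should produce
    }
}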


We may still need a set of test harnesses: as it stands, the Step, Sink and Initializer interfaces are not going anywhere, and we may still prefer them in some scenarios.

But it looks like the future of data-pipeline should be heavily skewed towards annotation-based implementations.