Provide a better mechanism for external GenStages to participate in Flows
CptnKirk opened this issue · 10 comments
https://elixirforum.com/t/flow-into-from-genstage/14262
In the forum thread above we discuss the desire to bring externally written, demand-aware components to Flows. We like the Flow orchestration, but the current from_stage/into_stages functionality doesn't quite work.
tl;dr
Stitching multiple flows together via from_stage, into_stages doesn't provide the same behavior as having a GenStage producer_consumer acting as a function within the scope of an overall flow.
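To make the limitation concrete, the stitching being discussed looks roughly like the sketch below. MyProducerConsumer is a hypothetical GenStage module, and the from_stage/into_stages calls reflect the thread-era Flow API, so exact signatures may differ:

```elixir
# An externally written producer_consumer stage (hypothetical module).
{:ok, pc} = GenStage.start_link(MyProducerConsumer, :ok)

# Upstream flow terminates into the external stage...
{:ok, _pid} =
  Flow.from_enumerable(1..100)
  |> Flow.map(&(&1 + 1))
  |> Flow.into_stages([pc])

# ...and a *separate* flow must be started to consume from it.
Flow.from_stage(pc)
|> Flow.map(&(&1 * 2))
|> Flow.run()
```

The result is two independent flows glued together by a process boundary, rather than one flow in which the producer_consumer participates like any other step.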
We're currently discussing possible naming and behavior in the thread.
Flow.into_producer_consumer/2
Flow.start_producer_consumer
Flow.start_stage?
Maybe we should look at the overall picture for naming and purpose. What is the end goal of the Flow library? Feature parity with Akka Streams? I.e., support for full graphs including cycles, multiple input/output components, and direct vs. async flow invocation.
Akka Streams has a dizzying array of functionality and they're probably on their 8th generation at this point (including many failed starts and iterations).
First, is GenStage Flow's (only) component model? Internally, Flow isn't implemented purely as a concatenation of GenStages. Should GenStage be the means by which third parties add additional pseudo-DSL support to Flow? Is it Flow's primary component architecture?
GenStage is nice and it brings reactive manifesto concepts to Elixir. But I'm not sure it's the best component model for Flow. Elixir gets a lot of mileage from Enumerable and Collectable protocols, allowing for a rich composition of functions and linear data flows around them.
Now we're looking to provide the next generation of data flow concepts supporting non-linear flows, concurrent parallel flows, all while supporting non-blocking backpressure. I think Flow's component model probably ought to be protocol-based, with the Flow library being primarily responsible for materializing flow resources and coordinating the asynchronous backpressure aspects. Akka is famous for its Actor System; however, the Akka Streams component model isn't actor-based.
Flow needs to incorporate GenStages better. But how it does so should be driven by GenStage's desired place within Flow. This was a long way around, but how GenStages ought to fit into Flow will influence the API that supports them, since the semantics around the function matter.
Calls could be:
Flow.add_stage/3 # (flow, stage, options)
Flow.add_and_start_stage/3
Flow.add_graph(stage_to_graph(stage))
Flow.add_producer_consumer/3
Flow.add_and_start_producer_consumer/3
My thinking is that while Flow may ultimately start stages, the public API wouldn't include any start_stage-like APIs. Flow seems to take a lifted approach, where it builds up a blueprint and activates it at the end. Most starting should happen at the end, when Flow goes to execute that blueprint.
If already-started GenStages need to be incorporated into a Flow, that's OK: a GenStage-aware proxy component can be used within the Flow blueprint and do the appropriate thing when that blueprint is materialized and executed. Flow might have helper functions that assist with this, but I wouldn't expect it to be part of the primary DSL.
The goal of Flow is not parity with Akka Streams; it is closer to something like Apache Spark, but focused on concurrency on a single node (at least for now). What you describe should probably be a separate project with its own goals and ideas. I think the APIs proposed earlier (start_producers/from_producers) and similar fit well into Flow, because at least you can keep the supervision tree in a single place instead of scattering it around. But we don't plan to go anywhere beyond that.
Just to be super clear, I think all of this is outside of Flow's scope:
Now we're looking to provide the next generation of data flow concepts supporting non-linear flows, concurrent parallel flows, all while supporting non-blocking backpressure. I think Flow's component model probably ought to be protocol-based, with the Flow library being primarily responsible for materializing flow resources and coordinating the asynchronous backpressure aspects. Akka is famous for its Actor System; however, the Akka Streams component model isn't actor-based.
Flow is mostly about focusing on the data and not about the graph. You partition because the data requires it, not because of the graph or because of back-pressure.
Ok. Makes sense. I look forward to start_stage/through_stage or whatever APIs then.
Is a more general stream processing library something the core team is interested in looking into? Seems that a Spark implementation should ultimately be built on top of that foundation.
I have started this. I decided to keep the _stages naming. We will have:
- from_stages (producer, producer_consumer), through_stages (producer_consumer), and into_stages (producer_consumer, consumer)
- from_specs (producer, producer_consumer), through_specs (producer_consumer), and into_specs (producer_consumer, consumer)
The former receive already-running processes; they are already in master. The latter will receive supervisor child specs, so we start those processes as part of the flow.
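A sketch of how the two families might be used side by side. Counter is a hypothetical GenStage producer module, and the signatures follow the master branch described above, so details may differ:

```elixir
# Already-running stages: start the process yourself and pass its pid.
{:ok, producer} = GenStage.start_link(Counter, 0)

[producer]
|> Flow.from_stages()
|> Flow.map(&(&1 * 2))
|> Flow.run()

# Child specs: Flow starts (and supervises) the stage as part of the flow.
Flow.from_specs([{Counter, 0}])
|> Flow.map(&(&1 * 2))
|> Flow.run()
```

The difference is ownership: with from_stages the caller is responsible for the producer's lifecycle and supervision; with from_specs the flow's own supervision tree takes that over.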
Thoughts?
/cc @lackac
@josevalim I like the semantics of the naming. Simple, in line with the current scheme, yet expressive.
How are you planning to supervise the specs? If I'm not mistaken, there's currently a Coordinator process which is not a supervisor, but which start_links one. This then supervises all GenStage processes. Would the specs given to from_specs, through_specs, and into_specs be on the same level? In what order are they started?
There might be simple answers to these, but when I did this manually I found that it wasn't that easy to figure out the right order. In any case, it would be great to include this information in the docs for the new functions. I like the detail you put in the docs of through_stages and into_stages.
Good progress! I didn't expect you to have time for this while dealing with the Elixir 1.7 release. :)
@lackac the specs are started under the same supervision tree as the flow (the one under the GenServer), and the stages are always started in order. So producers -> producer_consumers -> producer_consumers -> producer_consumers -> producer_consumers -> consumers.
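Under those semantics, a flow such as the one below (Producer and Consumer are hypothetical modules, and the exact shape of the into_specs consumer tuples is my assumption) would start the Producer specs first, the flow's internal stages next, and the Consumer specs last:

```elixir
# Producers are started first, under the flow's supervision tree...
Flow.from_specs([{Producer, []}])
|> Flow.map(&process/1)
# ...and consumers last, each given as {child_spec, subscription_opts}.
|> Flow.into_specs([{{Consumer, []}, []}])
```

Starting in demand order (producers before consumers) means every stage has something to subscribe to by the time it comes up, which is exactly the ordering that is hard to get right with hand-rolled supervision.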
I have just pushed the code and the docs, so reviews and pull requests for any area of improvement are welcome. :) I will cut a new release tomorrow. Thanks for the review so far!
@josevalim sorry, I was offline for most of today. For what it's worth, the changes up to through_stages looked good to me on paper, but I haven't had a chance to try them yet. I'm going to work on a project that uses Flow again tomorrow, so I will have a chance to test both sets of functions. I'm planning to settle on using from_specs and into_specs instead of the hand-rolled supervision we're doing now. I'll report back after that.