This is a proof of concept on making a no-code / low-code experience for Apache Beam and Dataflow.
tl;dr: We grab a declarative (JSON/YAML) representation of an Apache Beam pipeline, and we generate a Dockerfile with everything needed to run the pipeline.
The JSON/YAML representation (low-code) can easily be generated via a graphical user interface (no-code).
All the core Apache Beam transforms would be supported out-of-the-box in the JSON/YAML representation. This includes element-wise transforms, aggregation transforms, windowing, triggers, as well as I/O transforms. It also includes a transform to call user-defined functions, described below.
Custom functions are supported in one or more languages.
For this prototype, we support custom functions in Python only, but any other language could be supported by implementing a local web server (a language server) for it.
All the language servers must have a well-defined input and output format (a minimal server sketch follows this list):
- Each function processes exactly one element.
- Additional arguments can be optionally added.
- Requests and responses are JSON encoded.
- The response is either a value or an error.
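As an illustration, a Python language server could be a small HTTP service along these lines; the JSON field names (`function`, `element`, `args`, `value`, `error`) and the port are assumptions for this sketch, not a defined wire format:

```python
# Minimal sketch of a Python language server for user-defined functions.
# Assumed protocol: one HTTP POST per element with a JSON body
# {"function": ..., "element": ..., "args": {...}}, answered with
# {"value": ...} on success or {"error": ...} on failure.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# User-defined functions would be registered here; `to_upper` is a stand-in.
FUNCTIONS = {
    "to_upper": lambda element, prefix="": prefix + element.upper(),
}


class FunctionHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        request = json.loads(body)
        try:
            func = FUNCTIONS[request["function"]]
            value = func(request["element"], **request.get("args", {}))
            response = {"value": value}
        except Exception as error:  # any failure is reported back, not raised
            response = {"error": str(error)}
        payload = json.dumps(response).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)


if __name__ == "__main__":
    HTTPServer(("localhost", 8080), FunctionHandler).serve_forever()
```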
When a custom function is used in a pipeline, a custom DoFn is used to call it.
Each custom function has a URL through which it's accessible (local or remote).
Each element is encoded into JSON, passed to the custom function server, and the response is JSON decoded.
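A minimal sketch of such a DoFn, assuming the same hypothetical JSON fields as the server sketch above (the prototype's actual DoFn may differ):

```python
# Sketch of a DoFn that forwards each element to a custom function server.
# The URL, JSON field names, and error handling mirror the hypothetical
# protocol above and are assumptions, not the prototype's implementation.
import json
import urllib.request

import apache_beam as beam


class CallCustomFunction(beam.DoFn):
    def __init__(self, url, function, args=None):
        self.url = url            # local or remote server, e.g. "http://localhost:8080"
        self.function = function  # name of the user-defined function to call
        self.args = args or {}    # optional additional arguments

    def process(self, element):
        request = json.dumps(
            {"function": self.function, "element": element, "args": self.args}
        ).encode("utf-8")
        with urllib.request.urlopen(self.url, data=request) as response:
            result = json.loads(response.read())
        if "error" in result:
            raise RuntimeError(result["error"])
        yield result["value"]
```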
The JSON/YAML pipeline file would contain all the necessary information to build the image, except for the user-defined functions' code itself. It would include (see the example after this list):
- All the steps in the pipeline.
- For each user-defined function call: the language, the function name, and any additional arguments.
- A list of requirements for each language used in user-defined functions.
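For illustration, a pipeline specification might look like the following, shown here as the equivalent Python data structure rather than YAML/JSON; every key name and transform name is an assumption, not a defined schema:

```python
# Hypothetical pipeline specification, shown as a Python dict for readability.
# The actual file would be YAML or JSON; every key below is an assumption.
pipeline_spec = {
    "steps": [
        {"transform": "ReadFromText", "args": {"file_pattern": "gs://my-bucket/input/*.txt"}},
        {
            "transform": "CallUserDefinedFunction",
            "language": "python",
            "function": "to_upper",
            "args": {"prefix": "item: "},
        },
        {"transform": "WriteToText", "args": {"file_path_prefix": "gs://my-bucket/output/out"}},
    ],
    # Per-language requirements for the user-defined functions.
    "requirements": {
        "python": ["requests==2.31.0"],
    },
}
```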
The user files would be:
- The JSON/YAML pipeline file.
- All the user-defined functions in their respective languages, with one (public) function per file (see the example function below).
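A user-defined Python function file could then be as small as this (the file name, function name, and optional argument are illustrative):

```python
# to_upper.py -- one (public) user-defined function per file.
# The name and the optional keyword argument are illustrative.
def to_upper(element, prefix=""):
    """Upper-case a single element, optionally adding a prefix."""
    return prefix + element.upper()
```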
The language server files could be provided in Beam released images, one image per language.
The Dockerfile would be a multi-stage build like this:
- Pipeline builder stage
  - Copy or install the pipeline generator and any other requirements.
  - Copy the JSON/YAML pipeline file from the local filesystem.
  - Run the generator, which would create the following files:
    - The `main` pipeline file, with all user-defined functions registered (sketched after this list).
    - A `run` script, which would start all the language servers and then run the Beam worker boot file.
- For each language used in user-defined functions, create a builder stage
  - These could use different base images if needed.
  - Install any required build tools.
  - Copy the user-defined function files from the local filesystem.
  - Copy the language server files from the language server image.
  - Compile any source files for languages that require it.
  - Package the language server with all the user-defined functions.
- Main image
  - Update/install any packages needed, including:
    - Tools/programs needed to run the pipeline itself.
    - Tools/programs needed to run each language server used.
  - Copy the Beam worker boot files from the Beam image.
  - Copy the `main` pipeline file(s) from the pipeline builder stage.
  - Copy the `run` script from the pipeline builder stage.
  - For each language builder stage, copy the packaged language servers.
  - Set the entry point to the `run` script.
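To make the generator's output more concrete, here is a rough sketch of what a generated `main` pipeline file might look like; the spec file name, the transform mapping, and the language server URL are assumptions about the generator, not its actual output:

```python
# Rough sketch of a generated `main` pipeline file. The spec file name, the
# transform mapping, and the language server URL are illustrative assumptions.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

from call_custom_function import CallCustomFunction  # the DoFn sketched earlier (hypothetical module)


def build_step(step):
    """Map one declarative step onto a Beam transform (greatly simplified)."""
    if step["transform"] == "ReadFromText":
        return beam.io.ReadFromText(**step["args"])
    if step["transform"] == "WriteToText":
        return beam.io.WriteToText(**step["args"])
    if step["transform"] == "CallUserDefinedFunction":
        # The language server for step["language"] is assumed to listen locally.
        return beam.ParDo(
            CallCustomFunction("http://localhost:8080", step["function"], step["args"])
        )
    raise ValueError(f"Unsupported transform: {step['transform']}")


def main():
    # The generator would embed or copy the declarative spec next to this file.
    with open("pipeline.json") as f:
        spec = json.load(f)
    with beam.Pipeline(options=PipelineOptions()) as pipeline:
        pcoll = pipeline
        for i, step in enumerate(spec["steps"]):
            pcoll = pcoll | f"step_{i}_{step['transform']}" >> build_step(step)


if __name__ == "__main__":
    main()
```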
To build the image and run the pipeline:
```sh
# Build the container image with Cloud Build.
export PROJECT=$(gcloud config get-value project)
gcloud builds submit -t gcr.io/$PROJECT/dataflow-no-code build/

# Run the pipeline from the image locally.
docker run --rm -e ACTION=run gcr.io/$PROJECT/dataflow-no-code

# Run the pipeline on Dataflow via the Cloud Build config.
gcloud builds submit --config run-dataflow.yaml --no-source
```