Konduit Serving is a serving system and framework focused on deploying machine learning pipelines to production. The core abstraction is the "pipeline step": an individual step performs one task as part of using a machine learning model in a deployment. These steps generally include:
- Pre- or post-processing steps
- One or more machine learning models
- Transforming the output into a form that humans can understand, such as class labels in a classification example.
For instance, to run arbitrary Python code for preprocessing, you can use a PythonPipelineStep. To perform inference on a (mix of) TensorFlow, Keras, DL4J or PMML models, use a ModelPipelineStep. Konduit Serving also contains functionality for other preprocessing tasks, such as DataVec transform processes or image transforms.
Konduit Serving was built with the goal of providing proper low-level interop with native math libraries such as TensorFlow and DL4J's core math library, libnd4j.
At the core of the pipelines are the JavaCPP Presets, Vert.x, and Deeplearning4j for running Keras models in Java.
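For example, the Keras support mentioned above comes from Deeplearning4j's Keras model import. Outside of Konduit Serving, loading a saved Keras model in Java looks roughly like this (the file path is a placeholder, and this sketch assumes a Sequential-API model saved as HDF5):

```java
import org.deeplearning4j.nn.modelimport.keras.KerasModelImport;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class KerasImportSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder path; assumes a Keras Sequential model saved in HDF5 format.
        MultiLayerNetwork model =
                KerasModelImport.importKerasSequentialModelAndWeights("/path/to/model.h5");

        // Run a single forward pass on a dummy input (the shape is model-specific).
        INDArray input = Nd4j.zeros(1, 784);
        INDArray output = model.output(input);
        System.out.println(output);
    }
}
```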
Konduit Serving (like other libraries with a similar concept, such as Seldon or MLflow) provides building blocks for developers to write their own production ML pipelines, from pre-processing to model serving, exposed as a simple REST API.
Combining JavaCPP's low-level access to C-like APIs from Java with Java's robust server-side application development stack (Vert.x on top of Netty) gives access to fast native math code in production while minimizing the attack surface that native code adds, mainly in server-side networked applications.
This allows things like zero-copy memory access of NumPy arrays or Arrow records for consumption straight from the server, without copy or serialization overhead.
For deep learning, we can handle proper inference on the GPU (batching large workloads).
Extending that to the Python SDK, we know when to return a raw Arrow record and hand it back as a pandas DataFrame.
We also strive to provide a Python-first SDK that makes it easy to integrate pipelines into a Python-first workflow.
Optionally, for the Java community, a Vert.x-based model server and pipeline development framework provide a thin abstraction that is embeddable in a Java microservice.
We also want to expose modern standards for monitoring everything from your GPU to your inference time.
Visualization can happen with applications such as Grafana, or anything else that integrates with the Prometheus standard for visualizing data.
Finally, we aim to provide integrations with more enterprise platforms typically seen outside the big data space.
See the python subdirectory for our Python SDK.
Upon startup, the server loads a config.json or config.yaml file specified by the user. If the user specifies a YAML file, it is converted to a config.json that is then loaded by Vert.x.
This gets loaded into an InferenceConfiguration, which simply contains a list of pipeline steps; how each step is configured depends on its implementation.
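As a rough illustration of that structure, the sketch below assembles a configuration in Java. The class names come from this document, but the package paths, builder methods and field names are assumptions for illustration, not the exact API:

```java
// Sketch only: package paths and builder/field names are assumptions for
// illustration, not the exact Konduit Serving API.
import ai.konduit.serving.InferenceConfiguration;        // assumed package
import ai.konduit.serving.config.ServingConfig;          // assumed package
import ai.konduit.serving.pipeline.PythonPipelineStep;   // assumed package

public class ConfigSketch {
    public static void main(String[] args) {
        // Server-level settings: which port and host to bind.
        ServingConfig servingConfig = ServingConfig.builder()
                .httpPort(9090)            // hypothetical field: port the server listens on
                .listenHost("localhost")   // use 0.0.0.0 to listen on the public internet
                .build();

        // One pipeline step that runs arbitrary Python code for preprocessing.
        PythonPipelineStep preprocess = PythonPipelineStep.builder()
                .pythonCode("y = x / 255.0")   // hypothetical field: Python executed per request
                .build();

        // The InferenceConfiguration ties the serving config and the ordered steps together.
        InferenceConfiguration config = InferenceConfiguration.builder()
                .servingConfig(servingConfig)
                .step(preprocess)
                .build();

        // The same structure can be written out as config.json / config.yaml
        // and handed to the server at startup.
        System.out.println(config);
    }
}
```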
A partial (but not exhaustive) list of possible implementations can be found here.
An individual agent is a Java process that gets managed by a KonduitServerMain instance.
Outside of the pipeline components themselves, the main configuration is a ServingConfig, which contains information such as the port to start the server on and the host to listen on (default: localhost).
If you want your model server to listen on the public internet, use 0.0.0.0 instead.
Port configuration varies with your type of packaging; in Docker, for example, the specific port may not matter because Docker maps it for you.
From there, your pipeline may run into issues such as memory pressure or warm-up time. When dealing with either, there are generally a few considerations:
- Warmup time for Python scripts: sometimes your Python script may require warming up the interpreter. In short, depending on what your script does, you may want to send a warmup request to your application before expecting normal performance (see the sketch after this list).
- Python path: when using the Python step runner, an additional Anaconda distribution may be required for custom Python script execution. An end-to-end example can be found in the docker directory.
- Monitoring: your server has a built-in /metrics endpoint that can be polled by Prometheus, or anything else that can parse the Prometheus format.
- A PID file automatically gets written upon startup. Override the location with --pidFile=....
- Logging is done via Logback. Depending on your application, you may want to override how logging works. This can be done by overriding the default logback.xml file.
- Configurations can be downloaded from the internet! Vert.x supports different configuration providers; HTTP (without auth) and file are supported by default. For more on this, please see the official Vert.x docs and bundle your custom configuration provider within the built uber JAR. If your favorite configuration provider isn't supported, please file an issue.
- Timeouts: sometimes work execution may take longer than expected. If this is the case, please consider looking at the --eventLoopTimeout and --eventLoopExecutionTimeout arguments.
- Other Vert.x arguments: because this is a Vert.x application at its core, other Vert.x JVM arguments will also work. We specify a few that are important for this specific application (such as file upload directories for binary files) in KonduitServerMain, but allow other Vert.x startup arguments as well.
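As mentioned in the warmup item above, a warmup request is just an ordinary request sent before real traffic arrives. Below is a minimal sketch using the JDK's built-in HTTP client (Java 11+); the endpoint path and payload are assumptions and will depend on your pipeline:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class WarmupRequest {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Hypothetical inference endpoint and dummy payload; adjust to your pipeline.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9090/predict"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString("{\"input\": [0.0, 0.0, 0.0]}"))
                .build();

        // Fire one or more throwaway requests so the Python interpreter, JIT and any
        // model caches are warm before measuring latency or serving real traffic.
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Warmup response status: " + response.statusCode());
    }
}
```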
For your specific application, consider using the built-in monitoring capabilities for both CPU and GPU memory to identify what your ideal pipeline configuration should look like under load.
The core intended workflow is a few simple steps:
- Configure a server, setting up:
  a. the InputTypes of your pipeline
  b. the OutputTypes of your pipeline
  c. a ServingConfiguration containing things like host and port information
  d. a series of PipelineSteps that represent what a deployed pipeline should perform
- Configure a client to connect to the server.
Dependencies:
In order to build pipelines, you need to configure:
- Chip (-Dchip=YOURCHIP)
- OS (-Djavacpp.platform=YOUR PLATFORM)
- Way of packaging (-P)
- Modules to include for your pipeline steps (-P)
-D is a JVM argument and -P is a Maven profile. Below we specify the requirements for each configuration.
Konduit Serving can run on a wide variety of chips including:
- ARM (experimental): -Dchip=arm
- Intel/x86: -Dchip=cpu
- CUDA: -Dchip=gpu
Supported operating systems include:
- Linux
- Mac
- Windows
Untested but should work (please let us know if you would like to try setting this up!):
- Android
- iOS via Gluon
Packaging pipelines for a particular operating system typically depends on the target system's supported chips. For example, we can target Linux with an ARM or Intel architecture.
JavaCPP's platform classifier will also work, depending only on the targeted chip. For these concerns, we introduced the -Dchip=gpu/cpu/arm argument to the build. This is a thin abstraction over JavaCPP's packaging that handles targeting the right platform automatically.
To further thin out other binaries that may be included (such as OpenCV), we may use -Djavacpp.platform directly. This approach is mainly tested with Intel chips right now; for other chips, please file an issue.
These arguments are as follows:
- -Djavacpp.platform=windows-x86_64 (Windows)
- -Djavacpp.platform=linux-x86_64 (Linux)
- -Djavacpp.platform=mac-osx-x86_64 (macOS)
Specifying this can reduce the JAR size quite a bit; otherwise you end up with extra operating-system-specific binaries in the JAR. Initial feedback via GitHub issues is much appreciated!
Konduit Serving packaging works by including all of the needed dependencies for the profiles/modules selected for inclusion in the package. The output size of the binary depends on a few core variables:
- The javacpp.platform JVM argument
- The modules included, selected via Maven profiles (the modules are described below)
Many of the packaging options depend on the konduit-serving-distro-bom (pipelines bill of materials) module. This module contains all of the module-inclusion behavior and all of the various dependencies that end up in the output.
All of the packaging options rely on building an uber JAR and then packaging it in the appropriate platform-specific way.
- Standard uber JAR: -Puberjar
- Debian/Ubuntu: -Pdeb
- RPM (CentOS, RHEL, openSUSE, ...): -Prpm
- Docker: -Pdocker
- WAR file (Java servlet application servers): -Pwar
- TAR file: -Ptar
- Kubernetes: see the helm charts directory for sample charts on building a pipelines module for Kubernetes.
For now, there are no hosted packages beyond what is available via pip. Hosted repositories for the packaging formats listed above will be published later.
- Python support: -Ppython
- PMML support: -Ppmml
To configure pipelines for your platform, use a Maven-based build profile. An example build targeting CPU:
./mvnw -Ppython -Ppmml -Dchip=cpu -Djavacpp.platform=windows-x86_64 -Puberjar clean install -Dmaven.test.skip=true
This will automatically download and set up a pipelines uber JAR file (see the uber-jar subdirectory) containing all dependencies needed to run the platform.
The output will be in the target directory of whatever packaging mechanism you specify (Docker, TAR, ...).
For example, if you build an uber JAR, you need to use the -Puberjar profile, and the output will be found in model-server-uber-jar/target.
Konduit Serving supports customization in two ways: Python code, or implementing your own PipelineStep via a CustomPipelineStep and an associated PipelineStepRunner in Java.
Custom (Java) pipeline steps are generally recommended for performance reasons, but there is nothing wrong with starting with just a Python step; depending on your scale, the difference may not matter.
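Very roughly, a custom Java step points at a runner class that transforms the data for each request. The class below is a hypothetical sketch for illustration only; the method names and signatures are assumptions, not the real PipelineStepRunner interface:

```java
// Hypothetical sketch: NOT the real PipelineStepRunner interface, just an
// illustration of the kind of hook a custom Java step provides.
import org.nd4j.linalg.api.ndarray.INDArray;

public class ScaleRunner {

    // Called for each request with the step's input arrays; returns the transformed arrays.
    // In the real API the input/output types and lifecycle methods may differ.
    public INDArray[] run(INDArray[] inputs) {
        INDArray[] outputs = new INDArray[inputs.length];
        for (int i = 0; i < inputs.length; i++) {
            outputs[i] = inputs[i].div(255.0);   // example transform: scale pixel values
        }
        return outputs;
    }

    // Release any native or pooled resources when the server shuts down.
    public void close() {
    }
}
```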
Running multiple versions of a pipeline server under an orchestration system, with load balancing and so on, relies heavily on Vert.x functionality; Konduit Serving itself is fairly small in scope right now.
Vert.x has support for many typical clustering patterns, such as an API gateway or circuit breaker.
Depending on what users are looking to do, we could support some built-in patterns in the future (for example, basic load-balanced pipelines).
Vert.x itself allows for different patterns that could be implemented either in Vert.x or in Kubernetes.
Cluster management is also possible using one of several cluster node managers, which provide a concept of node membership. Communication between nodes or processes happens over the Vert.x event bus. Examples of how to send messages between instances can be found here.
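As a minimal illustration of the event bus (this uses the standard Vert.x API; the address name is just a placeholder, and a real multi-node setup would create a clustered Vertx instance via a cluster manager):

```java
import io.vertx.core.Vertx;

public class EventBusSketch {
    public static void main(String[] args) {
        Vertx vertx = Vertx.vertx();

        // Register a consumer on an address. In a clustered deployment this
        // consumer could live in a different JVM joined via a cluster manager.
        vertx.eventBus().consumer("pipeline.status", message ->
                System.out.println("Received: " + message.body()));

        // Send a message to that address from anywhere that has a Vertx reference.
        vertx.eventBus().send("pipeline.status", "model reloaded");
    }
}
```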
A recommended architecture for fault tolerance is an API gateway plus load balancer setup with multiple instances of the same pipeline behind a named endpoint. That named endpoint represents a load-balanced pipeline instance where one of many pipelines may be served.
In a proper cluster, you would address each instance (an InferenceVerticle, in this case representing a worker)
as: /pipeline1/some/inference/endpoint
For configuration, we recommend versioning all of the assets needed alongside the config.json, for example in a bundle from which you can download each versioned asset with its associated configuration and model and start the associated instances.
Reference KonduitServingMain for an example of the single-node use case.
We will add clustering support based on these ideas at a later date. Please file an issue if you have specific questions about setting up a cluster.
Every module in this repo is Apache License 2.0, except for konduit-serving-pmml, which is AGPL to comply with the JPMML license.