Rayvens augments Ray with events. With Rayvens, Ray applications can subscribe to event streams, process and produce events. Rayvens leverages Apache Camel to make it possible for data scientists to access hundreds of data services with little effort.
For example, we can periodically fetch the AAPL stock price from a REST API with code:
source_config = dict(
kind='http-source',
url='https://query1.finance.yahoo.com/v7/finance/quote?symbols=AAPL',
period=3000)
source = rayvens.Stream('http', source_config=source_config)
We can publish messages to Slack with code:
sink_config = dict(kind='slack-sink',
channel=slack_channel,
webhook_url=slack_webhook)
sink = rayvens.Stream('slack', sink_config=sink_config)
We can delivers all events from the source
stream to the sink
using code:
source >> sink
We also process events on the fly using Python functions, Ray tasks, or Ray actors and actor methods for stateful processing. For instance, we can log events to the console using code:
source >> (lambda event: print('LOG:', event))
Rayvens is compatible with Ray 1.3 and up.
Rayvens is intended to run anywhere Ray can. Rayvens is routinely tested on macOS 11 (Big Sur) and Ubuntu 18 (Bionic Beaver). Rayvens is distributed both as a Python package on pypi.org and as a container image on quay.io.
To install the latest Rayvens release run:
pip install rayvens
We recommend cloning this repository to obtain the example programs it offers:
git clone https://github.com/project-codeflare/rayvens
The Rayvens package makes it possible to run Ray programs that leverage Rayvens streams to produce and consume internal events. This package does not install Apache Camel, which is necessary to run programs that connect to external event sources and sinks. We discuss the Camel setup and the Rayvens container image below.
The stream.py file demonstrates an elementary Rayvens program.
import ray
import rayvens
# initialize ray
ray.init()
# initialize rayvens
rayvens.init()
# create a stream
stream = rayvens.Stream('example')
# log all future events
stream >> (lambda event: print('LOG:', event))
# append two events to the stream in order
stream << 'hello' << 'world'
This program initialize Ray and Rayvens and creates a Stream
instance. Streams
and events are the core facilities offered by Rayvens. Streams bridge event
publishers and subscribers.
In this example, a subscriber is added to the stream using syntax stream >> subscriber
. The >>
operator is a shorthand for the send_to
method:
stream.send_to(lambda event: print('LOG:', event))
All events appended to the stream after the invocation of the >>
operator
(or send_to
method) will be delivered to the subscriber. Multiple subscribers
may be attached to the same stream. In general, subscribers can be Python
functions, Ray tasks, or Ray actors. Hence, streams can interface publishers and
subscribers running on different Ray nodes.
A couple of events are then published to the stream using the syntax stream << value
. In contrast to subscribers that are registered with the stream, there is
no registration needed to publish event to the stream.
As illustrated here, events are just arbitrary values in general, but of course
publishers and subscribers can agree on specific event schemas. The <<
operator has left-to-right associativity making it possible to send multiple
events with one statement. The <<
operator is a shorthand for the append
method:
stream.append('hello').append('world')
Conceptually, the append
method adds an event at the end of the stream, just
like the append
method of Python lists. But in contrast with lists, a stream
does not persist events. It simply delivers events to subscribers as they come.
In particular, appending events to a stream without subscribers (and without an
operator, see below) is a no-op.
Run the example program with:
python rayvens/examples/stream.py
(pid=37214) LOG: hello
(pid=37214) LOG: world
Observe the two events are delivered in order. Events are delivered to function and actor subscribers in order, but task subscribers offer no ordering guarantees. See the function.py, task.py, and actor.py examples for details.
The <<
and >>
operator are not symmetrical. The send_to
method (resp. >>
operator) invokes its argument (resp. right-hand side) for every event appended
to the stream. The append
method and <<
operator only append one event to
the stream.
Under the hood, streams are implemented as Ray actors. Concretely, the Stream
class is a stateless, serializable, wrapper around the StreamActor
Ray actor
class. All rules applicable to Ray actors (lifecycle, serialization, queuing,
ordering) are applicable to streams. In particular, the stream actor will be
reclaimed when the original stream handle goes out of scope.
The configuration of the stream actor can be tuned using actor_options
:
stream = rayvens.Stream('example', actor_options={num_cpus: 0.1})
For convenience, most methods of the Stream
class including the send_to
method encapsulate the remote invocation of the homonymous StreamActor
method
and block until completion using ray.get
. The append
method is the
exception. It returns immediately. Nevertheless, Ray actor's semantics
guarantees that sequences of append
invocations are processed in order.
For more control, it is possible to invoke methods directly on the stream actor, for example:
stream.actor.send_to.remote(lambda event: print('LOG:', event))
Rayvens uses Camel-K. to interact with a wide range of external source and sink types such as Slack, Cloud Object Storage, Telegram or Binance (to name a few). Camel-K augments Apache Camel's extensive component catalog with support for Kubernetes and serverless platforms. Rayvens is compatible with Camel-K 1.3 and up.
To run Rayvens programs including Camel sources and sinks, there are two choices:
- local mode: run a Camel source or sink in the same execution context as the stream actor it is attached to using the Camel-K client: same container, same virtual or physical machine.
- operator mode: run a Camel source or sink inside a Kubernetes cluster relying on the Camel-K operator to manage dedicated Camel pods.
The default mode is the local mode. The mode can be specified when initializing Rayvens:
rayvens.init(mode='operator')
The mode can also be specified using environment variable RAYVENS_MODE
.
The mode specified in the code (if any) takes precedence.
Local mode is intended to permit running Rayvens anywhere Ray runs: on a developer laptop, in a virtual machine, inside a Ray cluster (running on Kubernetes or OpenShift for example), or standalone.
Local mode requires the Camel-K
client, Java, and Maven
to be installed in the context in which the source or sink will be run.
When running in a cluster, Java and Maven can be added to an existing Ray
installation or image. The Rayvens image is based on a Ray image onto which we
add the necessary dependencies to enable the running of Camel-K sources and
sinks in local mode inside the container.
The all-in-one Rayvens container image distributed on
quay.io adds Camel-K 1.5.1 to a base
rayproject/ray:1.13.0-py38
image. See Dockerfile.release
for specifics.
Operator mode requires access to a Kubernetes cluster running the Camel-K operator and configured with the proper RBAC rules. See below for details.
At this time, the operator mode requires the Ray code to also run inside the same Kubernetes cluster and requires the Camel-K client to be deployed to the Ray nodes. We intend to lift these restrictions shortly.
Installing and using the Camel-K operator to deploy sources and sinks does not require Java or Maven.
Camel-K is designed to pull dependencies dynamically from Maven Central at run time. While it is possible to preload dependencies to support air-gapped execution environments, Rayvens does not handle this yet.
The Rayvens container image makes it
easy to deploy Rayvens-enabled Ray clusters to various container platforms. The
rayvens-setup.sh script supports several
configurations out of the box: existing Kubernetes and OpenShift clusters,
development Kind cluster, IBM Cloud Code
Engine. This script is distributed as
part of the Rayvens package and should typically have been added to the
executable search path by pip install
. It is self-contained and therefore can
also be obtained directly:
curl -Lo rayvens-setup.sh https://raw.githubusercontent.com/project-codeflare/rayvens/main/scripts/rayvens-setup.sh
The full documentation for the script is available here.
The script is provided for convenience. It is of course possible to setup a Rayvens-enabled Ray cluster directly. We provide an example cluster configuration in cluster.yaml. This configuration file is derived from Ray's example-full.yaml configuration file. The key changes are:
- use of Rayvens container image,
- RBAC enhancements to support the Camel-K operator,
- adjustments to resource requests and limits to account for the needs of Camel in local mode.
The generated and example configuration files also set RAY_ADDRESS=auto
on the
head node, making it possible to run our example codes on the Ray cluster
unchanged.
To test Rayvens on a development Kubernetes cluster we recommend using Kind.
We assume Docker Desktop is installed. We assume Kubernetes support in Docker Desktop is turned off. We assume kubectl is installed. Follow instructions to install the Kind client.
To create a Kind cluster and run a Rayvens-enabled Ray cluster on this Kind cluster, run:
rayvens-setup.sh --kind --registry --kamel
The resulting cluster supports both local and operator modes. The command not only initializes the Kind cluster but also launches a docker registry on port 5000 to be used by the Camel-K operator. To skip the registry and Camel-K setup, run instead:
rayvens-setup.sh --kind
In this configuration, only local mode is supported. See here for details.
The setup script produces a rayvens.yaml
Ray cluster configuration file in the
current working directory. Try running on this cluster with:
ray submit rayvens.yaml rayvens/examples/stream.py
To take down the Kind cluster run:
kind delete cluster
To take down the docker registry run:
docker stop registry
docker rm registry
The source.py example demonstrates how to process external events with Rayvens.
First, we create a stream connected to an external event source:
source = rayvens.Stream('http')
source_config = dict(
kind='http-source',
url='https://query1.finance.yahoo.com/v7/finance/quote?symbols=AAPL',
period=3000)
source.add_source(source_config)
An event source configuration is a dictionary. The kind
key specifies the
source type. Other keys vary. An http-source
periodically makes a REST call to
the specified url
. The period
is expressed in milliseconds. The events
generated by this source are the bodies of the responses encoded as strings.
For convenience, the construction of the stream and addition of the source can be combined into a single statement:
source = rayvens.Stream('http', source_config=source_config)
In this example, we use the http-source
to fetch the current price of the AAPL
stock.
We then implement a Ray actor to process these events:
@ray.remote
class Comparator:
def __init__(self):
self.last_quote = None
def append(self, event):
payload = json.loads(event) # parse event string to json
quote = payload['quoteResponse']['result'][0]['regularMarketPrice']
try:
if self.last_quote:
if quote > self.last_quote:
print('AAPL is up')
elif quote < self.last_quote:
print('AAPL is down')
else:
print('AAPL is unchanged')
finally:
self.last_quote = quote
comparator = Comparator.remote()
This actor instance compares the current price with the last price and prints a message accordingly.
We then simply subscribe the comparator
actor instance to the source
stream.
source >> comparator
By using a Ray actor to process events, we can implement stateful processing and guarantee that events will be processed in order.
The Comparator
class follows the convention that it accepts events by means of
a method named append
. If for instance this method were to be named accept
instead, then we would have to subscribe the actor to the source using syntax
source >> comparator.accept
. In other words, subscribing an actor a
to a
stream is a shorthand for subscribing the a.append
method of this actor to the
stream.
Run the example locally with:
python rayvens/examples/source.py
Run the example on Kind with:
ray submit rayvens.yaml rayvens/examples/source.py
When running in local mode, the Camel-K client has to download and cache dependencies on first run from Maven Central. When running in operator mode, the Camel-K operator is used to build and cache a container image for the source. In both cases, the source may take a minute or more to start the first time. The source should start in matter of seconds on subsequent runs (unless it is scheduled to a different Ray worker in local mode, as the cache is not shared across workers).
Rayvens manages the Camel processes and pods automatically and makes sure to terminate them all when the main Ray program exits (normally or abnormally).
The slack.py builds upon the previous example by pushing the output messages to Slack.
In addition to the same source as before, it instantiates a sink:
sink = rayvens.Stream('slack')
sink_config = dict(kind='slack-sink',
channel=slack_channel,
webhook_url=slack_webhook)
sink.add_sink(sink_config)
For convenience, the construction of the stream and addition of the sink can be combined into a single statement:
sink = rayvens.Stream('slack', sink_config=sink_config)
This sink sends messages to Slack. It requires two configuration parameters that must be provided as command-line parameters to the example program:
- the slack channel to publish to, e.g.,
#test
, and - a webhook url for this channel.
Please refer to the Slack webhooks documentation for details on how to obtain these.
This example program includes a Comparator
actor similar to the previous
example:
@ray.remote
class Comparator:
def __init__(self):
self.last_quote = None
def append(self, event):
payload = json.loads(event) # parse event string to json
quote = payload['quoteResponse']['result'][0]['regularMarketPrice']
try:
if self.last_quote:
if quote > self.last_quote:
return 'AAPL is up'
elif quote < self.last_quote:
return 'AAPL is down'
else:
return 'AAPL is unchanged'
finally:
self.last_quote = quote
comparator = Comparator.remote()
In contrast to the previous example, we don't want to simply print messages to the console from the comparator, but rather to produce a new stream of events transformed by the comparator. To this aim, we construct an operator stream:
operator = rayvens.Stream('comparator')
operator.add_operator(comparator)
or simply:
operator = rayvens.Stream('comparator', operator=comparator)
Like any other stream, this operator stream can receive events and deliver
events to subscribers, but unlike earlier example, it applies a transformation
to the events. Concretely, it invokes the append
method of the comparator
instance on each event and delivers the returned value to subscribers. By
convention, when append
does not return a value, i.e., returns None
, no
event is delivered to subscribers. In this example, the first source event does
not generate a Slack message.
We can then connect the source and sink via this operator using code:
source >> operator >> sink
which is a shorthand for:
source.send_to(operator)
operator.send_to(sink)
Like subscribers, the argument to the add_operator
method may be a Python
function, a Ray task, a Ray actor, or a Ray actor method. Using an actor like
comparator
is shorthand for the actor method comparator.append
. Building an
operator stream from a Ray task is not recommended however as it may reorder
events arbitrarily.
We assume the SLACK_CHANNEL
and SLACK_WEBHOOK
environment variables contain
the necessary configuration parameters.
Run the example locally with:
python rayvens/examples/slack.py "$SLACK_CHANNEL" "$SLACK_WEBHOOK"
Run the example on Kind with:
ray submit rayvens.yaml rayvens/examples/slack.py "$SLACK_CHANNEL" "$SLACK_WEBHOOK"
A stream can have zero, one, or multiple sources, zero, one, or multiple sinks, zero or one operator. For instance, rather than using three stream instances to build our Slack example, we could do everything with a single stream as follows:
source_config = dict(
kind='http-source',
url='https://query1.finance.yahoo.com/v7/finance/quote?symbols=AAPL',
period=3000)
sink_config = dict(kind='slack-sink',
channel=slack_channel,
webhook_url=slack_webhook)
operator = rayvens.Stream('comparator',
source_config=source_config,
operator=operator,
sink_config=sink_config)
This reduces the number of stream actors to one down from three and significantly cut the number of remote invocations on the critical path hence reducing latency.
- Rayvens related blogs are published on Medium.
- The
rayvens-setup.sh
script is documented in setup.md. - The configuration of the Camel sources and sinks is explained in connectors.md.
Rayvens is an open-source project with an Apache 2.0 license.