Dean Wampler, May 2020
This repo contains Dean's demo for his talk at Spark + AI Summit 2020.
You will need Java 8 for Spark.
Install Java 8 from here. Once installed, verify the version:
$ java -version
java version "1.8.0_221"
Java(TM) SE Runtime Environment (build 1.8.0_221-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.221-b11, mixed mode)
(The patch version 221 may be different.)
Python 3.6 or 3.7 is required. PySpark doesn't yet support 3.8.
We recommend using Anaconda. Installation instructions are here.
Then run the following commands to set up the demo environment, including Python 3.7, PySpark, and Ray:
$ conda env create -f environment.yml
$ conda activate spark-ray
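For reference, a conda environment file for this kind of setup might look like the following. The authoritative pins live in the repo's environment.yml; the package list and versions below are illustrative assumptions only:

```yaml
name: spark-ray
channels:
  - conda-forge
dependencies:
  - python=3.7
  - pip
  - pip:
      - pyspark
      - ray
      - jupyterlab
```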
If you prefer to use pip instead, first ensure you have a supported version of Python installed:
$ python --version
Python 3.7.7
If needed, Python installation instructions are here.
Installation instructions for pip are here.
Now run the following command to complete the setup:
$ pip install -r requirements.txt
First, start Jupyter Lab as follows:
$ jupyter lab
A browser window will open with the UI.
Double-click Spark-RayUDF.ipynb to open the notebook and work through the self-contained example.
There is a second implementation (for completeness), where the UDF makes HTTP requests to a separate service, implemented with Ray Serve. In other words, Ray isn't embedded in the PySpark UDF for this example. The service invoked could be implemented in any technology.
To run this variant, first evaluate all the cells in the ray-serve/DataGovernanceServer.ipynb notebook. This starts the server that handles data governance requests.
Then work through the ray-serve/Spark-RayServiceUDF.ipynb notebook.
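The service boundary in this variant can be sketched with nothing but the Python standard library: a throwaway HTTP server stands in for the Ray Serve endpoint (the route, payload, and field names here are assumptions, not the demo's actual API), and a plain function plays the role the PySpark UDF would fill.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

class GovernanceHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the Ray Serve data-governance endpoint."""
    def do_GET(self):
        key = self.path.rsplit("/", 1)[-1]          # e.g. /governance/user-42
        body = json.dumps({"key": key, "status": "logged"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging

# Bind to an ephemeral port and serve in the background.
server = HTTPServer(("127.0.0.1", 0), GovernanceHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

def governance_udf(key):
    # A plain function; wrapping it with pyspark.sql.functions.udf would let
    # each Spark task call the external service over HTTP.
    with urlopen(f"http://127.0.0.1:{port}/governance/{key}") as resp:
        return json.loads(resp.read())["status"]

print(governance_udf("user-42"))  # logged
```

Because Ray isn't embedded in the UDF here, the Spark job only needs an HTTP client; the service behind the URL could be swapped for any implementation, as the text above notes.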
- https://ray.io: The entry point for all things Ray
- Tutorials: https://anyscale.com/academy
- GitHub repo: https://github.com/ray-project/ray
- Need help?
  - Ray Slack: https://ray-distributed.slack.com
  - The ray-dev Google group
- Check out these forthcoming events online: https://anyscale.com/events