PredictionIO is an open source machine learning framework.
Two apps are composed to make a basic PredictionIO service:
- Engine: a specialized machine learning app which provides training of a model and then queries against that model; generated from a template or custom code.
- Eventserver: a simple HTTP API app for capturing events to process from other systems; shareable between multiple engines.
✏️ Throughout these docs, code terms that start with $ represent a value (shell variable) that should be replaced with a customized value, e.g. $eventserver_name, $engine_name, $postgres_addon_id…
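One way to apply this convention, shown as a minimal sketch, is to define the placeholders as real shell variables once per terminal session so that later commands can be pasted unchanged. The app names below are hypothetical, and the commands are only previewed with echo rather than executed.

```shell
# Hypothetical values; substitute the names of your own apps.
eventserver_name="my-pio-eventserver"
engine_name="my-pio-engine"

# Preview how the placeholders expand before running anything for real.
eventserver_preview="heroku create $eventserver_name"
engine_preview="heroku create $engine_name"
echo "$eventserver_preview"
echo "$engine_preview"
```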
Two styles of deployment are possible on Heroku.
Use PredictionIO engines with a scalable Spark cluster.
Deploy spark-in-space into a Private Space.
The eventserver & engine apps must be created with the --space option set to the name of a Private Space:
heroku create $eventserver_name --space $space_name
heroku create $engine_name --space $space_name
🚨 A database connection is required during build. Stateless builds that would remove this requirement are under discussion on the Apache Software Foundation Spark users mailing list.
When creating the eventserver's database, a few extra arguments are required to attach to the Private Space.
heroku addons:create heroku-postgresql:standard-0 --region=us -a $eventserver_name --confirm $eventserver_name
Engines must be pointed at the Spark master. Include the --master option along with any other Spark options:
heroku config:set \
PIO_TRAIN_SPARK_OPTS='--master spark://1.master.$spark_master_name.app.localspace:7077' \
PIO_SPARK_OPTS='--master spark://1.master.$spark_master_name.app.localspace:7077'
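If $spark_master_name is kept as an actual shell variable instead of being replaced by hand, note that the single quotes in the command above would stop the shell from expanding it. The following sketch, which assumes a hypothetical master app name, builds the option string with double quotes and previews the resulting command.

```shell
# Hypothetical name of the spark-in-space master app.
spark_master_name="my-spark-master"

# Double quotes let the shell expand the variable; inside single quotes
# the literal text $spark_master_name would be sent to Heroku instead.
spark_master_url="spark://1.master.${spark_master_name}.app.localspace:7077"
spark_opts="--master ${spark_master_url}"

# Preview the command (remove the echo to run it).
echo heroku config:set \
  "PIO_TRAIN_SPARK_OPTS=${spark_opts}" \
  "PIO_SPARK_OPTS=${spark_opts}"
```

Keeping the URL in one variable means both config vars stay in sync if the master app is renamed.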
This buildpack supports deploying PredictionIO engines on a single dyno outside a Private Space.
This approach runs Spark within the same process as PredictionIO. It is only recommended for experimental, proof-of-concept work. The limited resources of a single dyno rule out the large datasets that are typically needed for statistically significant results.
Only Performance-L dynos with 14GB RAM (currently $16/day) provide reasonable utility in this configuration.
git clone https://github.com/heroku/heroku-buildpack-pio.git pio-eventserver
cd pio-eventserver
heroku create $eventserver_name
heroku addons:create heroku-postgresql:standard-0
heroku buildpacks:add -i 1 https://github.com/heroku/heroku-buildpack-pio.git
heroku buildpacks:add -i 2 https://github.com/heroku/spark-in-space.git
heroku buildpacks:add -i 3 heroku/scala
- Note the Postgres add-on identifier, e.g. postgresql-aerodynamic-00000; use it below in place of $postgres_addon_id
- We specify a standard-0 database, because the free hobby-dev database is limited to 10,000 records.
We delay deployment until the database is ready.
heroku pg:wait && git push heroku master
Install PredictionIO locally and download an engine template from the gallery. This can be as simple as downloading the source from GitHub and expanding it on your local computer.
cd into the engine directory, and ensure it is a git repo:
git init
heroku create $engine_name
heroku buildpacks:add -i 1 https://github.com/heroku/heroku-buildpack-jvm-common.git
heroku buildpacks:add -i 2 https://github.com/heroku/heroku-buildpack-pio.git
heroku buildpacks:add -i 3 https://github.com/heroku/spark-in-space.git
heroku run 'pio app new $pio_app_name' -a $eventserver_name
- This returns an access key for the app; use it below in place of $pio_app_access_key.
Replace the Postgres ID & eventserver config values with those from above:
heroku addons:attach $postgres_addon_id
heroku config:set \
PIO_EVENTSERVER_HOSTNAME=$eventserver_dns_name \
PIO_EVENTSERVER_PORT=80 \
PIO_EVENTSERVER_ACCESS_KEY=$pio_app_access_key \
PIO_EVENTSERVER_APP_NAME=$pio_app_name
- See environment variables for details about setting PIO_EVENTSERVER_HOSTNAME.
Modify engine.json to make sure the appName parameter matches the app record created in the eventserver.
"datasource": {
  "params": {
    "appName": "$pio_app_name"
  }
}
- If the appName param is missing, you may need to upgrade the template.
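Before deploying, the appName setting can be sanity-checked locally. The sketch below is one possible approach: it writes a stand-in engine.json into a scratch directory (the content and the app name are hypothetical), pulls out the appName value with sed, and compares it with the expected name. The sed expression is a rough text match, not a JSON parser.

```shell
# Scratch directory with a stand-in engine.json (hypothetical content).
workdir="$(mktemp -d)"
cat > "$workdir/engine.json" <<'EOF'
{
  "datasource": {
    "params": {
      "appName": "my-pio-app"
    }
  }
}
EOF

# Hypothetical; the name that was given to `pio app new`.
pio_app_name="my-pio-app"

# Extract the first appName value found in the file.
found_app_name="$(sed -n 's/.*"appName"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$workdir/engine.json" | head -n 1)"

if [ "$found_app_name" = "$pio_app_name" ]; then
  app_name_check="match"
else
  app_name_check="mismatch"
fi
echo "engine.json appName: $found_app_name ($app_name_check)"

rm -rf "$workdir"
```

In a real engine directory, point sed at the existing engine.json instead of creating a scratch copy.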
This step will vary based on the engine. See the template's docs for instructions.
git add .
git commit -m "Initial PIO engine"
git push heroku master
🚨 Private Spaces do not currently support the release-phase script for automatic training. See: Manual training.
pio train runs automatically during the release phase of the Heroku app.
The release dyno size should be set to a larger dyno, like Performance-L:
heroku ps:scale release=0:Performance-L
Auto training may be disabled with:
heroku config:set PIO_TRAIN_ON_RELEASE=false
heroku run train
# You may need to revive the app from "crashed" state.
heroku restart
PredictionIO provides an Evaluation mode for engines, which uses cross-validation to help select optimum engine parameters.
Only engines that include src/main/scala/Evaluation.scala support Evaluation mode.
To run evaluation on Heroku, ensure src/main/scala/Evaluation.scala reads the PredictionIO app name from the environment. Check the source file to verify that appName is set to sys.env("PIO_EVENTSERVER_APP_NAME"). For example:
DataSourceParams(appName = sys.env("PIO_EVENTSERVER_APP_NAME"), evalK = Some(5))
♻️ If that change was made, then commit, deploy, & re-train before proceeding.
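A quick local check for this setting could look like the sketch below. It creates a stand-in Evaluation.scala in a scratch directory (the file content is only the example line from above) and searches it for the environment lookup with a fixed-string grep.

```shell
# Scratch copy of the source layout with a stand-in Evaluation.scala.
workdir="$(mktemp -d)"
mkdir -p "$workdir/src/main/scala"
cat > "$workdir/src/main/scala/Evaluation.scala" <<'EOF'
DataSourceParams(appName = sys.env("PIO_EVENTSERVER_APP_NAME"), evalK = Some(5))
EOF

# -F treats the pattern as a literal string, so the parentheses and quotes need no escaping.
if grep -qF 'sys.env("PIO_EVENTSERVER_APP_NAME")' "$workdir/src/main/scala/Evaluation.scala"; then
  eval_app_name_source="environment"
else
  eval_app_name_source="hard-coded or missing"
fi
echo "Evaluation.scala appName source: $eval_app_name_source"

rm -rf "$workdir"
```

Run the same grep against the engine's own src/main/scala/Evaluation.scala to check a real template.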
Next, start a console & change to the engine's directory:
heroku run bash
$ cd pio-engine/
Then, start the process, specifying the evaluation & engine params classes from the Evaluation.scala source file. For example:
$ pio eval \
org.template.classification.AccuracyEvaluation \
org.template.classification.EngineParamsList \
-- --driver-class-path /app/lib/postgresql_jdbc.jar
Once pio eval completes, still in the Heroku console, copy the contents of best.json:
$ cat best.json
♻️ Paste into your local engine.json, commit, & deploy.
Engine deployments honor the following config vars:
- PIO_OPTS
  - options passed as pio $opts
  - example:
    heroku config:set PIO_OPTS='--variant best.json'
- PIO_SPARK_OPTS & PIO_TRAIN_SPARK_OPTS
  - deploy & training options passed through to spark-submit $opts
  - example:
    heroku config:set \
      PIO_SPARK_OPTS='--total-executor-cores 2 --executor-memory 1g' \
      PIO_TRAIN_SPARK_OPTS='--total-executor-cores 8 --executor-memory 4g'
- PIO_EVENTSERVER_HOSTNAME
  - in Private Space: web.$eventserver_name.app.localspace
  - in Common Runtime: $eventserver_name.herokuapp.com
- PIO_EVENTSERVER_PORT
  - always 80 for Heroku apps
- PIO_EVENTSERVER_APP_NAME & PIO_EVENTSERVER_ACCESS_KEY
  - generated by running pio app new $pio_app_name on the eventserver
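The two hostname forms for PIO_EVENTSERVER_HOSTNAME can be captured in a small helper. This is only a sketch: the function name and the private/common labels are made up for illustration, and the app name is hypothetical.

```shell
# Print the eventserver hostname for a given runtime.
# $1 = eventserver app name, $2 = "private" or "common" (labels made up for this sketch)
pio_eventserver_hostname() {
  case "$2" in
    private) printf 'web.%s.app.localspace\n' "$1" ;;
    common)  printf '%s.herokuapp.com\n' "$1" ;;
    *)       echo "unknown runtime: $2" >&2; return 1 ;;
  esac
}

eventserver_name="my-pio-eventserver"   # hypothetical
private_host="$(pio_eventserver_hostname "$eventserver_name" private)"
common_host="$(pio_eventserver_hostname "$eventserver_name" common)"
echo "$private_host"
echo "$common_host"
```

The printed value is what would be passed as PIO_EVENTSERVER_HOSTNAME in heroku config:set.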
pio commands that require DB access will need to have the driver specified as an argument (bug with PIO 0.9.5 + Spark 1.6.1):
pio $command -- --driver-class-path /app/lib/postgresql_jdbc.jar
heroku run "cd pio-engine && pio $command -- --driver-class-path /app/lib/postgresql_jdbc.jar"
Check engine status:
heroku run "cd pio-engine && pio status -- --driver-class-path /app/lib/postgresql_jdbc.jar"
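Because the driver argument has to be repeated for every such command, it may be convenient to wrap it in a helper. The function below is a hypothetical sketch, not part of the buildpack: it only assembles the remote command string, which can then be handed to heroku run.

```shell
# Build the remote command for a pio subcommand, appending the JDBC driver workaround.
# Hypothetical helper; its name is not part of the buildpack or of PredictionIO.
pio_heroku_cmd() {
  printf 'cd pio-engine && pio %s -- --driver-class-path /app/lib/postgresql_jdbc.jar' "$*"
}

status_cmd="$(pio_heroku_cmd status)"

# Preview the full invocation; run it for real with: heroku run "$status_cmd"
echo "heroku run \"$status_cmd\""
```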