Universal Recommender Analysis Tool V2
The primary change to V2 of the tool is to simplify by integrating all external processes so one line will execute all steps.
The steps to execute all workflow are:
- Verify the PIO EventServer is running
pio status
will do this. - Export Data We assume that the data exists in the EventServer or in HDFS. Export the data to the location specified in the
config.json
file. Since it is often desirable to use an existing data split there should be a CLI option to skip splitting to use the paths in the config as they exist. - Split the data as in Alexey's #1 (optional: see #2)
- Train and Deploy To do this 3 PIO workflow steps need to be taken. These must be executed inside the directory of the UR version we are testing.
pio build
This will create the Universal Recommender code and register the algorithm parameterspio train
This will create a model with the UR from theengine.json
parameters. There will be new parameters to pass in topio train
that are system dependent, like how much memory to use on each Spark Driver and Executor as well as others. Not sure how or where to set these. They will not change while using one cluster so perhaps in theconfig.json
.pio deploy
This will create a running PIO PredictionServer that responds to UR queries
- Analyze Same as Alexey's #3
Alexey's Analysis Tools
These scripts do analysis of input data and run cross-validation tests using the Universal Recommender. The analysis show information about usage data, and the CV tests use a technique called MAP@k to rest the predictiveness of the many types of indicators used in a deployment of the UR. The tools run in 2 phases:
- Split: Split the data into train and test splits, calculate usage stats.
- Train and Deploy: Import the "train" dataset into the EventServer, train, and deploy the model. This can also be scripted but is not part of these scripts.
- Analyze: Run several forms of MAP@k predictiveness tests on the splits, collect stats, and create a report for Excel.
Note: This toolset does #1 and #3 but does not do #2.
Since #1 can take a significant amount of time, it is advised to run it once, then train and deploy once, and run #3 with different configs if needed. There is no need to retrain in most cases. This will create a completely reproducible split so you can safely compare results from any run of #3.
If any change is made to the engine.json, repeat #1-3
Setup
Install Spark, PredictionIO v0.9.6 or greater, and the Universal Recommender v0.3.0 or greater. Make sure pio status
completes with no errors and the integration-test for the UR runs correctly.
-
Install Python and check the version
python --v
if the version is less than 2.7.9 upgrade to the most recent stable version of python using systems package management tools like
apt-get
for Ubuntu linux orbrew
for the Mac. -
Install Python libraries using the Python package manager found here
sudo pip install numpy scipy pandas ml_metrics predictionio tqdm click openpyxl
-
Setup Spark and Pyspark paths in
.bashrc
(linux) or.bash_profile
Mac.export SPARK_HOME=/path/to/spark export PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/build/:$PYTONPAHTH
Run Analysis Script
Analysis script should be run from UR (Universal recommender) folder. It uses two configuration files:
engine.json
(configuration of UR, this file is used to take event list and primary event)config.json
(all other configuration including engine.json path if necessary)
Configuration options
config.json has the following structure:
{
"engine_config": "./engine.json",
"splitting": {
"version": "1",
"source_file": "hdfs:...<PUT SOME PATH>...",
"train_file": "hdfs:...<PUT SOME PATH>...train",
"test_file": "hdfs:...<PUT SOME PATH>...test",
"type": "date",
"train_ratio": 0.8,
"random_seed": 29750,
"split_event": "<SOME NAME>"
},
"reporting": {
"file": "./report.xlsx"
},
"testing": {
"map_k": 10,
"non_zero_users_file": "./non_zero_users.dat",
"consider_non_zero_scores_only": true,
"custom_combos": {
"event_groups": [["ev2", "ev3"], ["ev6", "ev8", "ev9"]]
}
},
"spark": {
"master": "spark://<some-url>:7077"
}
}
- engine_config - file to be used as engine.json (see configuration of UR)
- splitting - this section is about splitting data into train and test sets
- version - version to append to train and test file names (may be helpful is different test with different split configurations are run)
- source_file - file with data to be split
- train_file - file with training data to be produced (note that version will be append to file name)
- test_file - file with test data to be produced (note that version will be append to file name)
- type - split type (can be time in this case eventTime will be used to make split or random)
- train_ratio - float in (0..1), share of training samples
- random_seed - seed for random split
- split_event - in case of type = "date", this is event to use to look for split date, all events with this name are ordered by eventTime and time which devides all such events into first train_ratio and last (1 - train_ratio) sets is used to split all the rest data (events with all names) into training set and test set
- reporting - reporting settings
- file - excel file to write report to
- csv_dir - directory name for csv reporting
- use_uuid - append to csv file names unique uuid associated with script run (can be useful to manage different results and reports)
- testing - this section is about different tests and how to perform them
- map_k - maximum map @ k to be reported
- non_zero_users_file - file to save users with scores != 0 after first run of test set with primary event, this set may be much smaller then initial set of users so saving it and reusing can save much time
- consider_non_zero_scores_only (default: true) whether take into account only users with non-zero scores (i.e. users for which recommendations exist)
- custom_combos
- event_groups - groups of events to be additionally tested if necessary
- spark - Apache Spark configuration
- master - for now only master URL is configurable
Split the Data
Get data from the EventServer with:
pio export --output path/to/store/events
Use this command to run split of data into "train" and "test" sets
SPARK_HOME=/usr/local/spark PYTHONPATH=/usr/local/spark/python:/usr/local/spark/python/lib/py4j-0.9-src.zip ./map_test.py split
Additional options are available:
--csv_report
- put report to csv file not excel--intersections
- calculate train / test event intersection data (Advanced)
Train a Model
The above command will create a test and training split in the location specified in config.json. Now you must import, setup engine.json, train and deploy the "train" model so the rest of the MAP@k tests will be able to query the model.
Test and Analysis
To run tests
SPARK_HOME=/usr/local/spark PYTHONPATH=/usr/local/spark/python:/usr/local/spark/python/lib/py4j-0.9-src.zip ./map_test.py test --all
Additional options are available and may be used to run not all test:
--csv_report
---dummy_test
- run dummy test--separate_test
- run test for each separate event--all_but_test
- run test with all events and tests with all but each every event--primary_pairs_test
- run tests with all pairs of events with primary event--custom_combos_test
- run custom combo tests as configured in config.json--non_zero_users_from_file
- use list of users from file prepared on previous script run to save time
Generated Report
Old ipython analysis script
This is not recommended old approach to run ipython notebook.
IPYTHON_OPTS="notebook" /usr/local/spark/bin/pyspark --master spark://spark-url:7077