/ur-analysis-tools

MAP@k and other analysis tools

Primary LanguagePythonApache License 2.0Apache-2.0

Universal Recommender Analysis Tool V2

The primary change to V2 of the tool is to simplify by integrating all external processes so one line will execute all steps.

The steps to execute all workflow are:

  1. Verify the PIO EventServer is running pio status will do this.
  2. Export Data We assume that the data exists in the EventServer or in HDFS. Export the data to the location specified in the config.json file. Since it is often desirable to use an existing data split there should be a CLI option to skip splitting to use the paths in the config as they exist.
  3. Split the data as in Alexey's #1 (optional: see #2)
  4. Train and Deploy To do this 3 PIO workflow steps need to be taken. These must be executed inside the directory of the UR version we are testing.
    1. pio build This will create the Universal Recommender code and register the algorithm parameters
    2. pio train This will create a model with the UR from the engine.json parameters. There will be new parameters to pass in to pio train that are system dependent, like how much memory to use on each Spark Driver and Executor as well as others. Not sure how or where to set these. They will not change while using one cluster so perhaps in the config.json.
    3. pio deploy This will create a running PIO PredictionServer that responds to UR queries
  5. Analyze Same as Alexey's #3

Alexey's Analysis Tools

These scripts do analysis of input data and run cross-validation tests using the Universal Recommender. The analysis show information about usage data, and the CV tests use a technique called MAP@k to rest the predictiveness of the many types of indicators used in a deployment of the UR. The tools run in 2 phases:

  1. Split: Split the data into train and test splits, calculate usage stats.
  2. Train and Deploy: Import the "train" dataset into the EventServer, train, and deploy the model. This can also be scripted but is not part of these scripts.
  3. Analyze: Run several forms of MAP@k predictiveness tests on the splits, collect stats, and create a report for Excel.

Note: This toolset does #1 and #3 but does not do #2.

Since #1 can take a significant amount of time, it is advised to run it once, then train and deploy once, and run #3 with different configs if needed. There is no need to retrain in most cases. This will create a completely reproducible split so you can safely compare results from any run of #3.

If any change is made to the engine.json, repeat #1-3

Setup

Install Spark, PredictionIO v0.9.6 or greater, and the Universal Recommender v0.3.0 or greater. Make sure pio status completes with no errors and the integration-test for the UR runs correctly.

  1. Install Python and check the version

    python --v

    if the version is less than 2.7.9 upgrade to the most recent stable version of python using systems package management tools like apt-get for Ubuntu linux or brew for the Mac.

  2. Install Python libraries using the Python package manager found here

    sudo pip install numpy scipy pandas ml_metrics predictionio tqdm click openpyxl
    
  3. Setup Spark and Pyspark paths in .bashrc (linux) or .bash_profile Mac.

    export SPARK_HOME=/path/to/spark
    export PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/build/:$PYTONPAHTH
    

Run Analysis Script

Analysis script should be run from UR (Universal recommender) folder. It uses two configuration files:

  • engine.json (configuration of UR, this file is used to take event list and primary event)
  • config.json (all other configuration including engine.json path if necessary)

Configuration options

config.json has the following structure:

{
  "engine_config": "./engine.json",

  "splitting": {
    "version": "1",
    "source_file": "hdfs:...<PUT SOME PATH>...",
    "train_file": "hdfs:...<PUT SOME PATH>...train",
    "test_file": "hdfs:...<PUT SOME PATH>...test",
    "type": "date",
    "train_ratio": 0.8,
    "random_seed": 29750,
    "split_event": "<SOME NAME>"
  },

  "reporting": {
    "file": "./report.xlsx"
  },

  "testing": {
    "map_k": 10,
    "non_zero_users_file": "./non_zero_users.dat",
    "consider_non_zero_scores_only": true,
    "custom_combos": {
      "event_groups": [["ev2", "ev3"], ["ev6", "ev8", "ev9"]]
    }
  },

  "spark": {
    "master": "spark://<some-url>:7077"
  }
}
  • engine_config - file to be used as engine.json (see configuration of UR)
  • splitting - this section is about splitting data into train and test sets
    • version - version to append to train and test file names (may be helpful is different test with different split configurations are run)
    • source_file - file with data to be split
    • train_file - file with training data to be produced (note that version will be append to file name)
    • test_file - file with test data to be produced (note that version will be append to file name)
    • type - split type (can be time in this case eventTime will be used to make split or random)
    • train_ratio - float in (0..1), share of training samples
    • random_seed - seed for random split
    • split_event - in case of type = "date", this is event to use to look for split date, all events with this name are ordered by eventTime and time which devides all such events into first train_ratio and last (1 - train_ratio) sets is used to split all the rest data (events with all names) into training set and test set
  • reporting - reporting settings
    • file - excel file to write report to
    • csv_dir - directory name for csv reporting
    • use_uuid - append to csv file names unique uuid associated with script run (can be useful to manage different results and reports)
  • testing - this section is about different tests and how to perform them
    • map_k - maximum map @ k to be reported
    • non_zero_users_file - file to save users with scores != 0 after first run of test set with primary event, this set may be much smaller then initial set of users so saving it and reusing can save much time
    • consider_non_zero_scores_only (default: true) whether take into account only users with non-zero scores (i.e. users for which recommendations exist)
    • custom_combos
      • event_groups - groups of events to be additionally tested if necessary
  • spark - Apache Spark configuration
    • master - for now only master URL is configurable

Split the Data

Get data from the EventServer with:

pio export --output path/to/store/events

Use this command to run split of data into "train" and "test" sets

SPARK_HOME=/usr/local/spark PYTHONPATH=/usr/local/spark/python:/usr/local/spark/python/lib/py4j-0.9-src.zip ./map_test.py split

Additional options are available:

  • --csv_report - put report to csv file not excel
  • --intersections - calculate train / test event intersection data (Advanced)

Train a Model

The above command will create a test and training split in the location specified in config.json. Now you must import, setup engine.json, train and deploy the "train" model so the rest of the MAP@k tests will be able to query the model.

Test and Analysis

To run tests

SPARK_HOME=/usr/local/spark PYTHONPATH=/usr/local/spark/python:/usr/local/spark/python/lib/py4j-0.9-src.zip ./map_test.py test --all

Additional options are available and may be used to run not all test:

  • --csv_report -
  • --dummy_test - run dummy test
  • --separate_test - run test for each separate event
  • --all_but_test - run test with all events and tests with all but each every event
  • --primary_pairs_test - run tests with all pairs of events with primary event
  • --custom_combos_test - run custom combo tests as configured in config.json
  • --non_zero_users_from_file - use list of users from file prepared on previous script run to save time

Generated Report

Old ipython analysis script

This is not recommended old approach to run ipython notebook.

IPYTHON_OPTS="notebook" /usr/local/spark/bin/pyspark --master spark://spark-url:7077