
pyspark-loglikelihood

PySpark Loglikelihood Examples.

Inspired by Apache Mahout.
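At the core of both jobs is Dunning's log-likelihood ratio over a 2x2 contingency table of co-occurrence counts, the same measure Mahout uses. A minimal sketch of the computation (illustrative only, not this module's actual code):

import math

def entropy(*counts):
    # Unnormalized Shannon entropy of a list of counts.
    total = float(sum(counts))
    if total == 0:
        return 0.0
    return -sum(c * math.log(c / total) for c in counts if c > 0)

def log_likelihood_ratio(k11, k12, k21, k22):
    # k11: users who interacted with both items
    # k12: users who interacted with the first item only
    # k21: users who interacted with the second item only
    # k22: users who interacted with neither item
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    # Clamp at zero to absorb floating-point round-off.
    return max(0.0, 2.0 * (row_entropy + col_entropy - matrix_entropy))

Mahout's LogLikelihoodSimilarity maps the raw ratio into [0, 1) as 1 - 1/(1 + LLR); the scores in the sample output further below fall in that range, suggesting the same convention here.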

Installation

$ pip install https://github.com/talmago/pyspark-loglikelihood/archive/master.zip

NOTICE: pyspark-loglikelihood requires Python 2.7. pyenv and virtualenv are recommended for setting up an isolated Python environment.

Usage

After installation, use the spark-submit command line to run the item-similarity and user-similarity jobs on your dataset.

Item-Item Similarity (LogLikelihood)

$ spark-submit item_similarity.py \
               input.csv \
               output \
               --maxPrefs=10000 \
               --maxSimilaritiesPerItem 100

NOTICE: Each input line is expected to be a comma-separated USER_ID,ITEM_ID pair. Each output line consists of ITEM_ID1,ITEM_ID2,SCORE.
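For example, with hypothetical IDs and score, a few input lines and a matching output line would look like this:

# input.csv -- one USER_ID,ITEM_ID interaction per line
196,242
196,302
186,302

# output -- ITEM_ID1,ITEM_ID2,SCORE
242,302,0.973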

User-User Similarity (N-neighborhood + Loglikelihood)

$ spark-submit user_similarity.py \
               input.csv \
               output \
               --numOfNeighbors=40 \
               --numOfRecommendations 1000

NOTICE: Each input line is expected to be a comma-separated USER_ID,ITEM_ID pair. Each output line consists of USER_ID,ITEM_ID,SCORE.
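As the heading and the --numOfNeighbors/--numOfRecommendations options suggest, the user-user job combines an N-nearest-neighborhood with the same loglikelihood measure: rank other users by LLR similarity over their item sets, keep the top N neighbors, and score unseen items by the similarity of the neighbors who interacted with them. A minimal in-memory sketch of that idea (illustrative only, not the job's actual implementation; reuses log_likelihood_ratio from above):

import heapq

def recommend(user_items, num_items, num_neighbors=40, num_recommendations=1000):
    # user_items: dict mapping USER_ID -> set of ITEM_IDs.
    # num_items: size of the item universe, needed for the k22 cell.
    for user, items in user_items.items():
        # Score every other user against this one with the LLR measure.
        scored = []
        for other, other_items in user_items.items():
            if other == user:
                continue
            k11 = len(items & other_items)       # items both users touched
            k12 = len(items) - k11               # this user's items only
            k21 = len(other_items) - k11         # the other user's items only
            k22 = num_items - (k11 + k12 + k21)  # items neither touched
            scored.append((log_likelihood_ratio(k11, k12, k21, k22), other))
        # Keep only the N most similar neighbors.
        neighbors = heapq.nlargest(num_neighbors, scored)
        # Recommend unseen items, scored by the similarity of the neighbors holding them.
        candidates = {}
        for sim, other in neighbors:
            for item in user_items[other] - items:
                candidates[item] = candidates.get(item, 0.0) + sim
        top = heapq.nlargest(num_recommendations, candidates.items(), key=lambda kv: kv[1])
        for item, score in top:
            yield user, item, score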

Example

Run the example from the command line:

$ wget -O - https://raw.githubusercontent.com/talmago/pyspark-loglikelihood/master/examples/item-sim-ml-100l-dataset | bash -x
Step by Step

Step 1: Download and re-format the MovieLens 100k dataset.

$ wget -O - http://files.grouplens.org/datasets/movielens/ml-100k/u.data | cut -f1,2 | tr '\t' ',' > input.csv
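Each line of u.data is tab-separated with four fields (user id, item id, rating, timestamp); the pipeline keeps the first two fields and replaces the tab with a comma, so a line such as

196	242	3	881250949

becomes

196,242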

Step 2: Upload the dataset to HDFS

$ hadoop fs -rm -r /item-sim
$ hadoop fs -mkdir -p /item-sim
$ hadoop fs -moveFromLocal input.csv /item-sim/input.csv

Step 3: Run the item-similarity job on the HDFS dataset

$ spark-submit item_similarity.py \
               /item-sim/input.csv \
               /item-sim/output \
               --maxPrefs=10000 \
               --maxSimilaritiesPerItem 100

Step 4: Merge the output part files into a single CSV file

$ hadoop fs -getmerge /item-sim/output result.csv

Step 5: Analyze / process / visualize the result set

$ head result.csv
26,381,0.9889748
26,732,0.9876871
26,70,0.98738647
26,715,0.98685825
26,238,0.98625606
26,58,0.98580784
26,1,0.985786
26,83,0.9857064
26,88,0.9856318
26,367,0.9854448
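From here the merged CSV can be loaded with any tool. For instance, a quick look with pandas (pandas is not a dependency of this project, and the column names are assumptions):

import pandas as pd

# Load the merged similarity pairs; the file has no header row.
df = pd.read_csv("result.csv", names=["item_a", "item_b", "score"])

# The most similar item pairs overall.
print(df.sort_values("score", ascending=False).head(10))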