r2v: A Python repository from cedias

===========================================================
UNIFIED VECTOR SPACE MODEL FOR ENRICHED RECOMMENDATION CODE
===========================================================

-----------------------
REQUIREMENTS & INSTALL
-----------------------

Tested with Python 3.4 on the latest Debian and the latest Linux Mint

Requirements:
    - gcc
    - Scipy
    - Numpy
    - Gensim
    - SQLITE3
    - Scikit-Learn
    - Wordcloud (https://github.com/amueller/word_cloud)

To install:

1) Install dependencies

    pip install -r requirements.txt
    or
    pip3 install -r requirements.txt

2) Compile C word2vec:

    cd d2v
    make


-----------------------
QUICK TEST SCRIPTS
-----------------------


Recommender System:
-----------------------

/!\ NOTE: THE DEMO TRAINS FOR ONLY ONE ITERATION TO SAVE TIME (results are already quite close to those in the paper). /!\

python3.4 demo_reco.py [--threads <int> (default = 5)]

Command sequence:

    1 - Download dataset (~400 MB)
    wget "http://95.85.49.48/ratebeer.txt.gz"

    2 - Build database
    python3.4 buildDatabase.py --encoding ascii --gz ratebeer.txt.gz ratebeer ratebeer.db

    3 - Format for d2v
    python3.4 db_to_R2V.py --min_count 10000 ratebeer.db ratebeer-10k.txt

    4 - Learn Model
    ./d2v/d2v -train ratebeer-10k.txt -sentence-vectors 1 -size 200 -window 10 -cbow 0 -min-count 0 -sample 10e-4 -negative 5 -threads 5 -binary 1 -iter 1 -alpha 0.08 -output rb.d2v

    5 - Predict ratings
    python3.4 predict_rating.py --k 25 --neigh user rb.d2v ratebeer.db

    6 - Predict reviews (not in demo script)
    python3.4 predict_review.py --neigh user rb.d2v ratebeer.db
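
The six steps above can also be driven from Python. The sketch below is not demo_reco.py itself: the helper names reco_commands and run_all are hypothetical, and the commands are copied from the sequence above with only -threads and -iter made adjustable. It prints the commands by default and runs them only when dry_run is False.

```python
import shlex
import subprocess


def reco_commands(threads=5, iterations=1):
    """Build the six commands listed above as argument lists."""
    train = (
        "./d2v/d2v -train ratebeer-10k.txt -sentence-vectors 1 -size 200 "
        "-window 10 -cbow 0 -min-count 0 -sample 10e-4 -negative 5 "
        "-threads {threads} -binary 1 -iter {iters} -alpha 0.08 -output rb.d2v"
    ).format(threads=threads, iters=iterations)
    lines = [
        'wget "http://95.85.49.48/ratebeer.txt.gz"',
        "python3.4 buildDatabase.py --encoding ascii --gz ratebeer.txt.gz ratebeer ratebeer.db",
        "python3.4 db_to_R2V.py --min_count 10000 ratebeer.db ratebeer-10k.txt",
        train,
        "python3.4 predict_rating.py --k 25 --neigh user rb.d2v ratebeer.db",
        # Step 6 is not part of the demo script; it is included here for completeness.
        "python3.4 predict_review.py --neigh user rb.d2v ratebeer.db",
    ]
    return [shlex.split(line) for line in lines]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.check_call(cmd)  # stops at the first failing step


if __name__ == "__main__":
    run_all(reco_commands(threads=5, iterations=1))
```

Calling run_all(reco_commands(iterations=10), dry_run=False) would train for more than the single iteration used by the demo.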


Sentiment Treebank:
--------------------

python3.4 demo_treebank.py [--threads <int> (default = 5)]

Command sequence:

    1 - Format for d2v
    python3.4 treebank_to_R2V.py --output treebank.d2v Data/stanfordSentimentTreebank

    2 - Learn Model
    ./d2v/d2v -train treebank.d2v -output treebank_d2v.bin -binary 1 -hs 0 -window 10 -sample 0 -min-count 0 -negative 15 -sentence-vectors 1 -cbow 0 -iter 10 -threads 5

    3 - Predict accuracy
    python3.4 treebank_results.py treebank_d2v.bin


-------------------
SCRIPT LIST
-------------------

buildDatabase.py
usage: buildDatabase.py [--encoding ENCODING] [--gz] data type output

=> Script to build a sqlite3 database from a dataset of the given type: ratebeer/beeradvocate/amazon/amazonjson/yelp
    --> pass --gz if the data file is gzip-compressed
    --> the input encoding can be specified with --encoding
------------------

db_to_R2V.py
usage: db_to_R2V.py [--min_count MIN_COUNT] [--min_sent_size MIN_SENT_SIZE] [--buff_size BUFF_SIZE] db output

=> Converts a sqlite3 database (db) to a file (output) in the d2v input format - Label Text
    --> remove words appearing fewer than min_count times
    --> remove sentences shorter than min_sent_size
    --> shuffle per buffer size
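
A minimal sketch of the three filters above, not the code of db_to_R2V.py. It assumes the input is already a list of (label, tokens) pairs, that sentence size is counted in tokens, and that each output line is the label followed by the text. The function name to_d2v_lines and labels such as "u_1" are hypothetical.

```python
import random
from collections import Counter


def to_d2v_lines(labelled_sentences, min_count=1, min_sent_size=1,
                 buff_size=1000, seed=None):
    """Return 'label text' lines after filtering, shuffled buffer by buffer."""
    counts = Counter(tok for _, toks in labelled_sentences for tok in toks)
    rng = random.Random(seed)
    out, buffer = [], []
    for label, toks in labelled_sentences:
        # Drop rare words first, then drop sentences that became too short.
        kept = [t for t in toks if counts[t] >= min_count]
        if len(kept) < min_sent_size:
            continue
        buffer.append(label + " " + " ".join(kept))
        if len(buffer) >= buff_size:
            rng.shuffle(buffer)  # shuffle only within the current buffer
            out.extend(buffer)
            buffer = []
    rng.shuffle(buffer)  # flush the last, partially filled buffer
    out.extend(buffer)
    return out
```

Shuffling per buffer keeps memory bounded: only buff_size lines are held at once, at the cost of a shuffle that is local rather than global.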
-----------------

predict_one.py
usage: predict_one.py [--n N] [--neigh NEIGH] [--mean_center] model db user item

=> Output a recommendation for (user,item) pair using (model) and (db)
    --> output n sentences as a review prediction
    --> use user or item as neighbour similarity
    --> mean normalize ratings
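
The rating part of such a prediction can be sketched as a similarity-weighted average over the nearest neighbours. This is an illustration under assumptions, not the code of predict_one.py: it assumes user vectors compared by cosine similarity, and the names predict_rating, vectors, ratings_on_item and user_means are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if one is null)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def predict_rating(target, vectors, ratings_on_item, user_means=None, k=25):
    """Predict the rating of `target` from the k most similar users.

    vectors:         user id -> embedding
    ratings_on_item: user id -> rating that user gave to the item
    user_means:      user id -> mean rating; enables mean centering when given
    """
    sims = sorted(
        ((cosine(vectors[target], vectors[u]), u)
         for u in ratings_on_item if u != target),
        reverse=True,
    )[:k]
    num = den = 0.0
    for s, u in sims:
        offset = user_means[u] if user_means else 0.0
        num += s * (ratings_on_item[u] - offset)
        den += abs(s)
    base = user_means[target] if user_means else 0.0
    if den == 0.0:
        # No usable neighbour: fall back to the user's mean, if known.
        return base if user_means else None
    return base + num / den
```

With mean centering, each neighbour contributes its deviation from its own average, so generous and harsh raters are put on the same scale.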
----------------

predict_review.py
usage: predict_review.py [--neigh NEIGH] model db

=> Output Mean Rouge for full review prediction (model) on the test reviews in (db)
    --> use user or item as neighbour similarity
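
For reference, a mean Rouge score can be sketched as below. This is the recall-oriented Rouge-n over whitespace tokens; the variant and tokenization used by predict_review.py may differ, and the names rouge_n and mean_rouge are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Fraction of reference n-grams that also appear in the candidate."""
    ref = ngrams(reference.split(), n)
    cand = ngrams(candidate.split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clipped overlap: an n-gram counts at most as often as the candidate has it.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


def mean_rouge(pairs, n=1):
    """Average Rouge-n over (candidate, reference) pairs."""
    return sum(rouge_n(c, r, n) for c, r in pairs) / len(pairs)
```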
----------------

predict_rating.py
usage: predict_rating.py [--neigh NEIGH] model db

=> Output MSE for (model) on the test reviews in (db)
    --> use user or item as neighbour similarity
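
The MSE reported here is the mean of the squared differences between predicted and true ratings. A minimal sketch, with a hypothetical function name:

```python
def mean_squared_error(predicted, actual):
    """Mean of squared differences between two equal-length rating lists."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of the same length")
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
```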
----------------

predict_sentences.py
usage: predict_sentences.py [--n N] [--neigh NEIGH] model db

=> Output Mean Rouge for multi-sent prediction (model) on the test reviews in (db)
    --> use user or item as neighbour similarity
----------------

generate_wordcloud.py
usage: generate_wordcloud.py [--n N] model word

=> Generates wordcloud for word or label (word)
    --> takes the n closest words
----------------

auto_wordclouds.py
usage: auto_wordclouds.py [--n N] model name_model

=> Generates wordclouds for the 5 ratings in (model) and outputs 5 files named name_model_<rating>.png
    --> takes the n closest words

----------------

DB_baselines.py
usage: DB_baselines.py [--latent LATENT] [--epochs EPOCHS] [--alpha ALPHA] [--reg REG] db

=> Compute Classical baselines on (db) test set
    --> latent space size
    --> number of training epochs
    --> gradient step
    --> regularization strength
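
The four options above match the usual hyper-parameters of a matrix factorization trained by stochastic gradient descent. The sketch below assumes that this is one of the baselines; the README only says "classical baselines", so it illustrates the parameters and is not the code of DB_baselines.py. The function name train_mf is hypothetical.

```python
import random


def train_mf(ratings, latent=5, epochs=20, alpha=0.01, reg=0.05, seed=0):
    """Fit r ~ mu + p_u . q_i by SGD on (user, item, rating) triples.

    latent: latent space size      epochs: passes over the data
    alpha:  gradient step          reg:    L2 regularization strength
    Returns a predict(user, item) function.
    """
    rng = random.Random(seed)
    data = list(ratings)
    mu = sum(r for _, _, r in data) / len(data)  # global mean rating
    users = sorted({u for u, _, _ in data})
    items = sorted({i for _, i, _ in data})
    P = {u: [rng.gauss(0, 0.1) for _ in range(latent)] for u in users}
    Q = {i: [rng.gauss(0, 0.1) for _ in range(latent)] for i in items}
    for _ in range(epochs):
        rng.shuffle(data)
        for u, i, r in data:
            pu, qi = P[u], Q[i]
            err = r - (mu + sum(a * b for a, b in zip(pu, qi)))
            for f in range(latent):
                puf, qif = pu[f], qi[f]
                pu[f] += alpha * (err * qif - reg * puf)
                qi[f] += alpha * (err * puf - reg * qif)

    def predict(user, item):
        # Unknown users or items fall back to the global mean.
        if user not in P or item not in Q:
            return mu
        return mu + sum(a * b for a, b in zip(P[user], Q[item]))

    return predict
```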
----------------

rouge_baseline.py
usage: rouge_baseline.py [--neigh NEIGH] [--n N] db

=> Compute Rouge-1,2,3 baseline on (db) test set
    --> using user or item similarity
    --> with 0, 1 or multiple (>= 2) sentences /!\ can take a very long time /!\
----------------

treebank_to_R2V.py
usage: treebank_to_R2V.py [--output OUTPUT] [--classes CLASSES] [--full_sentences FULL_SENTENCES] datafolder

=> Format treebank in (datafolder) for d2v format
    --> choose output
    --> number of classes (2 or 5)
    --> train sentences include labels for near sentence prediction
----------------

treebank_results.py
usage: treebank_results.py [--classes CLASSES] [--near_sent NEAR_SENT] model

=> Compute Treebank accuracy
    --> 2 or 5 classes
    --> near sentences instead of near opinion labels
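
Classifying by the nearest opinion label can be sketched as follows. This assumes that each sentence and each opinion label has a vector in the same space and that "near" means highest cosine similarity; the actual selection rule in treebank_results.py may differ. The function names and the label names "pos" and "neg" are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if one is null)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def nearest_label(sentence_vec, label_vecs):
    """Return the label whose vector is most similar to the sentence vector."""
    return max(sorted(label_vecs),
               key=lambda lab: cosine(sentence_vec, label_vecs[lab]))


def accuracy(sentence_vecs, gold_labels, label_vecs):
    """Share of sentences whose nearest label equals the gold label."""
    hits = sum(1 for vec, gold in zip(sentence_vecs, gold_labels)
               if nearest_label(vec, label_vecs) == gold)
    return hits / len(gold_labels)
```

For the 5-class setting, label_vecs would simply hold five entries instead of two.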


------------------
CREDITS
------------------
-> Code for d2v from Tomas Mikolov et al. - https://code.google.com/p/word2vec/
-> Code to load sentiment treebank from Thomas Moreau - https://github.com/tomMoral/sentana