Group 16 (ordered alphabetically)
- Bian WU [Email] [GitHub]
- Lingzhi CAI [Email] [GitHub]
- Shenggui LI [Email] [GitHub]
- Yanxi ZENG [Email] [GitHub]
- Yuanming LI [Email] [GitHub]
Please make sure you have Conda installed and at least one CUDA device available.
# install Conda env
conda env create -f environment.yml
# activate Conda env
conda activate ntu-dm
# install PyTorch with your specific CUDA version
conda install pytorch cudatoolkit=<your cuda version> -c pytorch -y
# create data folders
mkdir -p {data,output}
# set PYTHONPATH before running
export PYTHONPATH="${PWD}"
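Once the environment is set up, you can sanity-check that PyTorch actually sees your CUDA device (a quick check, not part of the original setup steps):

```python
import torch

# Prints True (and a device name) if the cudatoolkit version you picked
# during the conda install matches your driver.
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```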
-
Download dataset
We are using the Yelp dataset provided by Kaggle. The dataset contains 5 JSON files, 8 GB after unzipping. Please download the data from Yelp or Kaggle, move it into the data folder, and unzip it:
mv yelp-dataset.zip data/
cd data
unzip yelp-dataset.zip
cd -
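Each file in the Yelp dataset is in JSON Lines format (one JSON object per line), so you can inspect a single record without loading a whole file. A minimal sketch, using a made-up record in place of a real line:

```python
import json

# A made-up record standing in for one line of a real file such as
# data/yelp_academic_dataset_review.json (exact filename may differ by release).
sample_line = '{"review_id": "abc123", "stars": 4.0, "useful": 2}'
record = json.loads(sample_line)
print(sorted(record.keys()))  # ['review_id', 'stars', 'useful']
```

In practice you would read the first line of a file (e.g. `json.loads(next(open(path)))`) to see its fields.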
-
Download nltk model
python -c "import nltk; nltk.download('stopwords'); nltk.download('punkt')"
-
Download pre-trained weights
Download the weights and data, then move and unzip them:
mv 4032-data.zip data/
cd data
unzip 4032-data.zip
cd -
You should seek help if you do not have a high-performance computer with more than 40 GB of memory, as creating the database from the Yelp dataset involves processing large data frames. If you do not have access to such resources, please drop us an email to request the Linux-built SQLite database file. The reason we do not provide a processed file via Google Drive is the following clauses from Section 4 of the Yelp Dataset Terms of Use:
A. display, perform, or distribute any of the Data, or use the Data to update or create your own business listing information (i.e. you may not publicly display any of the Data to any third party, especially reviews and other user generated content, as this is a private data set challenge and not a license to compete with or disparage with Yelp);
...
E. create, redistribute or disclose any summary of, or metrics related to, the Data (e.g., the number of reviewed business included in the Data and other statistical analysis) to any third party or on any website or other electronic media not expressly covered by this Agreement, this provision however, excludes any disclosures necessary for academic purposes, including without limitation the publication of academic articles concerning your use of the Data;
...
H. rent, lease, sell, transfer, assign, or sublicense, any part of the Data;
...
I. modify, rate, rank, review, vote or comment on, or otherwise respond to the content contained in the Data;
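If memory is tight, one common mitigation is to stream the JSON Lines files in chunks with pandas rather than loading them whole. This is a sketch only (the file contents and chunk size below are illustrative, not from the repo; the real files would warrant chunks of ~100,000 rows):

```python
import json
import os
import tempfile

import pandas as pd

# Write a tiny stand-in for one of the Yelp JSON Lines files.
path = os.path.join(tempfile.mkdtemp(), "review.json")
with open(path, "w") as f:
    for i in range(5):
        f.write(json.dumps({"review_id": str(i), "useful": i}) + "\n")

# Stream the file chunk by chunk instead of loading it all at once.
total = 0
for chunk in pd.read_json(path, lines=True, chunksize=2):
    total += len(chunk)  # each chunk is a small DataFrame, discarded after use
print(total)  # 5
```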
# create SQLite database
python scripts/create_tables.py
# load data into the database (slow)
python scripts/load_data.py
# process dataset (very slow)
python scripts/process_dataset.py
# pretrain model
python scripts/pretrain-model.py
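After create_tables.py and load_data.py finish, a quick way to sanity-check the result is to list the tables in the SQLite file. The database path below is an assumption; check scripts/create_tables.py for the actual location:

```python
import os
import sqlite3

# List the tables present in the SQLite database (path is an assumption).
os.makedirs("data", exist_ok=True)
con = sqlite3.connect("data/yelp.db")
tables = [row[0] for row in
          con.execute("SELECT name FROM sqlite_master WHERE type='table'")]
print(tables)
con.close()
```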
Note: Your output files are all in output/
Note: SVM and XGBoost may take a very long time during prediction and testing.
-
Train XGBoost model for predicting usefulness
python train-statistical-learning-models.py xgboost
-
Train SVM model for predicting usefulness
python train-statistical-learning-models.py svm
-
Train Logistic Regression model for predicting usefulness
python train-statistical-learning-models.py logistic
-
Predict summary report
python helper.py pred-statistical <path/to/saved/model.pkl>
-
Plot ROC graph
python helper.py plot-roc <path/to/saved/model.pkl>
Note: when you run it in a shell, you should enable an X server first.
-
Train user elite classification
python train-user-elite.py
You can see the script's arguments with:
python train-user-elite.py -h
-
Train LSTM usefulness classification
python train_text_lstm.py
You can see the script's arguments with:
python train_text_lstm.py -h
-
Train multimodal classifier using the pretrained LSTM and user-elite models
For the pretrained TextLSTM model, you need to put mapping.pickle, pretrained_weights.npy, and useful_pred_lstm_weights.pth in the ./data folder.
python train-multimodal-classifier.py
You may want to change the configuration by supplying another configuration file:
python train-multimodal-classifier.py --config=<path/to/config.yaml>
You can see the script's arguments with:
python train-multimodal-classifier.py -h
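A custom config.yaml can be parsed with PyYAML; the keys below are hypothetical stand-ins (the real ones live in the repo's default configuration file):

```python
import yaml  # PyYAML

# Hypothetical config contents -- key names are illustrative only.
text = """
batch_size: 64
learning_rate: 0.001
epochs: 10
"""
config = yaml.safe_load(text)
print(config["batch_size"])  # 64
```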
-
Visualize loss and accuracy
python helper.py plot <path/to/your/statistic/result.pkl>
Note: when you run it in a shell, you should enable an X server first.
-
Find confusion matrix
python helper.py confusion-mtx --name <model-name> --model-weight <model/weight/path.pth> \
    --split-ratio 0.2 <model/configuration/path.yaml>
The --split-ratio flag is not needed when visualizing the TextLSTM alone.
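For reference, a binary confusion matrix can be computed independently with scikit-learn (this is not necessarily how helper.py computes it); rows are true classes, columns are predicted classes, and the labels below are illustrative:

```python
from sklearn.metrics import confusion_matrix

# Illustrative labels, not the project's predictions.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
cm = confusion_matrix(y_true, y_pred)
print(cm)  # [[2 0]
           #  [1 2]]
```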