Group 16 (ordered alphabetically)
- Bian WU [Email] [GitHub]
- Lingzhi CAI [Email] [GitHub]
- Shenggui LI [Email] [GitHub]
- Yanxi ZENG [Email] [GitHub]
- Yuanming LI [Email] [GitHub]
Please make sure you have Conda installed and at least one CUDA device available.
# install Conda env
conda env create -f environment.yml
# activate Conda env
conda activate ntu-dm
# install PyTorch with your specific CUDA version
conda install pytorch cudatoolkit=<your cuda version> -c pytorch -y
# create data folders
mkdir -p {data,output}
# set PYTHONPATH before running
export PYTHONPATH="${PWD}"
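Once the environment is set up, you can sanity-check that PyTorch actually sees your CUDA device (a quick check, not part of the original setup steps):

```python
import torch

# Prints True (and a device name) if the cudatoolkit version you picked
# during the conda install matches your driver.
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```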
-
Download dataset
We are using the Yelp dataset provided by Kaggle. The dataset contains 5 JSON files, 8 GB after unzipping. Please download the data from Yelp or Kaggle, move it into the data folder, and unzip it:
mv yelp-dataset.zip data/
cd data
unzip yelp-dataset.zip
cd -
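Each file in the Yelp dataset is in JSON Lines format (one JSON object per line), so you can inspect a single record without loading a whole file. A minimal sketch, using a made-up record in place of a real line:

```python
import json

# A made-up record standing in for one line of a real file such as
# data/yelp_academic_dataset_review.json (exact filename may differ by release).
sample_line = '{"review_id": "abc123", "stars": 4.0, "useful": 2}'
record = json.loads(sample_line)
print(sorted(record.keys()))  # ['review_id', 'stars', 'useful']
```

In practice you would read the first line of a file (e.g. `json.loads(next(open(path)))`) to see its fields.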
-
Download nltk model
python -c "import nltk; nltk.download('stopwords'); nltk.download('punkt')"
-
Download pre-trained weights
Download the weights and data, then move and unzip them:
mv 4032-data.zip data/
cd data
unzip 4032-data.zip
cd -
You should seek help if you do not have a high-performance computer with more than 40 GB of memory, as creating the database from the Yelp dataset involves processing large data frames. If you do not have access to such resources, please drop us an email to request the Linux-built SQLite database file. The reason we do not provide a processed file via Google Drive is the following clauses from Section 4 of the Yelp Dataset Terms of Use:
A. display, perform, or distribute any of the Data, or use the Data to update or create your own business listing information (i.e. you may not publicly display any of the Data to any third party, especially reviews and other user generated content, as this is a private data set challenge and not a license to compete with or disparage with Yelp);
...
E. create, redistribute or disclose any summary of, or metrics related to, the Data (e.g., the number of reviewed business included in the Data and other statistical analysis) to any third party or on any website or other electronic media not expressly covered by this Agreement, this provision however, excludes any disclosures necessary for academic purposes, including without limitation the publication of academic articles concerning your use of the Data;
...
H. rent, lease, sell, transfer, assign, or sublicense, any part of the Data;
...
I. modify, rate, rank, review, vote or comment on, or otherwise respond to the content contained in the Data;
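If memory is tight, one common mitigation is to stream the JSON Lines files in chunks with pandas rather than loading them whole. This is a sketch only (the file contents and chunk size below are illustrative, not from the repo; the real files would warrant chunks of ~100,000 rows):

```python
import json
import os
import tempfile

import pandas as pd

# Write a tiny stand-in for one of the Yelp JSON Lines files.
path = os.path.join(tempfile.mkdtemp(), "review.json")
with open(path, "w") as f:
    for i in range(5):
        f.write(json.dumps({"review_id": str(i), "useful": i}) + "\n")

# Stream the file chunk by chunk instead of loading it all at once.
total = 0
for chunk in pd.read_json(path, lines=True, chunksize=2):
    total += len(chunk)  # each chunk is a small DataFrame, discarded after use
print(total)  # 5
```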
# create SQLite database
python scripts/create_tables.py
# load data into the database (slow)
python scripts/load_data.py
# process dataset (very slow)
python scripts/process_dataset.py
# pretrain model
python scripts/pretrain-model.py
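After create_tables.py and load_data.py finish, a quick way to sanity-check the result is to list the tables in the SQLite file. The database path below is an assumption; check scripts/create_tables.py for the actual location:

```python
import os
import sqlite3

# List the tables present in the SQLite database (path is an assumption).
os.makedirs("data", exist_ok=True)
con = sqlite3.connect("data/yelp.db")
tables = [row[0] for row in
          con.execute("SELECT name FROM sqlite_master WHERE type='table'")]
print(tables)
con.close()
```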
Note: Your output files are all in output/
Note: SVM and XGBoost may take a very long time during prediction and testing.
-
Train XGBoost model for predicting usefulness
python train-statistical-learning-models.py xgboost
-
Train SVM model for predicting usefulness
python train-statistical-learning-models.py svm
-
Train Logistic Regression model for predicting usefulness
python train-statistical-learning-models.py logistic
-
Predict summary report
python helper.py pred-statistical <path/to/saved/model.pkl>
-
Plot ROC graph
python helper.py plot-roc <path/to/saved/model.pkl>
Note: when you run it in a shell, you should enable an X server first.
-
Train user elite classification
python train-user-elite.py
You can see the script's arguments with:
python train-user-elite.py -h
-
Train LSTM usefulness classification
python train_text_lstm.py
You can see the script's arguments with:
python train_text_lstm.py -h
-
Train multimodal classifier using the pretrained LSTM and user-elite models
For the pretrained TextLSTM model, you need to put mapping.pickle, pretrained_weights.npy, and useful_pred_lstm_weights.pth in the ./data folder.
python train-multimodal-classifier.py
You may want to change the configuration by supplying another configuration file:
python train-multimodal-classifier.py --config=<path/to/config.yaml>
You can see the script's arguments with:
python train-multimodal-classifier.py -h
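A custom config.yaml can be parsed with PyYAML; the keys below are hypothetical stand-ins (the real ones live in the repo's default configuration file):

```python
import yaml  # PyYAML

# Hypothetical config contents -- key names are illustrative only.
text = """
batch_size: 64
learning_rate: 0.001
epochs: 10
"""
config = yaml.safe_load(text)
print(config["batch_size"])  # 64
```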
-
Visualize loss and accuracy
python helper.py plot <path/to/your/statistic/result.pkl>
Note: when you run it in a shell, you should enable an X server first.
-
Find confusion matrix
python helper.py confusion-mtx --name <model-name> --model-weight <model/weight/path.pth> \
    --split-ratio 0.2 <model/configuration/path.yaml>
The --split-ratio flag is not needed when visualizing the TextLSTM alone.
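For reference, a binary confusion matrix can be computed independently with scikit-learn (this is not necessarily how helper.py computes it); rows are true classes, columns are predicted classes, and the labels below are illustrative:

```python
from sklearn.metrics import confusion_matrix

# Illustrative labels, not the project's predictions.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
cm = confusion_matrix(y_true, y_pred)
print(cm)  # [[2 0]
           #  [1 2]]
```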