CMPS242Project

This is the class project for CMPS 242 using data from the Yelp Dataset Challenge. Currently we only support python 2, and anything less than 8GB of ram will crash.

Preprocessing

Tools we need:

pandas

sudo pip install pandas

nltk

sudo pip install nltk

download packages in nltk

>>> import nltk
>>> nltk.download()

scikit-learn for comparison

sudo pip install -U scikit-learn

Data we need:

yelp_academic_dataset_business.json
yelp_academic_dataset_review.json

First use json_to_csv_converter.py from this repository to convert the json files into csv format (yelp_academic_dataset_business.json and yelp_academic_dataset_review.json to yelp_academic_dataset_business.csv and yelp_academic_dataset_review.csv).

Use these two commands on the shell.

python json_to_csv_converter.py yelp_academic_dataset_business.json
python json_to_csv_converter.py yelp_academic_dataset_review.json

Put these two files into 'data' directory. Then run Preprocess.py to generate pickled feature files. Here we randomly sample 1% of the dataset, since processing the entire dataset would take too much time. Optionally give flags (-u, -b, -l, -a, -t) to select the features to use. Run the following command to see detailed messages about the options.

python Preprocess.py -h

For instance, run the command below to generate feature files using unigrams, LIWC scores, and TF-IDF frequency:

python Preprocess.py -u -l -t

Then find all the features file will be put into the directory

jar_of_/pickle-l-t-u

##Modeling and Prediction Run the file Process.py with required keyword arguments (-c, -d) to train a model and predict. Run the following command to detailed messages about the required arguments.

python Process.py -h

For instance, run the command below to build a Naive Bayes classifier trained on the features selected above (unigrams, LIWC scores, and TF-IDF frequency).

python Process.py -c nb -d pickle-l-t-u

The prediction result will be printed to sys.stdout as follows, showing the accuracy, precision, recall, and F1 score:

Recall = number of results returned Precision = number of correct results returned Fscore = weighted average of the precision and recall Accuracy = correctness with respect to the anotatted data

=====================================================
Results:
  Accuracy:     0.752411455812
  Precision:    0.793605698051
  Recall:       0.930122403039
  F1 score:     0.856458090426
=====================================================

word_category_counter is a LIWC simulator script created and distributed within the NLDS lab. It uses the LIWC.dic data file to simulate the functionality of LIWC.

robert501128/CMPS242Project

CMPS242Project

Preprocessing