How to generate the Yelp dataset from the original Yelp dataset

Hi, I am using the Yelp dataset from Kaggle (https://www.kaggle.com/yelp-dataset/yelp-dataset/version/6?select=yelp_review.csv) and have a question about how you created the training and validation datasets. I loaded the yelp_review.csv file, which has 5261668 reviews. I believe you joined the review and business csv files to create the yelp_data_100.csv file. However, the paper reports training, validation and test set sizes of 87375, 8737 and 37265, which is not consistent with the size of the original dataset. Could you explain how you created the yelp_data_100.csv file from the yelp_review.csv file? If I am not using the correct version of the Yelp dataset, could you let me know which version you used for your experiments? Thanks a lot!

Hi,

Sorry for the delayed reply. It looks like Yelp has switched to Kaggle for sharing the data? The version we used was Yelp 2018 Round 11. The paper appendix has more details on how we filtered the raw dataset:

we first removed categories that have fewer than 100 businesses and then businesses that have fewer than 5 reviews.
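
For anyone trying to reproduce this from the raw dump, here is a minimal sketch of the two filtering steps quoted above. It is not the actual preprocessing code from the repo. The function name `filter_yelp`, its arguments, and the input layout (a dict of category lists per business and a dict of review counts) are all hypothetical, and dropping businesses that end up with no remaining category is an assumption the paper text does not state.

```python
from collections import Counter


def filter_yelp(businesses, review_counts,
                min_businesses_per_category=100,
                min_reviews_per_business=5):
    """Hypothetical helper illustrating the two filtering steps.

    businesses:    dict mapping business_id -> list of category names
    review_counts: dict mapping business_id -> number of reviews
    Returns a dict business_id -> list of surviving category names.
    """
    # Step 1: count how many businesses carry each category and keep
    # only categories with at least `min_businesses_per_category`.
    category_sizes = Counter(
        cat for cats in businesses.values() for cat in set(cats)
    )
    kept_categories = {
        cat for cat, n in category_sizes.items()
        if n >= min_businesses_per_category
    }

    # Step 2: drop businesses with fewer than `min_reviews_per_business`
    # reviews, and restrict the rest to the surviving categories.
    filtered = {}
    for business_id, cats in businesses.items():
        if review_counts.get(business_id, 0) < min_reviews_per_business:
            continue
        kept = [cat for cat in cats if cat in kept_categories]
        if kept:  # assumption: a business with no label left is dropped
            filtered[business_id] = kept
    return filtered


# Toy usage with a lowered category threshold so the effect is visible.
toy_businesses = {"b1": ["Food", "Bars"], "b2": ["Food"], "b3": ["Spas"]}
toy_review_counts = {"b1": 6, "b2": 2, "b3": 9}
toy_result = filter_yelp(toy_businesses, toy_review_counts,
                         min_businesses_per_category=2,
                         min_reviews_per_business=5)
print(toy_result)
```

With the defaults (100 and 5) this mirrors the thresholds in the quoted sentence. The resulting counts will still depend on which Yelp release is used as input.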

I found in the code (https://github.com/morningmoni/HiLAP/blob/master/loadData.py#L249) that the min review is 1 and the max review is 10. Is this what you used for the experiments? I want to stay consistent so that I can compare models.

If you look at main.py, the parameters are set there:

HiLAP/main.py, line 620 in 0496125:

           X_train, X_test, train_ids, test_ids, id2doc, wv, word_index, nodes = load_data_yelp('../datasets/glove.6B.50d.txt',