This project was done for the ODS NLP course. A new IMDb reviews dataset was collected in a similar way to the original one, and several models were trained to test performance.
The new IMDb review dataset was collected in three steps:
- data/imdb_download.py downloads all movies from 2012 to 2024 into a directory
- data/imdb_reviews_download.py downloads all reviews for every movie
- Finally, the dataset is created in a Jupyter notebook: labels are assigned, neutral reviews are removed, movies with too many reviews are undersampled, and the dataset is balanced so that it contains an equal number of positive and negative reviews.
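The notebook step can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the record keys (`movie_id`, `rating`, `text`) are hypothetical, and the thresholds (rating <= 4 is negative, rating >= 7 is positive, at most 30 reviews per movie) are borrowed from the original IMDb dataset on the assumption that this project follows the same rules.

```python
import random
from collections import defaultdict

# Assumed thresholds, taken from the original IMDb dataset; the notebook may use others.
NEG_MAX_RATING = 4          # rating <= 4 -> negative
POS_MIN_RATING = 7          # rating >= 7 -> positive
MAX_REVIEWS_PER_MOVIE = 30  # cap on reviews kept per movie


def build_dataset(reviews, seed=0):
    """Build a balanced dataset from dicts with hypothetical keys
    'movie_id', 'rating' and 'text'."""
    rng = random.Random(seed)

    # 1. Assign labels and drop neutral reviews.
    by_movie = defaultdict(list)
    for review in reviews:
        if review["rating"] <= NEG_MAX_RATING:
            label = 0
        elif review["rating"] >= POS_MIN_RATING:
            label = 1
        else:
            continue  # neutral review, removed
        by_movie[review["movie_id"]].append(
            {"movie_id": review["movie_id"], "text": review["text"], "label": label}
        )

    # 2. Undersample movies that have too many reviews.
    kept = []
    for items in by_movie.values():
        if len(items) > MAX_REVIEWS_PER_MOVIE:
            items = rng.sample(items, MAX_REVIEWS_PER_MOVIE)
        kept.extend(items)

    # 3. Balance classes by undersampling the larger one.
    positive = [item for item in kept if item["label"] == 1]
    negative = [item for item in kept if item["label"] == 0]
    n = min(len(positive), len(negative))
    balanced = rng.sample(positive, n) + rng.sample(negative, n)
    rng.shuffle(balanced)
    return balanced
```

Fixing the random seed keeps the sampled subset reproducible between runs.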
There are two main models in this repository: a tf-idf logistic regression model and a Hugging Face based one. To train a model, first create a config.
An example config for the tf-idf model: an example config
To launch training:
python train_tfidf.py --config config.json
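For orientation, here is a minimal sketch of how a config could drive a tf-idf logistic regression model with scikit-learn. The config keys (`ngram_range`, `min_df`, `C`, `max_iter`) and the helper `build_tfidf_model` are hypothetical; the keys that train_tfidf.py actually expects may differ.

```python
import json

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical config; not the repository's real schema.
config = json.loads("""
{
  "ngram_range": [1, 2],
  "min_df": 1,
  "C": 1.0,
  "max_iter": 1000
}
""")


def build_tfidf_model(cfg):
    """Build a tf-idf + logistic regression pipeline from a config dict."""
    vectorizer = TfidfVectorizer(
        ngram_range=tuple(cfg["ngram_range"]),  # JSON lists become tuples
        min_df=cfg["min_df"],
    )
    classifier = LogisticRegression(C=cfg["C"], max_iter=cfg["max_iter"])
    return make_pipeline(vectorizer, classifier)


# Toy data for illustration only: 1 = positive, 0 = negative.
texts = [
    "a great and wonderful film",
    "wonderful acting, great story",
    "a terrible and boring film",
    "boring plot, terrible acting",
]
labels = [1, 1, 0, 0]

model = build_tfidf_model(config)
model.fit(texts, labels)
predictions = model.predict(["what a great story"])
```

Wrapping both stages in one pipeline means the vectorizer is fitted only on the training texts and reused unchanged at prediction time.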
The Hugging Face model can be trained similarly: an example config
To launch training:
CUDA_VISIBLE_DEVICES=0 python trainer_transformer.py --config config.json
The code is based on this repository: https://github.com/e0xextazy/nlp_huawei_new2_task
Original IMDb movie reviews dataset: https://ai.stanford.edu/~amaas/data/sentiment/