Large Movie Review Dataset v1.0
This dataset contains movie reviews along with their associated binary sentiment polarity labels. It is intended to serve as a benchmark for sentiment classification. This document outlines how the dataset was gathered, and how to use the files provided.
The core dataset contains 25,000 reviews split into train, development and test sets. The overall distribution of labels is roughly balanced.
Each folder contains files with negative (neg) and positive (reviews). One review per line.
@InProceedings{maas-EtAl:2011:ACL-HLT2011, author = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher}, title = {Learning Word Vectors for Sentiment Analysis}, booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies}, month = {June}, year = {2011}, address = {Portland, Oregon, USA}, publisher = {Association for Computational Linguistics}, pages = {142--150}, url = { http://www.aclweb.org/anthology/P11-1015 }
IMDb.py script requires python 2.7+. After printing the results of its performance on the test set, the model will ask the user to provide a new movie review. It will detect if this new review is positive or negative by returning 1 or 0 respectively.
- pandas
- TfidfVectorizer from sklearn.feature_extraction.text
- LogisticRegression from sklearn.linear_model
- classification_report from sklearn.metrics