The dataset contains 5331 positive and 5331 negative reviews in a file. This project provides a classifier that labels each incoming review as positive or negative based on its content.
- Load the dataset and create a dataframe.
- Define a function that performs the following cleaning steps:
  - Remove non-alphabetic characters
  - Remove URLs
  - Remove digits
  - Remove stopwords
  - Stem the text using PorterStemmer
  - Replace “’”, “--”, “-”, “[”, “]” with “ ”.
- Create a list of the 30 most frequently occurring words in the cleaned reviews and write it to 'nlargest.txt'.
- Create a train (67%) and test (33%) split with random state 42
- Create a TF-IDF vectorizer with the following parameters:
  - ngram_range = (1,2)
  - max_df=0.3
  - min_df=7
- Build a Random Forest Classifier
- After building the classification model and predicting on the whole dataset, save the confusion matrix to a text file using:
  - confusion_matrix(observed,predicted).tofile('cfmatrix.txt',sep=',')
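The cleaning and word-frequency steps above could be sketched roughly as follows. This is only an illustration, not the project's actual code: the names `clean_review`, `top_words` and `write_top_words` are hypothetical, the lowercasing step and the order of the steps are assumptions, and the tiny `STOPWORDS` set is a placeholder for a full list such as `nltk.corpus.stopwords.words("english")`. Stemming uses NLTK's `PorterStemmer` when NLTK is installed and is skipped otherwise.

```python
import re
from collections import Counter

try:
    from nltk.stem import PorterStemmer
    stem = PorterStemmer().stem
except ImportError:
    # NLTK is not installed: leave words unstemmed.
    def stem(word):
        return word

# Placeholder stopword list (an assumption); a full list such as
# nltk.corpus.stopwords.words("english") would normally be used.
STOPWORDS = {"a", "an", "and", "the", "is", "it", "of", "to", "this", "was", "in"}

URL_RE = re.compile(r"(?:https?://|www\.)\S+")
# Matches ’ - [ ] individually; "--" is covered as two "-" characters.
PUNCT_RE = re.compile(r"[’\-\[\]]")
# Anything that is not an ASCII letter or whitespace (this includes digits).
NON_ALPHA_RE = re.compile(r"[^a-zA-Z\s]")


def clean_review(text):
    """Return a cleaned, stemmed version of one review."""
    text = URL_RE.sub(" ", text)       # remove URLs before other stripping
    text = PUNCT_RE.sub(" ", text)     # replace the listed symbols with a space
    text = NON_ALPHA_RE.sub("", text)  # drop digits and other non-letters
    # Lowercasing is an assumption so that stopword matching is case-insensitive.
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    return " ".join(stem(w) for w in words)


def top_words(cleaned_reviews, n=30):
    """Return the n most frequent words across the cleaned reviews."""
    counts = Counter(w for review in cleaned_reviews for w in review.split())
    return [word for word, _ in counts.most_common(n)]


def write_top_words(cleaned_reviews, path="nlargest.txt", n=30):
    """Write the n most frequent words to a text file, one per line."""
    words = top_words(cleaned_reviews, n)
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(words))
    return words
```

URLs are removed first so that they are not broken into stray letter fragments by the later character stripping. The one-word-per-line layout of 'nlargest.txt' is also an assumption, since the expected file format is not specified.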
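The split, vectorization, and modelling steps above might look like the sketch below, using scikit-learn. The function name `train_and_evaluate` and its `matrix_path` parameter are hypothetical. Fitting the vectorizer on the training split only and passing `random_state=42` to the classifier are assumptions; the steps above fix the random state for the split only.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split


def train_and_evaluate(texts, labels, matrix_path="cfmatrix.txt"):
    """Train a random forest on TF-IDF features and save the confusion matrix."""
    # 67% train / 33% test split with a fixed random state.
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.33, random_state=42
    )

    # TF-IDF over unigrams and bigrams, ignoring terms found in more than
    # 30% of documents or in fewer than 7 documents.
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_df=0.3, min_df=7)
    X_train_vec = vectorizer.fit_transform(X_train)  # fit on train only (assumption)

    # random_state here is an assumption, added for reproducibility.
    model = RandomForestClassifier(random_state=42)
    model.fit(X_train_vec, y_train)

    # Predict on the whole dataset, as the task list asks.
    predicted = model.predict(vectorizer.transform(texts))
    cm = confusion_matrix(labels, predicted)
    cm.tofile(matrix_path, sep=",")  # comma-separated text, row-major order
    return model, vectorizer, cm
```

Because the matrix is computed over the whole dataset, it includes the training rows and will look better than held-out performance. `X_test` and `y_test` are returned by the split but unused here; they are available for a separate held-out evaluation.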
Reviews have been saved in separate files with the extensions ‘pos’ and ‘neg’.
Approx. 10662 records:
- Positive - 5331
- Negative - 5331
- /dataset/ReviewsFileName.xlsx
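As a rough sketch of the loading step, the two review files could be read into a single pandas dataframe as shown below. This assumes each file holds one review per line, which the description does not confirm. The function name `load_reviews`, the column names `review` and `label`, the 1/0 label encoding, and the `encoding` default are all hypothetical.

```python
from pathlib import Path

import pandas as pd


def load_reviews(pos_path, neg_path, encoding="utf-8"):
    """Read positive and negative review files into one labelled dataframe."""
    frames = []
    for path, label in ((pos_path, 1), (neg_path, 0)):
        lines = Path(path).read_text(encoding=encoding).splitlines()
        # Assumption: one review per line; blank lines are skipped.
        reviews = [line.strip() for line in lines if line.strip()]
        frames.append(pd.DataFrame({"review": reviews, "label": label}))
    # Stack both classes and renumber the index from 0.
    return pd.concat(frames, ignore_index=True)
```

If the reviews are instead read from the spreadsheet path listed above, `pandas.read_excel` would be the natural replacement, with column names adapted to the actual file.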
The number of reviews on the movie database website is growing rapidly, so it is becoming almost impossible for admins to read each review and gauge patrons' sentiment about the movies. This project will save over 90 man-hours per month.