Movie Review Classification Using a Random Forest Classifier

The dataset contains 5331 positive and 5331 negative reviews in a file. This project classifies each review as positive or negative based on its content, providing a classifier that flags every incoming review as positive or negative.

Tasks Performed:

Load the dataset and create a dataframe.
Define a function that performs the following cleaning steps:

Remove non-alphabetic characters
Remove URLs
Remove digits
Remove stopwords
Stem the texts using PorterStemmer
Replace “’”, “--”, “-”, “[”, “]” with “ ”.
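
The cleaning steps above could be combined along these lines. This is a minimal sketch, not the project's actual function: the name `clean_review`, the lowercasing step, the URL pattern, and the order of operations are all assumptions. URLs are stripped first so that their letters do not survive the non-alphabetic filter. The stemmer and stopword list are passed in as arguments; with NLTK installed, `PorterStemmer().stem` and `set(stopwords.words('english'))` would be the natural values to pass.

```python
import re

# Assumed patterns; the project may define "URL" differently.
URL_RE = re.compile(r"(https?://\S+|www\.\S+)")
PUNCT_RE = re.compile(r"(--|[’\-\[\]])")   # the characters listed above
NON_ALPHA_RE = re.compile(r"[^a-z\s]")     # also covers digits


def clean_review(text, stop_words=frozenset(), stem=lambda word: word):
    """Return a cleaned, space-joined version of one review.

    stop_words and stem are hypothetical parameters: pass NLTK's English
    stopword set and PorterStemmer().stem to match the listed steps.
    """
    text = text.lower()                    # assumption: case-insensitive
    text = URL_RE.sub(" ", text)           # remove URLs before anything else
    text = PUNCT_RE.sub(" ", text)         # replace listed punctuation by " "
    text = NON_ALPHA_RE.sub(" ", text)     # drop digits and other non-letters
    tokens = [stem(tok) for tok in text.split() if tok not in stop_words]
    return " ".join(tokens)
```

Called with a small stopword set and no stemmer, `clean_review("See http://example.com [The] plot -- 2 thumbs-up!", stop_words={"the"})` yields `"see plot thumbs up"`.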

Create a list of 30 most frequently occurring words from cleaned reviews and write it to 'nlargest.txt'.
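
One way to produce the frequency list is sketched below using `collections.Counter`. The function names and the output layout (one word per line) are assumptions, since the expected format of 'nlargest.txt' is not specified here; the file name suggests the original may have used pandas' `nlargest` instead.

```python
from collections import Counter


def top_words(cleaned_reviews, n=30):
    """Return the n most frequent words as (word, count) pairs."""
    counts = Counter()
    for review in cleaned_reviews:
        counts.update(review.split())   # reviews are already cleaned
    return counts.most_common(n)


def write_top_words(cleaned_reviews, path="nlargest.txt", n=30):
    """Write the n most frequent words to path, one per line (assumed layout)."""
    with open(path, "w", encoding="utf-8") as handle:
        for word, _count in top_words(cleaned_reviews, n):
            handle.write(f"{word}\n")
```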
Create a train (67%) and test (33%) split with random state 42
Create a TF-IDF vectorizer with the following parameters:

ngram_range = (1,2)
max_df=0.3
min_df=7

Build a Random Forest Classifier
After building the classification model and predicting on the whole dataset, save the confusion matrix to a text file using:

confusion_matrix(observed,predicted).tofile('cfmatrix.txt',sep=',')
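
The split, vectorization, training, and evaluation steps could fit together roughly as follows, using scikit-learn. This is a sketch under assumptions: the function name `train_and_evaluate`, its arguments, and the Random Forest settings (library defaults with `random_state=42`) are not taken from the project. The vectorizer is fitted on the training split only and then applied to every review before the confusion matrix is computed.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split


def train_and_evaluate(texts, labels, out_path=None):
    """Train TF-IDF + Random Forest; return (confusion matrix, test accuracy)."""
    # 67% train / 33% test with the random state from the task list.
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.33, random_state=42
    )

    # TF-IDF parameters as listed above; fitted on the training split only.
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_df=0.3, min_df=7)
    train_features = vectorizer.fit_transform(x_train)

    # Assumption: default hyperparameters, seeded for reproducibility.
    model = RandomForestClassifier(random_state=42)
    model.fit(train_features, y_train)
    test_accuracy = model.score(vectorizer.transform(x_test), y_test)

    # Predict on the whole dataset, as the task list describes.
    predicted = model.predict(vectorizer.transform(texts))
    matrix = confusion_matrix(labels, predicted)
    if out_path is not None:
        matrix.tofile(out_path, sep=",")   # e.g. 'cfmatrix.txt'
    return matrix, test_accuracy
```

Passing `out_path='cfmatrix.txt'` writes the four matrix entries as one comma-separated line, which is how `tofile` behaves when `sep` is set.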

Considerations:

Reviews have been saved in separate files with the extensions ‘pos’ and ‘neg’.

Data volume

10662 records in total

Positive - 5331
Negative - 5331

Dataset Path:

/dataset/ReviewsFileName.xlsx

Business benefits: