sentiment-analysis

Comparative analysis of different Bag-of-Words encoding schemes for sentiment analysis using a neural network.

Is this movie worth watching?

Sentiment Analysis: a Bag-of-Words approach

What Is It?

Sentiment analysis is the use of natural language processing (NLP) techniques to study affective states and subjective information. It is widely used to summarize customer opinions and reviews for applications such as marketing, product improvement, and customer service. Movie reviews are analyzed in the same way, to summarize movie-goers' opinions of a film or to rate it overall. Each review is classified as positive or negative.

The aim of this project is to study the various modes of the bag-of-words model and to build a neural network that predicts the sentiment of movie reviews.

(back to top)

Summary

The required data is provided by Cornell University and can be downloaded directly from here (polarity dataset v2.0). The dataset contains both positive and negative movie reviews; each review is stored as a separate text document.
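
Assuming the zip folder has been extracted into Data/Raw (matching the directory structure below), loading the documents can be sketched as follows; the load_reviews helper is illustrative and not part of the repository:

from pathlib import Path

# Illustrative helper: read every review document and label it
# 0 (negative) or 1 (positive) based on the folder it lives in.
def load_reviews(root="Data/Raw"):
    texts, labels = [], []
    for label, folder in enumerate(["neg", "pos"]):
        for path in sorted(Path(root, folder).glob("*.txt")):
            texts.append(path.read_text(encoding="utf-8", errors="ignore"))
            labels.append(label)
    return texts, labels

texts, labels = load_reviews()
print(len(texts), "reviews loaded")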

To achieve this aim, the review documents are first analyzed to identify preprocessing steps that help clean them (data_analysis.ipynb); the preprocessing is implemented from scratch rather than relying on preprocessing libraries. A vocabulary file is then generated from the training documents (helper_vocab.py). Comparing the four text encoding schemes of the bag-of-words model, i.e. binary, count, tfidf, and freq (frequency), shows that the binary scheme achieves the highest accuracy, 92.22% (comparative_analysis.ipynb). The final model is therefore trained with the binary encoding scheme and then fine-tuned (training_final_model.ipynb).
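
The four encoding schemes correspond to the mode argument of Keras's Tokenizer.texts_to_matrix, so the comparison can be set up roughly as sketched below; the network architecture and epoch count here are illustrative assumptions, not values taken from the notebooks:

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense

texts, labels = load_reviews()        # helper from the loading sketch above
y = np.array(labels)

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)

for mode in ("binary", "count", "tfidf", "freq"):
    X = tokenizer.texts_to_matrix(texts, mode=mode)   # one row per document
    model = Sequential([
        Input(shape=(X.shape[1],)),
        Dense(50, activation="relu"),
        Dense(1, activation="sigmoid"),               # P(review is positive)
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    history = model.fit(X, y, epochs=10, verbose=0)
    print(mode, history.history["accuracy"][-1])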

Accuracies obtained from training the network:

BOW Encoding Scheme                 Accuracy
binary                              0.9222
count                               0.8911
tfidf                               0.8704
freq                                0.8601
Final Model (binary + fine-tuned)   0.8849

NOTE: The final model reaches a training accuracy of 0.8849 and a validation accuracy of 0.8842.

NOTE: The models from the comparative analysis may be less robust than the final model because they were neither validated nor regularized.
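
A sketch of what the validated, regularized final model could look like; the dropout rate, layer size, and epoch count are assumptions, and the actual tuning choices are in training_final_model.ipynb:

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, Dropout

texts, labels = load_reviews()        # helper from the loading sketch above
y = np.array(labels)

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
X = tokenizer.texts_to_matrix(texts, mode="binary")   # best scheme from the table

final_model = Sequential([
    Input(shape=(X.shape[1],)),
    Dense(50, activation="relu"),
    Dropout(0.5),                     # assumed regularization
    Dense(1, activation="sigmoid"),
])
final_model.compile(optimizer="adam", loss="binary_crossentropy",
                    metrics=["accuracy"])
final_model.fit(X, y, epochs=20, validation_split=0.1, verbose=0)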

Limitations & Future Work:

  1. The project uses the entire available vocabulary, which leads to sparse matrices. Trying different vocabulary sizes is recommended (see the vocabulary sketch after this list).

  2. Every training document is used in full. Experiments with truncated reviews could be tried, as truncation reduces the matrix size and hence the sparsity.
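
A minimal sketch of the minimum-occurrence filtering suggested in item 1; the build_vocab helper and the threshold of 2 are illustrative, not taken from helper_vocab.py:

from collections import Counter

# Keep only words that appear at least min_occurrence times overall.
def build_vocab(texts, min_occurrence=2):
    counts = Counter()
    for text in texts:
        counts.update(text.split())   # assumes the documents are already cleaned
    return {word for word, n in counts.items() if n >= min_occurrence}

vocab = build_vocab(texts)            # texts from the loading sketch above
print(len(vocab), "words kept")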

(back to top)

Directory Structure

├── Data                           # Data files
│   ├── Raw                        # Raw files (zip folder)
│   │   ├── neg                    # Negative review files/documents
│   │   │   ├── ....
│   │   └── pos                    # Positive review files/documents
│   │       ├── ....
│   └── Vocab                      # Vocabulary files
│       ├── vocab_all_occ.txt      # Entire vocabulary obtained from the documents
│       └── vocab_min_occ.txt      # Vocabulary limited to words with a minimum occurrence count
├── Models                         # Saved trained models
│   ├── ....
├── comparative_analysis.ipynb     # Comparative analysis of all the bag-of-words modes
├── data_analysis.ipynb            # Analysis of the review documents
├── helper_analysis.py             # Python script for analysis
├── helper_vocab.py                # Python script for vocabulary processing/creation
├── predict_.py                    # Python script for predicting sentiments
└── training_final_model.ipynb     # Final model training/tuning

NOTE: The 'Raw' folder is compressed.

(back to top)

Language and Libraries

  • Language: Python
  • Libraries: NLTK, Keras (Tokenizer), WordCloud, Matplotlib, Seaborn, NumPy, Pandas, and the standard re and string modules.

(back to top)

Final Notes

To run the entire project, use JupyterLab or a similar IDE.

NOTE: The notebooks rely on the Python scripts to run.

To run the Python scripts:

$ python helper_analysis.py
$ python helper_vocab.py
$ python predict_.py
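
For reference, scoring a new review with a saved model might look like the sketch below; the model filename is hypothetical, and the tokenizer must be fitted on the same training documents as before (see predict_.py for the actual logic):

from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.text import Tokenizer

model = load_model("Models/final_model.h5")   # hypothetical filename; check the Models folder

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)                 # texts: the training documents loaded earlier

review = ["one of the best films of the year"]
x = tokenizer.texts_to_matrix(review, mode="binary")
print("positive" if model.predict(x)[0, 0] > 0.5 else "negative")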

(back to top)