This repository contains an implementation of a Naive Bayes classifier for sentiment analysis on the IMDB movie reviews dataset using the Scikit-learn and TensorFlow libraries.
To run this code, you need to have the following packages installed:
- NumPy
- TensorFlow (2.x)
- Scikit-learn
- NLTK
You can install these packages using pip:
pip install numpy tensorflow scikit-learn nltk
The IMDB movie reviews dataset consists of 50,000 movie reviews, with 25,000 for training and 25,000 for testing. Each review is labeled as either positive (1) or negative (0), indicating the sentiment of the reviewer.
The classifier is implemented using the Multinomial Naive Bayes algorithm from the Scikit-learn library. The reviews are converted to bag-of-words count vectors with Scikit-learn's CountVectorizer, and common English stopwords from the NLTK library are removed.
The classifier is trained on the training dataset and evaluated on the test dataset. The test accuracy is calculated and printed.
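The training and evaluation flow described above can be sketched on a toy corpus. This is a minimal illustration, not the repository's script: the training sentences, the test sentences, and the short stopword list are hypothetical stand-ins for the IMDB reviews and NLTK's English stopword list.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the IMDB reviews (1 = positive, 0 = negative).
train_texts = [
    "a great and wonderful movie",
    "great acting and a wonderful story",
    "a terrible and boring movie",
    "boring plot and terrible acting",
]
train_labels = [1, 1, 0, 0]
test_texts = ["wonderful great story", "terrible boring plot"]
test_labels = [1, 0]

# Hypothetical short stopword list; the project takes its list from NLTK.
stop_words = ["a", "and", "the"]

# Build bag-of-words count vectors, dropping the stopwords.
vectorizer = CountVectorizer(stop_words=stop_words)
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Train Multinomial Naive Bayes and score it on the held-out texts.
classifier = MultinomialNB()
classifier.fit(X_train, train_labels)
predictions = classifier.predict(X_test)
test_accuracy = accuracy_score(test_labels, predictions)
print(f"Test accuracy: {test_accuracy:.2f}")
```

On the real dataset the same calls apply, with the decoded reviews in place of the toy sentences.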
To train and evaluate the model, simply run the provided code in a Python environment with the required dependencies installed.
python imdb_naive_bayes.py
This will train the model and print the test accuracy upon completion.
- Decode the reviews using the IMDB dataset's word index.
- Merge the training and test datasets to create a combined dataset for preprocessing.
- Vectorize the reviews using CountVectorizer and remove common English stopwords using the NLTK library.
- Split the combined dataset back into training and testing sets.
- Import the necessary libraries (NumPy, TensorFlow, Scikit-learn, NLTK).
- Load the IMDB dataset and word index.
- Define a function to decode a review using the word index.
- Prepare the data by decoding the reviews, merging the datasets, and vectorizing the text.
- Train a Multinomial Naive Bayes classifier on the training dataset.
- Test the model on the test dataset and calculate the test accuracy.
- Print the test accuracy.
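The decoding step in the list above can be sketched in plain Python. This assumes the dataset is loaded through Keras with its default encoding, where word ranks are shifted by 3 and the indices 0, 1, and 2 are reserved for padding, start-of-sequence, and unknown words. The function name `decode_review` and the toy word index are hypothetical, chosen for illustration.

```python
# Hypothetical toy word index (word -> rank) standing in for the real one.
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}


def decode_review(encoded, word_index, index_from=3):
    """Turn a list of integer word indices back into a text string."""
    # Shift every rank by index_from to match the encoded sequences.
    reverse_index = {rank + index_from: word for word, rank in word_index.items()}
    words = []
    for index in encoded:
        if index in (0, 1):
            # Skip the padding (0) and start-of-sequence (1) markers.
            continue
        # Unknown words (index 2) and unmapped indices become "?".
        words.append(reverse_index.get(index, "?"))
    return " ".join(words)


decoded = decode_review([1, 4, 5, 6, 7, 2], toy_word_index)
print(decoded)
```

With the real data, the word index would come from the dataset loader instead of the toy dictionary.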
The model used for classification is the Multinomial Naive Bayes classifier from the Scikit-learn library (MultinomialNB). It is well suited to text classification because it models word counts directly and trains quickly on the large, sparse feature matrices that a bag-of-words representation produces.
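For reference, the decision rule of a Multinomial Naive Bayes classifier can be written as follows. This is the standard textbook formulation with additive smoothing, given here as a sketch and not taken from the repository's code. The symbols are: x_w is the count of word w in the review, N_cw is the total count of w in the training reviews of class c, N_c is the sum of N_cw over all words, |V| is the vocabulary size, and alpha is the smoothing parameter (Scikit-learn's MultinomialNB defaults to alpha = 1.0).

```latex
\hat{y} = \arg\max_{c \in \{0, 1\}}
  \left[ \log P(c) + \sum_{w \in V} x_w \log \theta_{cw} \right],
\qquad
\theta_{cw} = \frac{N_{cw} + \alpha}{N_c + \alpha \lvert V \rvert}
```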
Performance
Multinomial Naive Bayes is a simple, fast baseline for bag-of-words sentiment classification, so it is expected to perform well on the IMDB movie review dataset. The exact accuracy depends on the preprocessing applied, such as the stopword list and the resulting vocabulary; the script prints the measured test accuracy when it finishes.
The dataset is preprocessed using the following steps:
- Decoding the IMDB reviews from integer sequences back to text.
- Combining the training and testing sets to create a single dataset for preprocessing.
- Vectorizing the reviews using the CountVectorizer from the Scikit-learn library. This converts the text data into a bag-of-words representation.
- Removing English stopwords using the Natural Language Toolkit (NLTK).
- Splitting the combined dataset back into training and testing sets.
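The combine, vectorize, and split steps above can be sketched as follows. The toy texts and the short stopword list are hypothetical placeholders for the decoded IMDB reviews and NLTK's English stopword list; the sketch only shows how row slicing recovers the original split after a single vectorization pass.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical decoded reviews standing in for the two IMDB splits.
train_texts = ["the great movie", "the terrible plot"]
test_texts = ["the great plot", "the boring movie"]

# Hypothetical short stopword list; the project takes its list from NLTK.
stop_words = ["the", "and"]

# Combine both splits so they share one vocabulary and column order.
combined_texts = train_texts + test_texts
vectorizer = CountVectorizer(stop_words=stop_words)
X_combined = vectorizer.fit_transform(combined_texts)

# Split the sparse matrix back by row position.
n_train = len(train_texts)
X_train = X_combined[:n_train]
X_test = X_combined[n_train:]

print(X_train.shape, X_test.shape)
```

Fitting on the combined texts gives both splits the same columns. A common alternative is to fit the vectorizer on the training texts only and call `transform` on the test texts, which keeps test vocabulary out of the fitting step.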
This project is licensed under the MIT License. You are free to use, modify, and distribute the code as long as the original copyright and permission notice are included.
If you have any suggestions or improvements for this project, feel free to contribute. You can fork the repository, make your changes, and submit a pull request. We appreciate your contributions and will review them as soon as possible.
- Fork the repository on GitHub.
- Clone your fork of the repository:
git clone https://github.com/<your-username>/IMDBSentimentAnalysis.git
- Create a new branch for your changes:
git checkout -b my-feature-branch
- Make your changes to the code or documentation.
- Commit your changes:
git commit -am 'Add my new feature'
- Push your changes to your fork:
git push origin my-feature-branch
- Create a new pull request on the original repository.
For any issues or questions, please open an issue on the GitHub repository.
The dataset used in this project is provided by the TensorFlow library and is originally from the IMDB Movie Review Dataset. The implementation is based on the Multinomial Naive Bayes classifier from the Scikit-learn library and preprocessing techniques from the NLTK library.
Daniel LaForce - https://github.com/KeyArgo
Please feel free to reach out with any questions, suggestions, or feedback.
- Fork the project.
- Create your feature branch:
git checkout -b feature/AmazingFeature
- Commit your changes:
git commit -m 'Add some AmazingFeature'
- Push to the branch:
git push origin feature/AmazingFeature
- Open a pull request.
Distributed under the MIT License. See LICENSE.txt for more information.
Daniel LaForce - danlaforce3@gmail.com
Project Link: https://github.com/KeyArgo/IMDBSentimentAnalysis