IMDB Movie Reviews Sentiment Analysis using LSTM (Long Short-Term Memory) Networks
This project performs sentiment analysis by classifying an IMDB movie review as positive (1) or negative (0).
Description of the Dataset
The core dataset contains 50,000 reviews split evenly into 25,000 training and 25,000 test reviews. The overall distribution of labels is balanced. You can learn more about this dataset at http://ai.stanford.edu/~amaas/data/sentiment/
Features
- Review: IMDB movie review text.
- Sentiment: 1 (positive), 0 (negative)
Dependencies
- python - Programming Language
- tensorflow - TensorFlow is an open-source machine learning library for research and production
- keras - Keras is a high-level neural networks API
- sklearn - Scikit-learn is a free software machine learning library for Python
- numpy - NumPy is the fundamental package for scientific computing
- pandas - Pandas is a software library used for data manipulation and analysis
Steps for Applying the Machine Learning Algorithm
Step 1. Get data using get_reviews_data
- loops through the review text files, reads the reviews into a list, cleans them, and stores them in a pandas DataFrame
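As a rough sketch of this step, the loader below assumes the directory layout of the extracted aclImdb archive (a `pos` and a `neg` folder under each of `train` and `test`, one review per .txt file). The name `load_reviews` and the cleaning rules are illustrative assumptions, not the project's actual get_reviews_data.

```python
import re
from pathlib import Path


# Hypothetical helper; the project's real get_reviews_data may differ.
def load_reviews(split_dir):
    """Read reviews from <split_dir>/pos and <split_dir>/neg into a list of rows."""
    rows = []
    for label_name, label in (("pos", 1), ("neg", 0)):
        for path in sorted(Path(split_dir, label_name).glob("*.txt")):
            text = path.read_text(encoding="utf-8")
            text = re.sub(r"<br\s*/?>", " ", text)    # drop HTML line breaks
            text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
            rows.append({"Review": text, "Sentiment": label})
    return rows
```

Passing the returned rows to pandas.DataFrame would give a frame with the Review and Sentiment columns described under Features.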
Step 2. Prepare the data for machine learning using preprocess_text_data
- tokenizes the text data
- creates a padded sequence of numbers representing the textual review data
- returns a numpy array
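To make the tokenize-then-pad idea concrete, here is a small pure-Python sketch. It is not the Keras implementation: `build_word_index`, `texts_to_sequences` and `pad` are hypothetical helpers that mimic the usual behaviour, assuming word indices start at 1 in order of descending frequency, 0 is the padding value, and sequences are padded and truncated at the front.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to an integer, most frequent first; 0 is kept for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common keeps first-seen order among words with equal counts
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(texts, word_index):
    """Replace each known word by its index; unknown words are skipped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]


def pad(sequences, maxlen):
    """Left-pad with zeros, keeping only the last maxlen items of long rows."""
    return [[0] * (maxlen - len(seq[-maxlen:])) + seq[-maxlen:] for seq in sequences]


reviews = ["a great great movie", "a dull movie"]  # toy data for illustration
word_index = build_word_index(reviews)
padded = pad(texts_to_sequences(reviews, word_index), maxlen=5)
print(word_index)
print(padded)
```

Every row of `padded` has the same length, so the nested list can be converted directly into a 2-D numpy array.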
Step 3. Create a baseline LSTM (Long Short-Term Memory) neural network using the create_baseline_model method
- Be careful when designing the model: input_dim of the Embedding layer must be greater than the size of the input vocabulary, since index 0 is reserved and never assigned to a word. The size of the vocabulary can be obtained with print(len(tokenizer.word_index)).
- The output layer is a sigmoid layer consisting of 1 neuron (for binary classification). It outputs a value between 0 and 1 that can be read as the probability of the positive class.
- compile the model with a loss function, an optimizer, and a metric. Here the loss function is binary_crossentropy, since this is a binary classification task; the optimizer is adam and the metric is accuracy.
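One possible shape for such a baseline, written against the tf.keras API, is sketched below. The layer sizes (`embedding_dim=32`, `lstm_units=64`) and the helper `embedding_input_dim` are assumptions for illustration; the project's own create_baseline_model may be built differently. TensorFlow is imported inside the function so that the small helper can be used on its own.

```python
def embedding_input_dim(word_index):
    """Embedding rows needed: one per word plus one for the reserved index 0."""
    return len(word_index) + 1


def create_baseline_model(word_index, embedding_dim=32, lstm_units=64, dropout=0.2):
    """Embedding -> LSTM -> single sigmoid unit, compiled for binary classification."""
    from tensorflow import keras  # lazy import; assumes TensorFlow 2.x is installed

    model = keras.Sequential([
        keras.layers.Embedding(input_dim=embedding_input_dim(word_index),
                               output_dim=embedding_dim),
        keras.layers.LSTM(lstm_units, dropout=dropout),  # tanh activation by default
        keras.layers.Dense(1, activation="sigmoid"),     # probability of class 1
    ])
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model
```

Sizing the embedding from the tokenizer's word index, rather than hard-coding a number, keeps input_dim in step with the vocabulary if the training text changes.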
Step 4. Split the data and train the model using split_and_fit_dataset
- train_test_split splits the data according to the ratio of the test set to the total dataset, i.e. test_size = 0.2 implies train_size = 0.8
- fit the training data (x_train, y_train) to the model using the model.fit method
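The effect of test_size = 0.2 can be illustrated without any library. The helper `split_dataset` below is a hypothetical stand-in for scikit-learn's train_test_split, written only to show the arithmetic: the sample indices are shuffled and a 20% share is held out for testing.

```python
import random


def split_dataset(x, y, test_size=0.2, seed=42):
    """Shuffle indices and hold out a test_size fraction of the samples."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_test = int(round(len(x) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [x[i] for i in train_idx]
    x_test = [x[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


x = [[i] for i in range(10)]    # toy features
y = [i % 2 for i in range(10)]  # toy labels
x_train, x_test, y_train, y_test = split_dataset(x, y, test_size=0.2)
print(len(x_train), len(x_test))
```

Features and labels are indexed with the same shuffled positions, so each review stays paired with its sentiment after the split.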
Step 5. Evaluate the model performance on the test set using the model.evaluate method
- print test_acc and test_loss. These report the model performance on the test data
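For reference, the two numbers reported at this step can be computed by hand from predicted probabilities. The sketch below uses made-up predictions and hypothetical helper names; it shows what binary cross-entropy and accuracy (with an assumed 0.5 threshold) measure, not how Keras computes them internally.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    hits = sum(int(p >= threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]  # illustrative values only
test_loss = binary_crossentropy(y_true, y_prob)
test_acc = accuracy(y_true, y_prob)
print(f"test_loss={test_loss:.4f} test_acc={test_acc:.2f}")
```

On a balanced two-class test set, a model that guesses at random scores about 50% accuracy, which is a useful baseline when reading test_acc.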
Step 6. Iteratively tune the model hyperparameters
- Tune the hyperparameters of the network, such as the learning rate, activation functions, optimizer, depth of the network (number of hidden layers), width of the layers (number of neurons in each layer), Dropout layers, batch size, epochs, and L1 or L2 regularization. This is basically the part where the magic happens.
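A simple way to organise this search is to enumerate a small grid of settings and train one model per combination. The grid values below are placeholders chosen for illustration, and `train_and_score` is a hypothetical callback that would build, fit and evaluate a model for one configuration and return its validation score.

```python
from itertools import product

# Placeholder search space; adjust to the compute budget available.
grid = {
    "learning_rate": [0.01, 0.001],
    "dropout": [0.2, 0.5],
    "batch_size": [32, 64],
}


def iter_configs(grid):
    """Yield one dict per combination of hyperparameter values."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


def best_config(grid, train_and_score):
    """Return the configuration with the highest score, and that score."""
    scored = [(train_and_score(cfg), cfg) for cfg in iter_configs(grid)]
    best_score, best_cfg = max(scored, key=lambda pair: pair[0])
    return best_cfg, best_score
```

The number of combinations is the product of the list lengths, so each added hyperparameter multiplies the training time; a grid this size already means eight full training runs.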
Hyperparameters
Activation function: tanh
Optimizer: adam
learning rate, α: 0.01 (default)
dropout: 0.2, 0.9
train-test: 0.8/0.2
batch_size: 32
epochs: 50
loss function: binary_crossentropy
Results
Accuracy: 55.15%
Author
Citations
- Maas, A., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., & Potts, C. (2011, June). Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies (pp. 142-150).
- http://ai.stanford.edu/~amaas/data/sentiment/
- https://machinelearningmastery.com/predict-sentiment-movie-reviews-using-deep-learning/