Text classification is a core application of Natural Language Processing (NLP) and is used across many industries. This project applies Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) models to text classification. Both architectures are designed for sequential data, which makes them a natural fit for NLP tasks. The project specifically targets customer complaints about consumer financial products.
The primary objective is to leverage RNN and LSTM models for text classification on a dataset containing over two million customer complaints about consumer financial products.
Each record in the dataset contains the text of a customer complaint and its corresponding product category. To improve the text representation, pre-trained GloVe word vectors (glove.6B) are used.
- Language: Python
- Libraries: pandas, torch, nltk, numpy, pickle, re, tqdm, sklearn
Install the necessary packages using the pip command, then import the required libraries for the project.
Define configuration file paths for managing data and model-related parameters.
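As a rough illustration, a config.py along the following lines could hold the paths and parameters. The folder and file names follow the project layout described in this README; every variable name and every numeric value is an assumption made for this sketch, not the project's actual configuration.

```python
# config.py (illustrative sketch; variable names and values are assumptions)
from pathlib import Path

# Project folders, named as in the folder structure of this README
INPUT_DIR = Path("Input")
OUTPUT_DIR = Path("Output")

# Raw inputs
DATA_PATH = INPUT_DIR / "complaints.csv"
GLOVE_PATH = INPUT_DIR / "glove.6B.50d.txt"

# Artifacts written during processing and training
EMBEDDINGS_PATH = OUTPUT_DIR / "embeddings.pkl"
VOCABULARY_PATH = OUTPUT_DIR / "vocabulary.pkl"
LABEL_ENCODER_PATH = OUTPUT_DIR / "label_encoder.pkl"
LABELS_PATH = OUTPUT_DIR / "labels.pkl"
TOKENS_PATH = OUTPUT_DIR / "tokens.pkl"
MODEL_RNN_PATH = OUTPUT_DIR / "model_rnn.pkl"
MODEL_LSTM_PATH = OUTPUT_DIR / "model_lstm.pkl"

# Model and training parameters (placeholder values, tune as needed)
EMBEDDING_DIM = 50   # matches the 50-dimensional glove.6B.50d vectors
SEQ_LEN = 20         # tokens kept per complaint
BATCH_SIZE = 128
LEARNING_RATE = 1e-3
NUM_EPOCHS = 5
```

Keeping all paths in one module means the processing and training code can import them instead of hard-coding file locations.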
- Read the GloVe text file.
- Convert embeddings to a float array.
- Add embeddings for padding and unknown items.
- Save embeddings and vocabulary.
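The GloVe steps above can be sketched as follows. This is a minimal sketch, not the project's actual code: the function names, the `<pad>` and `<unk>` token strings, and the choice of a zero vector for padding and the mean vector for unknown words are all assumptions.

```python
import pickle

import numpy as np


def load_glove(lines, dim=50):
    """Parse GloVe text lines ("word v1 ... v_dim") into a word list and a float array."""
    vocabulary, vectors = [], []
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vocabulary.append(parts[0])
        vectors.append([float(value) for value in parts[1:]])
    embeddings = np.array(vectors, dtype=np.float32)
    return vocabulary, embeddings


def add_special_tokens(vocabulary, embeddings):
    """Prepend a padding row (zeros) and an unknown-word row (mean vector)."""
    pad_vector = np.zeros((1, embeddings.shape[1]), dtype=embeddings.dtype)
    unk_vector = embeddings.mean(axis=0, keepdims=True)
    vocabulary = ["<pad>", "<unk>"] + list(vocabulary)
    embeddings = np.vstack([pad_vector, unk_vector, embeddings])
    return vocabulary, embeddings


def save_pickle(obj, path):
    """Serialize any Python object to disk with pickle."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)
```

For the real file, `load_glove` can be given the open file object directly (for example `open("Input/glove.6B.50d.txt", encoding="utf-8")`), since it only needs an iterable of lines. Placing the padding row at index 0 is a convenient convention because padded positions then map to an all-zero vector.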
- Read the CSV file and handle null values.
- Address duplicate labels.
- Encode the label column and save the encoder and encoded labels.
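A hedged sketch of the label steps is shown below. The column names and the mapping used to merge duplicate labels are assumptions for illustration; check them against the actual headers and category values in complaints.csv.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Assumed column names; adjust to the real headers in complaints.csv.
TEXT_COL = "Consumer complaint narrative"
LABEL_COL = "Product"

# Hypothetical mapping that merges near-duplicate category names.
PRODUCT_MAP = {
    "Credit card or prepaid card": "Credit card",
    "Prepaid card": "Credit card",
}


def prepare_labels(df):
    """Drop rows with missing text or label, merge duplicate labels, encode as integers."""
    df = df.dropna(subset=[TEXT_COL, LABEL_COL]).copy()
    df[LABEL_COL] = df[LABEL_COL].replace(PRODUCT_MAP)
    encoder = LabelEncoder()
    labels = encoder.fit_transform(df[LABEL_COL])
    return df, encoder, labels
```

In the project, the input frame would come from `pd.read_csv` on the complaints file, and the fitted encoder and the encoded labels would then be pickled so that predictions can later be mapped back to category names.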
- Convert text to lowercase.
- Remove punctuation, digits, and additional spaces.
- Tokenize the text.
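The text cleaning steps above might look like the sketch below. The function names are illustrative, and the whitespace split is a stand-in for a proper tokenizer (the project lists nltk, whose tokenizer could be used here instead).

```python
import re


def preprocess(text):
    """Lowercase, then strip punctuation, digits, and extra whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]|_", " ", text)    # punctuation -> space
    text = re.sub(r"\d+", " ", text)          # digits -> space
    text = re.sub(r"\s+", " ", text).strip()  # collapse repeated spaces
    return text


def tokenize(text):
    """Split cleaned text into tokens on whitespace (simple stand-in tokenizer)."""
    return preprocess(text).split()
```

Punctuation is replaced with a space rather than deleted so that words joined by a comma or slash are not fused into a single token.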
Construct a data loader for efficient model training.
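One way to build such a loader with PyTorch is sketched below. The class name, the fixed sequence length, and the `<pad>` and `<unk>` keys in the word-to-index mapping are assumptions of this example rather than details taken from the project.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ComplaintDataset(Dataset):
    """Turns token lists into fixed-length index tensors paired with integer labels."""

    def __init__(self, token_lists, labels, word2idx, seq_len=20):
        self.token_lists = token_lists
        self.labels = labels
        self.word2idx = word2idx
        self.seq_len = seq_len
        self.pad_idx = word2idx["<pad>"]
        self.unk_idx = word2idx["<unk>"]

    def __len__(self):
        return len(self.token_lists)

    def __getitem__(self, idx):
        tokens = self.token_lists[idx][: self.seq_len]               # truncate
        ids = [self.word2idx.get(tok, self.unk_idx) for tok in tokens]
        ids += [self.pad_idx] * (self.seq_len - len(ids))            # pad to seq_len
        features = torch.tensor(ids, dtype=torch.long)
        target = torch.tensor(self.labels[idx], dtype=torch.long)
        return features, target


def make_loader(dataset, batch_size=128, shuffle=True):
    """Wrap a dataset in a DataLoader that yields (features, target) batches."""
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)
```

Because every sample is padded or truncated to the same length, the default batching of `DataLoader` can stack samples without a custom collate function.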
- Define RNN architecture.
- Define LSTM architecture.
- Create functions for training and testing the models.
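As an illustration of the two architectures, the sketch below uses a single class whose recurrent layer is either `nn.RNN` or `nn.LSTM`. This is a simplification chosen for brevity; the project defines the two architectures separately. The class name, the hidden size, and the assumption that the padding row sits at index 0 of the embedding matrix are hypothetical.

```python
import torch
import torch.nn as nn


class RecurrentClassifier(nn.Module):
    """Embedding -> RNN or LSTM -> linear layer over the final hidden state."""

    def __init__(self, embeddings, num_classes, hidden_size=128, cell="lstm"):
        super().__init__()
        # embeddings: float tensor of shape (vocab_size, embedding_dim)
        self.embedding = nn.Embedding.from_pretrained(
            embeddings, freeze=True, padding_idx=0
        )
        recurrent_cls = nn.LSTM if cell == "lstm" else nn.RNN
        self.recurrent = recurrent_cls(
            embeddings.shape[1], hidden_size, batch_first=True
        )
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)   # (batch, seq_len, embedding_dim)
        _, hidden = self.recurrent(embedded)
        if isinstance(hidden, tuple):          # LSTM returns (h_n, c_n)
            hidden = hidden[0]
        return self.classifier(hidden[-1])     # logits: (batch, num_classes)
```

The only structural difference between the two variants is the recurrent layer: the LSTM carries an additional cell state, which is why its second return value is a tuple.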
- Train the RNN model.
- Train the LSTM model.
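A training pass for either model could be written roughly as below. The function name and signature are assumptions; it expects any iterable of (inputs, targets) batches, and the optimizer and loss function (for example Adam and cross-entropy) are supplied by the caller.

```python
import torch
import torch.nn as nn


def train_one_epoch(model, loader, optimizer, loss_fn, device="cpu"):
    """Run one pass over the batches and return the mean loss per sample."""
    model.train()
    total_loss, num_samples = 0.0, 0
    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()                    # clear gradients from the last step
        loss = loss_fn(model(inputs), targets)
        loss.backward()                          # backpropagate
        optimizer.step()                         # update weights
        total_loss += loss.item() * inputs.size(0)
        num_samples += inputs.size(0)
    return total_loss / num_samples
```

The same function can be called once per epoch for the RNN and then for the LSTM, after which each trained model is saved to the output folder.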
Make predictions using the trained models on the test data.
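Prediction on held-out data can be sketched as follows. The function names are hypothetical, and the model is assumed to return one row of class logits per input sample.

```python
import torch


@torch.no_grad()
def predict(model, loader, device="cpu"):
    """Return the predicted class index for every sample in the loader."""
    model.eval()
    predictions = []
    for inputs, _ in loader:
        logits = model(inputs.to(device))
        predictions.append(logits.argmax(dim=1).cpu())
    return torch.cat(predictions)


def accuracy(predictions, targets):
    """Fraction of predictions that match the true labels."""
    return (predictions == targets).float().mean().item()
```

The predicted indices can be converted back to product category names with the saved label encoder, for instance through its `inverse_transform` method if a scikit-learn `LabelEncoder` was used.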
- Input: Contains data required for analysis, including:
  - complaints.csv
  - glove.6B.50d.txt (download separately)
- Source: Contains modularized code for various project steps, including:
  - model.py
  - data.py
  - utils.py

  These Python files contain functions used in the Engine.py file.
- Output: Contains files required for model training, including:
  - embeddings.pkl
  - label_encoder.pkl
  - labels.pkl
  - model_lstm.pkl
  - model_rnn.pkl
  - vocabulary.pkl
  - tokens.pkl

  (The model_lstm.pkl and model_rnn.pkl files are our saved models after training.)
- config.py: Contains project configurations.
- Engine.py: The main file to run the entire project, which trains the models and saves them in the output folder.