Text classification
A bit of deep learning code that I'm planning on using in kaggle competitions to get up to speed in NLP.
Model description
Model
This model is mainly composed of:
- Embeddings layer (possibility to use pre-trained GloVe embeddings)
- Convolution layer
- LSTM layer
- Attention mechanism
- Dense layers
- Softmax as the last layer
Its complexity varies depending on the embeddings dimension, and is roughly around 430,000 parameters.
Loss is categorical cross-entropy, minimised with the Adam algorithm.
Example with embeddings of dimension 10:
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_2 (InputLayer) [(None, 10)] 0
__________________________________________________________________________________________________
embedding_1 (Embedding) (None, 10, 10) 180 input_2[0][0]
__________________________________________________________________________________________________
conv1d_1 (Conv1D) (None, 10, 32) 992 embedding_1[0][0]
__________________________________________________________________________________________________
max_pooling1d_1 (MaxPooling1D) (None, 5, 32) 0 conv1d_1[0][0]
__________________________________________________________________________________________________
batch_normalization_3 (BatchNor (None, 5, 32) 128 max_pooling1d_1[0][0]
__________________________________________________________________________________________________
lstm_1 (LSTM) (None, 5, 100) 53200 batch_normalization_3[0][0]
__________________________________________________________________________________________________
dense_6 (Dense) (None, 5, 1) 101 lstm_1[0][0]
__________________________________________________________________________________________________
flatten_2 (Flatten) (None, 5) 0 dense_6[0][0]
__________________________________________________________________________________________________
activation_1 (Activation) (None, 5) 0 flatten_2[0][0]
__________________________________________________________________________________________________
repeat_vector_1 (RepeatVector) (None, 100, 5) 0 activation_1[0][0]
__________________________________________________________________________________________________
permute_1 (Permute) (None, 5, 100) 0 repeat_vector_1[0][0]
__________________________________________________________________________________________________
multiply_1 (Multiply) (None, 5, 100) 0 lstm_1[0][0]
permute_1[0][0]
__________________________________________________________________________________________________
flatten_3 (Flatten) (None, 500) 0 multiply_1[0][0]
__________________________________________________________________________________________________
dense_7 (Dense) (None, 500) 250500 flatten_3[0][0]
__________________________________________________________________________________________________
dropout_3 (Dropout) (None, 500) 0 dense_7[0][0]
__________________________________________________________________________________________________
batch_normalization_4 (BatchNor (None, 500) 2000 dropout_3[0][0]
__________________________________________________________________________________________________
dense_8 (Dense) (None, 200) 100200 batch_normalization_4[0][0]
__________________________________________________________________________________________________
dropout_4 (Dropout) (None, 200) 0 dense_8[0][0]
__________________________________________________________________________________________________
batch_normalization_5 (BatchNor (None, 200) 800 dropout_4[0][0]
__________________________________________________________________________________________________
dense_9 (Dense) (None, 100) 20100 batch_normalization_5[0][0]
__________________________________________________________________________________________________
dropout_5 (Dropout) (None, 100) 0 dense_9[0][0]
__________________________________________________________________________________________________
dense_10 (Dense) (None, 10) 1010 dropout_5[0][0]
__________________________________________________________________________________________________
dense_11 (Dense) (None, 1) 11 dense_10[0][0]
==================================================================================================
Total params: 429,222
Trainable params: 427,758
Non-trainable params: 1,464
__________________________________________________________________________________________________
Training was using 400% CPU and 13G RAM on my laptop.
Results
Tested on Kaggle's real or not, NLP with disaster tweets, results are pretty bad for now.
Preprocessing
Words are tokenized using Keras' tokenizer. The sequence length is fixed and defined in the model parameters. Any tokenized sequence that has more words that the fixed length will be truncated, and the shorter ones will be zero-padded.
Labels are one-hot encoded with scikit-learn's LabelBinarizer.
Evaluation
Evaluation is performed with Keras' function "evaluate" and reports loss and accuracy across all classes. The model is evaluated on the training set with cross validation.
Architecture
classifier package
This is the main python package. It contains:
- model.py: wrapper that builds the keras model
- preprocessing.py: contains a class for text preprocessing (tokenizing and zero-padding)
- pipeline.py: chains the preprocessing and the model, and formats the labels
- utils: functions for data loading and reading pre-trained GloVe embeddings.
Tests
Contains one test file per module in the classifier package. These tests are designed to run using pytest.
data
Folder I use to store glove embeddings and training/test data.
models
Where I initially wanted to store the trained models. This functionality is not yet supported.
Usage
This code can be used in a python script or from the command line. The script main.py shows an example of how to create a text classification pipeline, cross-validate the model and generate predictions.
Command line
the repository can be run from outside this folder. All arguments start with "--"
python keras_text_classifier --training_data path_to_file.csv
Arguments:
- training_data: path to training data
- prediction_data: path to prediction data
- output_predictions: path where to output the predictions
- sequence_length: maximum number of words taken into account by the model
- embeddings_path: path to pretrained glove embeddings.
- embeddings_dim: dimension of the embeddings. If the embeddings path is given, this number must match the dimension of the embeddings.
- batch_size: batch size when training the model. Defaults to 32.
- epochs: number of epochs to train the model. Defaults to 1.
Dependencies
Built on:
- Tensorflow 2.3.1
- pandas 1.1.3
- scikit-learn 0.23.2
Running tests
Using pytest (my version: 6.6.1), go to the repository's directory and execute:
python -m pytest tests