Class assignment for language analytics class at Aarhus University.
2021-04-19
This assignment is Class Assignment 6. The task was to create two scripts for text classification: one using a Logistic Regression (LR) classifier, the other using a Deep Learning Convolutional Neural Network (DL CNN) model. Both classifiers try to model the relationship between each season of Game of Thrones and the lines spoken. That is to say: can a model predict which season a line comes from? Or, to phrase it another way, is dialogue a good predictor of season?
The LR script takes the script of the TV series Game of Thrones as input, trains a Logistic Regression (LR) classifier, prints the evaluation metrics to the terminal, and saves a classification report and a confusion matrix in a directory. The DL CNN script takes the same input, trains a DL CNN classifier, prints the evaluation metrics to the terminal, and saves a classification report and a performance graph in a directory.
The task is to classify lines from the TV series Game of Thrones by season. To address it, I first used a 'classical' machine learning solution, CountVectorizer + LogisticRegression, to establish a baseline model performance. Afterwards, I employed a Deep Learning CNN model using the neural network framework TensorFlow 2.4. CNNs are useful in text classification tasks because they can model local structure in text (words with their immediate context) and are more sophisticated than LR models, and therefore suitable for complex cultural data. The CNN's architecture consists of an Embedding layer, a Convolutional layer (CONV) with a ReLU activation function, a Global Max Pooling layer (GlobalMAXPOOL) and a Fully Connected layer (FC). The output layer (OUT) uses a softmax activation function and has 8 possible classes, one per season.
Model's architecture: Embedding -> CONV+ReLU -> GlobalMAXPOOL -> FC+ReLU -> OUT(softmax)
CNNs are prone to overfitting, so I applied L2 weight regularization to the CONV and FC layers (l2 = L2(0.0001)). L2 regularization adds a penalty on large weights to the loss, which limits how closely the model can fit the training data.
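To make the effect of the regularizer concrete, here is a minimal plain-Python sketch of the term that an L2 regularizer with factor 0.0001 adds to the loss: the factor times the sum of the squared weights. The function name and the example weights are illustrative and are not taken from the scripts.

```python
def l2_penalty(weights, l2=0.0001):
    """Return the L2 regularization term: l2 * sum of squared weights."""
    return l2 * sum(w * w for w in weights)


# Illustrative weights only. In the CNN, the penalty is computed over the
# kernel weights of the regularized layers and added to the training loss.
small = [0.1, -0.2, 0.3]
large = [1.0, -2.0, 3.0]

print(l2_penalty(small))
print(l2_penalty(large))
```

Larger weights produce a larger penalty, so the optimizer is pushed toward smaller weights unless a large weight clearly reduces the classification loss.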
A depiction of the full model's architecture can be found in the folder called 'out'.
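The architecture described above could be written in Keras roughly as follows. This is a sketch, not the code in src/GoT_deep.py: the vocabulary size, embedding dimension, number of filters, kernel size and number of FC units are placeholder assumptions, and only the layer order, the L2 factor, the input length of 100 and the 8 output classes come from this README.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers


def build_cnn(vocab_size, embedding_dim=100, maxlen=100, n_classes=8):
    """Embedding -> CONV+ReLU -> GlobalMAXPOOL -> FC+ReLU -> OUT(softmax)."""
    l2 = regularizers.L2(0.0001)  # L2 factor stated in this README
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(maxlen,)),
        layers.Embedding(vocab_size, embedding_dim),
        # 128 filters and kernel size 5 are assumed values
        layers.Conv1D(128, 5, activation="relu", kernel_regularizer=l2),
        layers.GlobalMaxPool1D(),
        # 32 units is an assumed value
        layers.Dense(32, activation="relu", kernel_regularizer=l2),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(loss="categorical_crossentropy",
                  optimizer="adam",
                  metrics=["accuracy"])
    return model


model = build_cnn(vocab_size=5000)  # hypothetical vocabulary size
model.summary()
```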
File | Description |
---|---|
data/ | Folder containing the input data for the scripts |
data/Game_of_Thrones_Script.csv | CSV file used as input for the scripts |
out/ | Folder containing files produced by the scripts |
out/CNN_Model´s_architecture.png | CNN model´s architecture |
out/GoT_CNN_classification_performance_graph.png | Performance graph of CNN classifier |
out/GoT_CNN_classifier_report.csv | Classification metrics of the CNN classifier |
out/GoT_LR_classification_report.csv | Classification metrics of the LR classifier |
out/GoT_LR_confusion_matrix.png | Confusion matrix of LR classifier |
src/ | Folder containing the scripts |
src/GoT_deep.py | CNN classifier script |
src/GoT_LogReg.py | Logistic Regression classifier script |
utils/ | Folder containing utility script for the project |
utils/classifier_utils.py | utility script used in LR classifier script |
LICENSE | A software license defining what other users can and can't do with the source code |
README.md | Description of the assignment and the instructions |
create_GoT_venv.bash | bash file for creating a virtual environment |
kill_GoT_venv.bash | bash file for removing a virtual environment |
requirements.txt | list of python packages required to run the script |
The data is the complete Game of Thrones script for all seasons (8 seasons in total: 10 episodes in each of the first 6 seasons, then 7 and 6 episodes in the last two) in the form of a table with 6 columns of different data types. Each column is described in the data description at the link below. For this assignment, the columns 'Season' and 'Sentence' were used.
Link to data: https://www.kaggle.com/albenft/game-of-thrones-script-all-seasons
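As a minimal sketch of how the two columns could be pulled out of the CSV, the snippet below uses only the Python standard library. The function name is hypothetical, and the sample rows (including the format of the season labels) are made-up placeholders, not lines from the dataset.

```python
import csv
import io


def load_lines(csv_file):
    """Read the 'Sentence' and 'Season' columns from an open CSV file."""
    sentences, seasons = [], []
    for row in csv.DictReader(csv_file):
        sentences.append(row["Sentence"])
        seasons.append(row["Season"])
    return sentences, seasons


# Placeholder rows standing in for the real file.
sample = io.StringIO(
    "Season,Sentence\n"
    "Season 1,An example line of dialogue.\n"
    "Season 2,Another example line.\n"
)
sentences, seasons = load_lines(sample)
print(seasons)
print(sentences)
```

With the real data, the same function could be called on a file opened with `open("data/Game_of_Thrones_Script.csv", newline="", encoding="utf-8")`.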
Data preprocessing
The preprocessing of data for the LR model included the following step:
- vectorizing training and test data using sklearn CountVectorizer(), which transforms each text into a vector of word frequencies (counts) over the vocabulary of the whole corpus.
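The count-based representation can be illustrated in plain Python. This sketch is not the script's code and does not call sklearn; it mimics the behaviour described above (lowercasing, word tokens of two or more characters, an alphabetically sorted vocabulary), and the example sentences are made up.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # word tokens of 2+ characters


def count_vectorize(docs):
    """Return (vocabulary, count matrix) for a list of documents."""
    tokenized = [TOKEN.findall(doc.lower()) for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    matrix = []
    for toks in tokenized:
        row = [0] * len(vocab)
        for tok, n in Counter(toks).items():
            row[index[tok]] = n
        matrix.append(row)
    return vocab, matrix


docs = ["Winter is coming", "The winter is here, the long winter"]
vocab, matrix = count_vectorize(docs)
print(vocab)   # ['coming', 'here', 'is', 'long', 'the', 'winter']
print(matrix)  # [[1, 0, 1, 0, 0, 1], [0, 1, 1, 1, 2, 2]]
```

Each row is one line of dialogue and each column is one vocabulary word, which is the input the Logistic Regression classifier is trained on.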
The preprocessing of data for the DL CNN model included the following steps:
- tokenizing training and test data using tensorflow.keras.preprocessing.text.Tokenizer(), which converts text to sequences of integers; to make the Tokenizer output usable as model input, the documents are padded to equal length (maxlen = 100)
- transforming labels to binarized vectors
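The steps above can be sketched in plain Python to show what the data looks like after each one. This is an illustration, not the Keras or sklearn implementation: the helper names are hypothetical, padding and truncation at the end of the sequence are assumptions (the README does not say which side is padded), and the texts and labels are made up.

```python
import re
from collections import Counter


def fit_word_index(texts):
    """Map each word to an integer; more frequent words get lower indices."""
    counts = Counter(tok for t in texts for tok in re.findall(r"\w+", t.lower()))
    # Index 0 is reserved for padding, so real words start at 1.
    return {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}


def texts_to_padded(texts, word_index, maxlen=100):
    """Convert texts to integer sequences of exactly maxlen items."""
    padded = []
    for t in texts:
        seq = [word_index[tok] for tok in re.findall(r"\w+", t.lower())
               if tok in word_index]
        seq = seq[:maxlen]                                # truncate long lines
        padded.append(seq + [0] * (maxlen - len(seq)))    # pad short lines
    return padded


def binarize_labels(labels):
    """One-hot encode labels; classes are sorted alphabetically."""
    classes = sorted(set(labels))
    return classes, [[1 if lab == c else 0 for c in classes] for lab in labels]


texts = ["winter is coming", "winter is here"]
labels = ["Season 1", "Season 2"]

word_index = fit_word_index(texts)
X = texts_to_padded(texts, word_index, maxlen=5)  # maxlen=5 for readability
classes, y = binarize_labels(labels)
print(X)  # [[1, 2, 3, 0, 0], [1, 2, 4, 0, 0]]
print(y)  # [[1, 0], [0, 1]]
```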
Both scripts were tested on an HP computer running Windows 10. They were executed on Jupyter worker02.
Script parameters
Logistic Regression classifier
Parameter | Description |
---|---|
data_dir (dir) | Path to the CSV input file |
test_size (tes) | The size of the test set as a fraction of the data. Default = 0.25 (25%) |
DL CNN classifier
Parameter | Description |
---|---|
data_dir (dir) | Path to the CSV input file |
test_size (tes) | The size of the test set as a fraction of the data. Default = 0.2 (20%) |
optimizer (optim) | Method to update the weight parameters to minimize the loss function. Default = Adam |
epochs (ep) | Defines how many times the learning algorithm will work through the entire training dataset. Default = 20 |
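A command-line interface with these parameters could be defined with argparse along the following lines. This is a sketch for the DL CNN script, not the parser in src/GoT_deep.py: the short flags follow the example commands in this README (-dir, -test, -optim, -ep), and pairing them with long names taken from the table is an assumption.

```python
import argparse


def build_parser():
    """Build a parser mirroring the DL CNN parameter table (sketch)."""
    parser = argparse.ArgumentParser(description="Train the DL CNN classifier")
    parser.add_argument("-dir", "--data_dir", required=True,
                        help="Path to the CSV input file")
    parser.add_argument("-test", "--test_size", type=float, default=0.2,
                        help="Fraction of the data used for testing")
    parser.add_argument("-optim", "--optimizer", default="Adam",
                        help="Optimizer used to update the weights")
    parser.add_argument("-ep", "--epochs", type=int, default=20,
                        help="Number of passes over the training data")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["-dir", "../data/Game_of_Thrones_Script.csv", "-ep", "50"]
)
print(args.data_dir, args.test_size, args.optimizer, args.epochs)
```

Parameters that are not passed fall back to the defaults listed in the table.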
Steps
Set-up:
#1 Open terminal on worker02 or locally
#2 Navigate to the directory where you want to clone this repository
#3 Clone the repository
$ git clone https://github.com/Rutatu/cds-language_Assignment_6.git
#4 Navigate to the newly cloned repo
$ cd cds-language_Assignment_6
#5 Create virtual environment with its dependencies and activate it
$ bash create_GoT_venv.sh
$ source ./GoT/bin/activate
Run the code:
#6 Navigate to the directory of the scripts
$ cd src
#7 Run each script with default parameters
$ python GoT_LogReg.py -dir ../data/Game_of_Thrones_Script.csv
$ python GoT_deep.py -dir ../data/Game_of_Thrones_Script.csv
#8 Run each script with self-chosen parameters
$ python GoT_LogReg.py -dir ../data/Game_of_Thrones_Script.csv -test 0.2
$ python GoT_deep.py -dir ../data/Game_of_Thrones_Script.csv -test 0.3 -optim SGD -ep 50
#9 To remove the newly created virtual environment
$ bash kill_GoT_venv.sh
#10 To find out possible optional arguments for both scripts
$ python GoT_LogReg.py --help
$ python GoT_deep.py --help
I hope it worked!
The LR classifier achieved a weighted average accuracy of 26% when classifying Game of Thrones lines by season. The DL CNN classifier achieved a weighted average accuracy of 24%, slightly worse than the LR classifier. These results indicate that classifying the season of a TV series from the lines spoken is a very challenging task, and that an approach as simple as LR can perform better than a CNN. There are several possible reasons: not enough data (at most 10 episodes per season, 8 seasons in total); Game of Thrones seasons may be especially hard to tell apart because of the nature of the conversations; or the data preprocessing steps may have been too shallow or not relevant.
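Assuming the weighted averages above come from the "weighted avg" row of the classification report, each one is the mean of the per-class scores weighted by the number of test lines in each class (the support). A minimal sketch with hypothetical scores and supports, not the actual results:

```python
def weighted_average(scores, supports):
    """Average per-class scores, weighting each class by its support."""
    total = sum(supports)
    return sum(s * n for s, n in zip(scores, supports)) / total


# Hypothetical per-class scores and numbers of test lines per class.
scores = [0.30, 0.20, 0.25]
supports = [100, 300, 200]

print(round(weighted_average(scores, supports), 4))
```

Because of the weighting, seasons with more lines in the test set influence the reported figure more than seasons with fewer lines.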
GoT_CNN_classification_performance_graph suggests that overfitting might have been a problem: the validation loss increased throughout training, opening a large gap to the training loss, while training accuracy approached 100%. We cannot yet conclude that TV series scripts are unsuitable for text classification tasks such as classifying seasons. More experimentation with different datasets and hyperparameters is needed.