/insa_kaggle

Repo containing all the files for the Kaggle competition hosted by INSA Toulouse


Welcome to our repo 👋

Description

This is our repository for the 5th edition of the Défi IA.

This edition of the Défi IA pertains to NLP. The task is straightforward: assign the correct job category to a job description. This is thus a multi-class classification task with 28 classes to choose from.

Data

The data has been retrieved from CommonCrawl, which was famously used to train OpenAI's GPT-3 model. The data is therefore representative of what can be found on the English-speaking part of the Internet, and thus contains a certain amount of bias. One of the goals of this competition is to design a solution that is both accurate and fair. The train set contains 217,197 samples of job descriptions along with their labels and genders. The test set contains 54,300 samples of job descriptions along with their genders; this is the set used for submissions.

Evaluation

A distinctive aspect of this competition is that solutions are ranked on two tracks. First, solutions are ranked according to the macro F1 score, which is used to build the Kaggle leaderboard. From the scikit-learn documentation:

The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. The relative contribution of precision and recall to the F1 score are equal.

Second, submissions are ranked according to their fairness with respect to the provided genders. Specifically, the average demographic parity (disparate impact) across all classes is measured.

Essentially, we look at the disparate impact of each job category with respect to the two genders, and then compute the unweighted average of these disparate impacts.
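
For reference, here is a rough sketch of both metrics in Python. This is not the official evaluation script: the macro F1 part is standard scikit-learn, but the disparate impact formula (per-class ratio of selection rates between the two genders) and the "F"/"M" gender encoding are our assumptions and may differ from the organizers' exact definition.

import numpy as np
from sklearn.metrics import f1_score

def macro_f1(y_true, y_pred):
    # Unweighted mean of the per-class F1 scores (the leaderboard metric)
    return f1_score(y_true, y_pred, average="macro")

def mean_disparate_impact(y_pred, gender, classes):
    # Assumed formula: for each class, compare the rate at which each gender
    # is assigned that class, take min/max, then average over all classes
    y_pred, gender = np.asarray(y_pred), np.asarray(gender)
    ratios = []
    for c in classes:
        rate_f = np.mean(y_pred[gender == "F"] == c)
        rate_m = np.mean(y_pred[gender == "M"] == c)
        if max(rate_f, rate_m) > 0:
            ratios.append(min(rate_f, rate_m) / max(rate_f, rate_m))
    return float(np.mean(ratios))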


Results

We finished 14th on the private leaderboard with a private score of 0.81026.

Below are all of our submissions, from most recent to first.

No. | Submission | Private Score | Public Score
10 | Roberta + Bert + XLNet, 2 epochs, voting, length 256 | 0.81026 | 0.80776
9 | Roberta + Bert + XLNet, 2 epochs, voting | 0.80636 | 0.80179
8 | Roberta-base, 2 epochs, first phrase (20 characters) | 0.69292 | 0.68064
7 | Roberta-base, 2 epochs, cleaned text | 0.79886 | 0.79813
6 | Roberta-large, 3 epochs | 0.77888 | 0.78147
5 | Roberta-base, 5 epochs | 0.79442 | 0.79318
4 | xlnet-base-cased, 2 epochs | 0.79953 | 0.79171
3 | roberta-base, 2 epochs | 0.79903 | 0.79789
2 | bert-base-cased, 2 epochs | 0.79475 | 0.79647
1 | ULMFiT | 0.78005 | 0.77978
  • Our best submission is the one where we ensembled 3 models (Roberta, Bert and XLNet), each trained with an input length of 256 and a batch size of 32 for 2 epochs. We used hard (majority) voting to combine their predictions.
  • We could probably have obtained better scores by using soft voting and by aggregating more models.
  • Our best single-model submission was a Roberta model on the public leaderboard and an XLNet model on the private leaderboard.
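
As an illustration of the hard-voting step mentioned above, here is a minimal sketch; the variable names are hypothetical, and each array is assumed to hold one model's predicted class ids (non-negative integers) for the test set.

import numpy as np

def hard_vote(*predictions):
    # Stack each model's predicted class ids (one row per model) and keep the
    # most frequent label per sample; ties fall back to the smallest class id
    stacked = np.vstack(predictions)
    return np.array([np.bincount(col).argmax() for col in stacked.T])

# e.g. final_pred = hard_vote(preds_roberta, preds_bert, preds_xlnet)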

Computing Resources

Prototyping was done on our local desktop or using free cloud resources such as Google Colab and Kaggle Kernels.

Local desktop specs:

  • CPU : i7 4770K, 8 threads
  • GPU : Nvidia GTX 1060, 6 GB
  • RAM : 16 GB
  • OS : Ubuntu 18.04

Complete training was done on Google Cloud Platform VM:

  • CPU : Xeon, 8 threads
  • GPU : Nvidia Tesla V100, 16 GB
  • RAM : 30 GB
  • OS : Ubuntu 18.04

Runtime

  • Training runtime per epoch using Roberta-base:

GPU | length=128, batch=16 | length=128, batch=32 | length=256, batch=16 | length=256, batch=32
GTX 1060 | 1 h 23 min | out of memory | out of memory | out of memory
Tesla T4 | 26 min 40 s | 21 min 29 s | 45 min 52 s | 42 min 44 s
Tesla V100 | 14 min 35 s | 9 min 18 s | 20 min 2 s | 15 min 17 s

Using a VM with a Tesla V100 reduced training time considerably, as shown in the runtime table. Since we trained for between 2 and 5 epochs and had to test several models with different hyperparameters, the time saved overall was substantial.

  • Preprocessing runtime by preprocessing task:

Cores | Cleaning | Language | Complexity | Distance
1 | 30 s | 13 min 57 s | 2 min 19 s | 1 h 56 min 21 s
2 | 16 s | 6 min 57 s | 55 s | 1 h 34 s
4 | 9 s | 3 min 45 s | 26 s | 29 min 54 s
8 | 8 s | 3 min 16 s | 24 s | 20 min 44 s

During preprocessing, we saw a very large reduction in execution time by parallelizing our functions. This optimization step was not essential, but it let us try multiple preprocessing pipelines quickly. The dataset was not big enough to justify scaling up the number of CPU cores further, so we kept preprocessing on our desktop rather than setting up another VM with extra CPU cores.

Task | Description | Implementation | Speed-up (8 cores)
Cleaning | Text cleaning | regex | 3.75
Language | Language detection | langdetect | 4.27
Complexity | Flesch Reading Ease | textstat | 5.79
Distance | Levenshtein distance | numpy + numba | 5.61
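
As an example of how such a task can be parallelized, here is a rough sketch using the standard multiprocessing module; the clean_text helper and the "description" column name are illustrative, not the exact code from this repo.

import re
from multiprocessing import Pool

def clean_text(text):
    # Illustrative cleaning step: lowercase and strip non-alphanumeric characters
    return re.sub(r"[^a-z0-9\s]", " ", text.lower())

def parallel_map(func, items, processes=8):
    # Split the rows across worker processes and apply func to each one
    with Pool(processes=processes) as pool:
        return pool.map(func, items, chunksize=1000)

# e.g. train["cleaned"] = parallel_map(clean_text, train["description"].tolist())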

Reproducibility

Environment

In order to reproduce our results, you have to recreate our virtual environment. We recommend using conda to create an environment and installing the dependencies with the requirements.txt file.

conda create -n defi_ia python=3.7
conda activate defi_ia
pip install -r requirements.txt

Downloading data

You also have to download all the data into the /data directory. A bash script is provided to do so; execute it from inside the data directory.

cd data/
bash download_data.sh

Training

To train a single model or multiple models at the same time, use the provided scripts inside the /script directory.

Single model

You can choose the model and its hyperparameters at the beginning of the script run_single.py:

TEST = False
SAMPLE = 1000
EPOCH = 2
LENGTH = 128
BATCH = 16
FAMILY = "roberta"
FAMILYMODEL = "roberta-base"

Where :

  • TEST : whether you just want to check that the whole script runs (True) on a subset of the data (SAMPLE) or not
  • EPOCH : number of epochs to train for
  • LENGTH : input length used by the transformer model, i.e. how many characters the model takes as input
  • BATCH : batch size used for training and testing
  • FAMILY : type of model or architecture you want to use (bert, roberta, xlnet, see more)
  • FAMILYMODEL : model id (roberta-base, bert-base-cased, see more)
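
For reference, these constants plausibly map onto simpletransformers as sketched below. This is only an illustration of the library's API, not a copy of run_single.py, and the exact arguments used in the script may differ.

from simpletransformers.classification import ClassificationModel

model_args = {
    "num_train_epochs": EPOCH,   # EPOCH, LENGTH and BATCH from the settings above
    "max_seq_length": LENGTH,
    "train_batch_size": BATCH,
    "eval_batch_size": BATCH,
}

# 28 job categories to predict (see the task description)
model = ClassificationModel(FAMILY, FAMILYMODEL, num_labels=28, args=model_args)
model.train_model(train_df)  # train_df: DataFrame with "text" and "labels" columns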

Once you have set the parameters of the model you want to train, execute the script inside the script directory:

cd script/
python run_single.py

The trained model will be saved as a pickle file in the /model directory and can be used for prediction later on. The naming format is as follows: FAMILYMODEL_LENGTH_EPOCH.pkl, e.g. roberta-base_32_2.pkl for a roberta-base model trained with a length of 32 for 2 epochs.
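
For illustration, this naming convention corresponds to something like the following; the actual saving code lives in run_single.py and may differ.

import pickle

# e.g. "../model/roberta-base_128_2.pkl" for FAMILYMODEL="roberta-base", LENGTH=128, EPOCH=2
output_path = f"../model/{FAMILYMODEL}_{LENGTH}_{EPOCH}.pkl"
with open(output_path, "wb") as f:
    pickle.dump(model, f)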

Multiple models

Another script is provided to train multiple models at the same time. The same parameters can be found at the beginning of the script run_multiple.py, except that you have to set FAMILY and FAMILYMODEL manually inside the script:

model = ClassificationModel(
    FAMILY, FAMILYMODEL, num_labels=len(eval_df.labels.unique()), args=model_args)

The models are saved the same way as for training a single model (see before).

Prediction

To run predictions from a trained model, load it with pickle and call its predict method on the data. This method returns two outputs: the label predictions and the probabilities for each label.

import pickle

model_path = "../model/"
data_path = "../data/"

# Load the preprocessed test set
test = pickle.load(open(data_path + "test.pkl", "rb"))

# Load a trained model and predict on the cleaned job descriptions
roberta = pickle.load(open(model_path + "roberta.pkl", "rb"))
test["prediction"], probabilities = roberta.predict(test.cleaned)

Replace roberta.pkl with the name of your model. A few models are provided in the /model directory, but you have to download them using the provided bash script:

cd model/
bash download_models.sh

Authors

Team 3TP Mastère Spécialisé Valorisation des Données Massives (VALDOM)

👤 Premchanok BAMRUNG 👤 Thibault DELMON 👤 Thibaut HERNANDEZ 👤 Thomas HUSTACHE