NLPChallenge2021

AI Challenge hosted by INSA-Toulouse

TL;DR:

  • Ensemble of RoBERTa-large models: 20 RoBERTa-large fine-tuned on the full train dataset & 24 RoBERTa NLI fine-tuned on the augmented dataset
  • No preprocessing of the text itself
  • We tried Beyond-Back Translation & GenderSwap methods to tackle the fairness problem.
  • For fine-tuning, we trained on Kaggle TPUs (a rough fine-tuning sketch is shown below)
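
As a rough illustration of the fine-tuning step (not the competition code: the CSV path, column names "description"/"label" and the hyperparameters are assumptions), a single ensemble member could be trained with Hugging Face transformers like this, varying only the seed between members:

```python
# Minimal fine-tuning sketch for one RoBERTa-large ensemble member.
# Assumes a train.csv with a text column "description" and integer labels in "label".
import pandas as pd
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

df = pd.read_csv("train.csv")                      # hypothetical path
num_labels = df["label"].nunique()

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-large", num_labels=num_labels)

def tokenize(batch):
    # Fixed-length padding keeps the default data collator happy
    return tokenizer(batch["description"], truncation=True,
                     padding="max_length", max_length=256)

ds = Dataset.from_pandas(df).map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="roberta-job-cls",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=2e-5,
    seed=0,                                        # vary this seed per ensemble member
)
Trainer(model=model, args=args, train_dataset=ds, tokenizer=tokenizer).train()
```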

Report

Fairness and Job description classification (FR)

  • Preprocessing: How we perform label cleaning, data augmentation and gender swapping before training our models to improve fairness (see the augmentation sketch after this list)
  • Training: How we train our models after preprocessing (RoBERTa, Electra, T5)
  • TPU-RoBERTa: How we specifically trained RoBERTa-large's weights on TPUs
  • Submission: The template notebook we used for ensembling.
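
The gender-swapping idea can be sketched as a simple word-level substitution on the French descriptions. This is only an illustration under assumed, non-exhaustive word pairs; the actual rules live in the Preprocessing notebook and grammatical agreement is not handled here:

```python
# Illustrative gender-swap augmentation: swap a small dictionary of gendered
# French words in both directions to produce counterfactual copies of a text.
import re

SWAPS = {  # assumed, non-exhaustive word pairs
    "il": "elle", "elle": "il",
    "homme": "femme", "femme": "homme",
    "monsieur": "madame", "madame": "monsieur",
}

def gender_swap(text: str) -> str:
    def repl(match):
        word = match.group(0)
        swapped = SWAPS[word.lower()]
        return swapped.capitalize() if word[0].isupper() else swapped
    pattern = r"\b(" + "|".join(SWAPS) + r")\b"
    return re.sub(pattern, repl, text, flags=re.IGNORECASE)

print(gender_swap("Il est un homme très motivé."))
# -> "Elle est un femme très motivé."  (agreement of articles/adjectives is not fixed)
```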

Insights:

We use ideas from the winning team of the Jigsaw Multilingual Toxic Comment Classification competition to mitigate variability. We build a soft-voting classifier over multiple RoBERTa-large predictions; the models differ only in the random seed used to initialize the last layer. In addition, we make a prediction after each epoch and average these for the final prediction, which removes the need to pick a single best epoch, a significant practical advantage. A minimal sketch of this averaging is shown below.
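
The averaging itself is just a mean of class-probability matrices over all (seed, epoch) pairs. The shapes and the synthetic data below are illustrative assumptions, not taken from the competition:

```python
# Soft voting with per-epoch averaging: average all probability matrices,
# then take the argmax per sample.
import numpy as np

def ensemble_predict(prob_matrices):
    """prob_matrices: list of (n_samples, n_classes) arrays, one per (seed, epoch) pair."""
    mean_probs = np.mean(np.stack(prob_matrices), axis=0)  # soft voting
    return mean_probs.argmax(axis=1)

# e.g. 20 seeds x 3 epochs = 60 probability matrices (illustrative shapes)
rng = np.random.default_rng(0)
fake_probs = [rng.dirichlet(np.ones(5), size=100) for _ in range(60)]
labels = ensemble_predict(fake_probs)
```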

Datasets:

  • We saved all datasets produced by the Preprocessing notebook here
  • The differences between these datasets are explained in the Training notebook (along with links to the weights)

Validation strategy:

We did not do any cross-validation: we quickly noticed that a single hold-out sample was enough to evaluate our models, because the score on the public LB was very close to the one estimated with our hold-out strategy. A sketch of this split is shown below.
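
The hold-out split amounts to a single stratified split; the split fraction, file path and column name below are assumptions for illustration:

```python
# Single stratified hold-out split used instead of cross-validation.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("train.csv")                    # hypothetical path
train_df, holdout_df = train_test_split(
    df, test_size=0.1, stratify=df["label"], random_state=42)
# Models are fit on train_df; the score on holdout_df approximates the public LB score.
```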