/pmldl-detox.prj

[PMLDL] Detoxify

Author: Igor Alentev

Group: BS20-RO-01

Email: i.alentev@innopolis.university

Problem Definition

Can be found here

Installation and Running

Instead of requirements.txt, this repository uses a conda environment. Read on.

I am the proud owner of a laptop with AMD graphics (god bless Apple), so it is nearly impossible for me to run or test anything locally. In general everything should be fine, but I was unable to verify that everything runs as expected, so some issues are possible. I did my best to avoid any inconsistencies across the code.

  • All predictions preprocessed and saved locally
  • All metrics precalculated and saved locally
  • All datasets precomputed and saved here and for toxic words here
  • Colab notebooks rewritten locally
  • Dotenv tuned properly
  • Dependencies across the src files as well as notebooks should work
  • Checkpoints provided
  • conda environment exported to environment.yml

In particular, I recommend loading the checkpoints rather than re-running tuning and training; loading does work (afaik).
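
As a minimal sketch, the environment can be recreated from the exported file roughly as follows. This assumes conda is already installed and that environment.yml sits in the current directory; ENV_NAME is a placeholder, since the real name is whatever the `name:` field of environment.yml declares.

```shell
# Create the environment described in the exported file.
conda env create -f environment.yml

# List environments to find the name taken from the `name:` field of environment.yml.
conda env list

# Activate it; ENV_NAME is a placeholder, not the actual name used by this repo.
conda activate ENV_NAME
```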

Checkpoints

It was a hard decision, but I decided to store the model checkpoints alongside the project itself. So if you clone the repo, you will also have to clone 0.5GB of checkpoints. However, this is very handy: they are not that heavy and are useful throughout the work.

Notebooks

Reports

Main hypotheses, ideas and related information. The draft of the project

Final report, containing all the necessary information about the models, data retrieval and preprocessing, fine-tuning and evaluation

Acknowledgements

Please do not blame me if something does not work. I did my best to integrate everything seamlessly and spent many hours on this. I am aiming for the flipped class, so I would be very sad to get a bad mark because of some minor issue. That said, I am all for fair assessment and open to discussing real issues with the work.

  • Vladimir Ivanov for informative lectures
  • Maxim Evgrafov and Lada Morozova for incredibly useful labs
  • Skolkovo for work on detoxification
  • Skolkovo for another work on detoxification
  • This work for a great showcase of metrics and transformers
  • ParaNMT-50M dataset creators
  • Detoxify creators
  • WordNet creators
  • Yeah Yeah Yeahs for great music