Name: Leila Khaertdinova
Email: l.khaertdinova@innopolis.university
Group number: BS21 DS-02
This project is intended for the first assignment in the Practical Machine Learning and Deep Learning course at Innopolis University.
Text detoxification is the process of transforming text written in a toxic style into text with the same meaning but a neutral style.
The dataset is the filtered ParaNMT-detox corpus (500K sentence pairs), a subset of the ParaNMT corpus (50M sentence pairs). It has already been downloaded; you can find it in the data/raw folder. This is the main dataset for the detoxification task in this assignment.
To run this project, run the following commands in the repo root directory:
-
Create and activate the virtual environment:
python3 -m venv .venv
source .venv/bin/activate
-
Install the required dependencies:
pip install -r requirements.txt
-
Download the English language model for spaCy
python -m spacy download en_core_web_sm
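To confirm that the model was installed, a small check like the one below may help. It is only a sketch: it relies on spaCy models being installed as regular Python packages, and the helper name `is_model_installed` is hypothetical, not part of this repository.

```python
import importlib.util


def is_model_installed(package_name: str = "en_core_web_sm") -> bool:
    """Return True if the import system can locate the given package."""
    # find_spec returns None when a top-level package is not installed.
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    print("en_core_web_sm installed:", is_model_installed())
```

If this prints False, re-running the download command inside the activated virtual environment is the first thing to try.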
-
Make sure you are using a compatible version of Python (3.9.13) before running the code.
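The version check can be done from Python itself. The sketch below assumes that matching the major and minor version (3.9) is what matters; whether other versions also work is not stated here. The function name `python_version_matches` is hypothetical.

```python
import sys

# Assumption: the expected version is taken from the 3.9.13 mentioned above.
EXPECTED = (3, 9)


def python_version_matches(version_info=sys.version_info, expected=EXPECTED) -> bool:
    """Compare only the major and minor numbers against the expected pair."""
    return tuple(version_info[:2]) == tuple(expected)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    if python_version_matches():
        print(f"Python {current} matches the expected 3.9 series.")
    else:
        print(f"Python {current} differs from the expected 3.9 series.")
```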
-
To preprocess the data and create the dataset, run the following command:
python ./src/data/make_dataset.py
You can provide the optional argument --size <CUT_SIZE> to specify the size of the data to be processed.
-
To train the models, run the following commands:
# for training the pytorch transformer
python ./src/models/pytorch_transformer/train.py --batch_size <BATCH_SIZE> --epochs <NUM_EPOCHS>
# for fine-tuning the t5 model
python ./src/models/t5/train.py --batch_size <BATCH_SIZE> --epochs <NUM_EPOCHS>
You can provide the arguments <BATCH_SIZE> and <NUM_EPOCHS> to specify the batch size and the number of training epochs.
-
To download weights, run the following command:
python ./src/data/download_weights.py
You can also provide --model <MODEL_NAME> (t5 or transformer) to download weights for a specific model.
-
To get prediction results on the test set (5000 text examples), run the following commands:
# for the pytorch transformer
python ./src/models/pytorch_transformer/predict.py
# for the t5 model
python ./src/models/t5/predict.py
-
To run inference on your own example:
# for the pytorch transformer
python ./src/models/pytorch_transformer/predict.py --inference "<YOUR_EXAMPLE>"
# for the t5 model
python ./src/models/t5/predict.py --inference "<YOUR_EXAMPLE>"
Replace <YOUR_EXAMPLE> with the text you want to detoxify.
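If you want to try several examples in a row, the inference command above can be wrapped in a short script. The following is a minimal sketch, assuming it is run from the repo root; the helper names and the dictionary keys "transformer" and "t5" are hypothetical labels, and only the script paths and the --inference flag come from the commands listed above.

```python
import subprocess
import sys
from typing import List

# Script paths as listed in this README; the keys are hypothetical labels.
PREDICT_SCRIPTS = {
    "transformer": "./src/models/pytorch_transformer/predict.py",
    "t5": "./src/models/t5/predict.py",
}


def build_inference_command(model: str, text: str) -> List[str]:
    """Build the argument list for a single --inference call."""
    return [sys.executable, PREDICT_SCRIPTS[model], "--inference", text]


def run_inference(model: str, texts: List[str]) -> None:
    """Run the predict script once per input text."""
    for text in texts:
        # check=True raises CalledProcessError if the script exits with an error.
        subprocess.run(build_inference_command(model, text), check=True)


if __name__ == "__main__":
    run_inference("t5", ["first example sentence", "second example sentence"])
```

Passing the command as a list avoids shell quoting problems when an example contains quotes or other special characters. Each call starts a new process, so this is convenient for a handful of examples rather than for large batches.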