- Dataset preparation: tag-and-generate/tagger-generator/tag-and-generate-data-prep
- Training, inference, evaluation: tag-and-generate/tagger-generator/tag-and-generate-train
We will now present an example of training the politeness transfer system from scratch. The process has five steps:
- Step 1: Getting the code
- Step 2: Getting the training data
- Step 3: Preparing parallel data for training
- Step 4: Training the tagger and generator
- Step 5: Running inference
We begin by cloning this repo:
git clone https://github.com/tag-and-generate/tagger-generator.git
The cloned folder contains: i) tag-and-generate-data-prep, the codebase used for creating the parallel tag-and-generate dataset, and ii) tag-and-generate-train, the training code.
Each of these folders has a requirements.txt file that can be used to install the dependencies.
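For example, from the directory where the repo was cloned (and ideally inside a virtual environment, such as the dl env shown in the shell prompts below), both sets of dependencies can be installed with pip:
pip install -r tagger-generator/tag-and-generate-data-prep/requirements.txt
pip install -r tagger-generator/tag-and-generate-train/requirements.txt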
Next, let's create a folder inside tagger-generator to save all the datasets/tags:
cd tagger-generator
mkdir data
The training data in a ready-to-use format is located here. Download the zip file to the data folder created above and extract politeness.tsv:
unzip politeness_processed.zip
head politeness.tsv
| txt | style | split |
|---|---|---|
| forwarded by tana jones / hou / ect on 09/28/2000 | P_2 | train |
| the clickpaper approvals for 9/27/00 are attached below . | P_7 | train |
| "hello everyone : please let me know if you have a subscription to "" telerate "" ?" | P_7 | train |
| we are being billed for this service and i do not know who is using it . | P_0 | train |
As we can see, the data is in TSV format and has the expected header (txt, style, split).
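As an optional sanity check (assuming the file is tab-separated with a header row, and running from the data folder where it was extracted), we can list the column names and count the rows in each split:
head -1 politeness.tsv | tr '\t' '\n'                  # column names
cut -f3 politeness.tsv | tail -n +2 | sort | uniq -c   # rows per split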
You can also use gdown to download the file directly:
gdown --id 1E9GHwmVM9DL9-KiaIaG5lm_oagLWe908
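gdown itself is a standard PyPI package; if it is not already available, it can be installed with pip:
pip install gdown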
Now that we have the codebase and the dataset, let's create the parallel data required for training the models. First, let's list the folder contents to make sure we are on the same page:
(dl) tutorial@sa:~/tagger-generator$ ls
data LICENSE README.md tag-and-generate-data-prep tag-and-generate-train
So, we are in the repo (tagger-generator) and can see the two code folders (tag-and-generate-data-prep and tag-and-generate-train), as well as the data folder (data). Further, the data folder contains the politeness.tsv file that we just downloaded:
(dl) tutorial@sa:~/tagger-generator$ ls data/
politeness_processed.zip politeness.tsv
We prepare the parallel data using tag-and-generate-data-prep:
cd tag-and-generate-data-prep
python src/run.py --data_pth ../data/politeness.tsv --outpath ../data/ --style_0_label P_9 --style_1_label P_0 --is_unimodal True
More details on these options can be found in tag-and-generate/tagger-generator/tag-and-generate-data-prep. In summary, we specify the input file, the label for the style of interest (P_9), and a neutral/contrastive style (P_0). Importantly, we specify --is_unimodal True. This option ensures that the parallel data is created as per the unimodal style setting (Figure 3 in the paper).
After data-prep finishes, we see several files in ../data/. The important ones are described below:
- P_9_tags.json: the politeness tags, i.e., phrases identified as markers of politeness, such as:
  - "thank"
  - "looking forward"
  - "glad"
  - "be interested"
- The data prep code creates two sets of training files: one for the tagger and another for the generator. To understand these, let's take the sample sentence "please get back to me if you have any additional concerns ." and look at how it is represented in the different files:
  - entagged_parallel.train.en (input to the tagger): back to me have concerns .
  - entagged_parallel.train.tagged (output of the tagger): [P_90] back to me [P_91] have [P_92] concerns .
  - engenerated_parallel.train.en (input to the generator): [P_90] back to me [P_91] have [P_92] concerns .
  - engenerated_parallel.train.generated (output of the generator): please get back to me if you have any additional concerns .

Here, P_9 is the style tag, and the number after the style tag captures the position of the tag in the sentence.
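To make the tag-and-position convention concrete, here is a small shell illustration (this is not the repo's data-prep code, and the phrase boundaries are hand-picked just to reproduce the example above; the actual phrase inventory comes from P_9_tags.json):
echo "please get back to me if you have any additional concerns ." \
  | sed -e 's/please get/[P_90]/' -e 's/if you/[P_91]/' -e 's/any additional/[P_92]/'
# prints: [P_90] back to me [P_91] have [P_92] concerns .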
With the data files in place, we are ready to run training.
All the training- and inference-related scripts are in tag-and-generate-train, so let's cd into it:
cd tag-and-generate-train
To prepare the files for training, we first process them using BPE (byte-pair encoding):
bash scripts/prepare_bpe.sh tagged ../data/
bash scripts/prepare_bpe.sh generated ../data/
We can now start training the tagger and generator:
nohup bash scripts/train_tagger.sh tagged politeness ../data/ > tagger.log &
nohup bash scripts/train_generator.sh generated politeness ../data/ > generator.log &
Here, politeness is a user-defined handle that we will use during inference.
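Since both jobs run in the background, their progress can be followed from the same shell:
tail -f tagger.log generator.log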
After training finishes, the best models (selected by validation perplexity) are stored in models:
(dl) tutorial@sa:~/tagger-generator/tag-and-generate-train$ ls models/politeness/bpe/
en-generated-generator.pt en-tagged-tagger.pt
For our run, at the end of 5 epochs, the validation perplexity was 1.26 for the tagger, and 1.76 for the generator.
Let's test the trained models on some sample sentences (press Ctrl-D after the last line to end the input):
(dl) tutorial@sa:~/tagger-generator/tag-and-generate-train$ cat > input.txt
send me the text files.
look into this issue.
bash scripts/inference.sh input.txt sample tagged generated politeness P_9 P_9 ../data/ 3
Here, sample is a unique identifier for the inference job, and politeness is the identifier we used for the training job. P_9 is the style tag (kept the same for unimodal jobs). Please see the README at tag-and-generate/tagger-generator/tag-and-generate-train for more details.
The final and intermediate outputs are located in the experiments folder:
(dl) tutorial@sa:~/tagger-generator/tag-and-generate-train$ ls experiments/sample_*
experiments/sample_generator_input experiments/sample_tagged
experiments/sample_output experiments/sample_tagger_input
Let's look at the final output:
(dl) tutorial@sa:~/tagger-generator/tag-and-generate-train$ cat experiments/sample_output
please send me the text files.
we would like to look into this issue.
Not bad!
We hope this walkthrough is helpful in understanding and using the codebase. Here are some additional helpful links: