Replication of finetuning code
Hello, I want to try finetuning your model with my own data, but I have two questions:
- I am trying to replicate your finetuning code, but when I try finetuning the larger versions of FLAN-T5 I run into memory capacity issues. I am just using the WordNet dataset from Hugging Face, training for one epoch with a batch size of 1 and reduced sequence lengths. It also does not appear to run on multiple nodes. How could I solve this?
- How should I format my data in order to use it for further finetuning?
Thank you for any assistance here.
I suggest giving up on the reproduction, my friend.
Code tastes bitter and Truth goes opaque.
- What exact version of FLAN-T5 are you using, and what fine-tuning parameters? To fine-tune FLAN-T5 Large, 40 GB of GPU RAM should be enough (probably even 24). For the XL version, you'll need more. We did not fine-tune on multiple nodes, but used multiple GPUs on one node to increase the global batch size - it worked fine. To be more precise, using 8 GPUs with 64 GB of RAM each allowed us to fine-tune FLAN-T5 XL with a global batch size of 32 (we also set `gradient_accumulation_steps=4` and truncated the maximum input length to 160 tokens). You can probably go even beyond that by using reduced precision.
- Our fine-tuning code assumes your training dataset is a tab-separated file with two columns: `examples` and `definitions`. The validation dataset should be in the same format, of course.
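For illustration, the first lines of such a file could look roughly like this (just a sketch with made-up content; one tab separates the two columns, and the header row / column names should match whatever the fine-tuning script expects):
```
examples	definitions
a cup of strong coffee	a beverage made from roasted, ground coffee beans
the dog barked all night	to make the short loud cry characteristic of a dog
```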
Any other questions are welcome.
@jacklanda I am not sure what you mean by that?
I'm guessing poetry generation 🍷
@akutuzov Thank you for your response. The parameters I have been testing with have been:
--model_name_or_path="google/flan-t5-xl"
--cache_dir="/vilhelm/.cache/"
--do_train
--do_eval
--dataset_name="marksverdhei/wordnet-definitions-en-2021"
--output_dir="/vilhelm/finetune_output/"
--overwrite_output_dir
--evaluation_strategy=epoch
--logging_strategy=epoch
--per_device_train_batch_size=1
--per_device_eval_batch_size=1
--predict_with_generate
--save_total_limit=5
--max_source_length=5
--max_target_length=5
--fp16=True
--num_train_epochs=1
--save_strategy=epoch
--load_best_model_at_end=True
--metric_for_best_model=eval_rouge1
--ddp_find_unused_parameters=False
--optim=adafactor \
I have been running it using 4 32GB V100 GPUs on the Puhti supercomputer, on a single node.
@VilhelmHovland I believe the root of your troubles is this line:
--dataset_name="marksverdhei/wordnet-definitions-en-2021"
You are trying to use the Wordnet dataset directly as it is on HF. We didn't try that, and I doubt the fine-tuning script deals with this well. As mentioned before, we fine-tune on tab-separated files with two columns: `examples` and `definitions`, without directly using the `datasets` library. This allows much more flexibility. You should point to the training and validation data files with these arguments:
--train_file ${TRAIN_DATASET} \
--validation_file ${VAL_DATASET} \
(see the example here)
Note that the examples should already be augmented with the instruction prompt ("What is the definition of TARGET_WORD?" or whatever prompt you are using).
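For illustration, here is a rough sketch of how one could build such a file from raw word / example / definition triples (the file names, the raw input format and the exact prompt wording here are placeholders, not necessarily what we used):
```python
import csv

# The prompt wording, file names and raw input format below are
# placeholders - adapt them to your own data.
PROMPT = "What is the definition of {word}?"

with open("raw_triples.tsv", encoding="utf-8") as inp, \
        open("train.tsv", "w", encoding="utf-8", newline="") as out:
    reader = csv.reader(inp, delimiter="\t")
    writer = csv.writer(out, delimiter="\t")
    # header row: use whatever column names the fine-tuning script expects
    writer.writerow(["examples", "definitions"])
    for word, example, definition in reader:
        # append the instruction prompt to the usage example
        writer.writerow([f"{example} {PROMPT.format(word=word)}", definition])
```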
@akutuzov I see, thank you. Is the exact data you used available anywhere, or do I need to process the CoDWoE and naacl data?
"naacl data" means datasets from Ishivatari et al 2019, right?
Then yes, you'll have to convert them to the tab-separated format I described above. Same with CoDWoE - it comes as `json` files, but it's trivial to convert them to `.tsv`.
We did not publish our converted versions, since we felt it would not be polite to re-distribute datasets created by others (simply saved in another format). Again, it should be trivial to convert these datasets to `.tsv` and add the instruction prompt.
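For example, a minimal sketch of such a conversion could look like the following (the file name and the JSON field names used below - "word", "example", "gloss" - are placeholders; check them against the actual CoDWoE files):
```python
import csv
import json

PROMPT = "What is the definition of {word}?"

# "codwoe_train.json" and the keys "word", "example", "gloss" are placeholders;
# adapt them to the actual CoDWoE files.
with open("codwoe_train.json", encoding="utf-8") as inp:
    entries = json.load(inp)

with open("codwoe_train.tsv", "w", encoding="utf-8", newline="") as out:
    writer = csv.writer(out, delimiter="\t")
    writer.writerow(["examples", "definitions"])
    for entry in entries:
        example_with_prompt = f'{entry["example"]} {PROMPT.format(word=entry["word"])}'
        writer.writerow([example_with_prompt, entry["gloss"]])
```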
If you encounter any difficulties with that, get in touch with me and I'll share our preprocessed files privately.
Hello again, I have now changed my data, but I am still getting the same error. I am using the same parameters, except pointing directly to the data files. I formatted them like this, in .tsv files; does this look correct? What else could be causing issues?
```
example	definition
cranial pressure What is the definition of cranial?	of or relating to the cranium which encloses the brain
an easy job What is the definition of easy?	posing no difficulty
```
@VilhelmHovland did you try to fine-tune a smaller model (`flan-t5-base`, for example), and/or removing the `--fp16=True` argument?
@VilhelmHovland I've just tried to fine-tune the `flan-t5-base` model on the few lines you quoted above. I repeated them multiple times, so that in the end I got a file with 12 instances (the file is here).
On this toy dataset, fine-tuning with batch size 4 and 2 epochs completed without any issues. I used one A100 GPU with 40 GB of RAM. Here is the exact command:
python3 finetune_flan.py \
--model_name_or_path google/flan-t5-base \
--do_train \
--do_eval \
--train_file example_dataset.tsv \
--validation_file example_dataset.tsv \
--output_dir test_model \
--overwrite_output_dir \
--evaluation_strategy=epoch \
--logging_strategy=epoch \
--per_device_train_batch_size=4 \
--per_device_eval_batch_size=4 \
--predict_with_generate \
--save_total_limit=5 \
--max_source_length=192 \
--max_target_length=128 \
--bf16=False \
--num_train_epochs=2 \
--save_strategy=epoch \
--load_best_model_at_end=True \
--metric_for_best_model=eval_rouge1 \
--ddp_find_unused_parameters=False \
--optim=adafactor \
--report_to=none \
Okay, I tried as well, it does work now, thank you. What would be the bottleneck for finetuning the larger models then? Is there any way I could get it to work for those as well?
Well, the usual procedure: set the per-device batch size to 1, and then increase it until you hit an out-of-memory error again. This will be your ceiling in terms of RAM. Often, you can increase the batch size even more by using gradient accumulation (at the cost of slower training).
Using more than one GPU (within one node) will also naturally allow you to have a larger global batch size, which is usually a good thing.
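To illustrate the arithmetic: with 4 GPUs, `--per_device_train_batch_size=2` and `--gradient_accumulation_steps=4`, the effective global batch size is 2 × 4 × 4 = 32, even though each GPU only ever holds two examples at a time.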
@akutuzov Hello, thank you very much for the help earlier. I was hoping you could give me some more advice, as I am still working with this model. The model does not seem to be learning: from what I can see in the logging, the loss starts and stays at 0.0 (though the logging seems very limited and only shows a single epoch), so I suspect the issue is still with the training data. Attached are the batch script and a data sample (in tsv format).
Hi @VilhelmHovland
Are you trying to fine-tune the already fine-tuned model `ltg/flan-t5-definition-en-base`? Where do the examples in your data file come from? If they come from CoDWoE, WordNet or Oxford, then no wonder the loss is zero: the model has already seen these definitions during our fine-tuning.
Otherwise, your SLURM script and data look good (of course, I hope that in reality you train on more examples: in the attached file I see only 2, which won't work with the batch size of 4 specified in the SLURM script).
Yes, I am fine-tuning the already fine-tuned model, using definitions from the Historical Thesaurus of English (from the Oxford English Dictionary), so I was expecting a very high loss. And yes, that was just sample data to show the structure; the dataset I have is fairly large.
Try to remove `--fp16=True` from the arguments of your run script.
Mixed precision can sometimes cause problems; in particular, fp16 is known to be numerically unstable with T5-family models and can lead to exactly this kind of NaN or zero loss.