How to avoid the long wait before training starts?
Dear developer,
Thanks for the great sentence-transformers library!
I am finetuning the sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 using my own data following the tutorial from: https://sbert.net/docs/sentence_transformer/training_overview.html
I first finetuned it with a toy dataset containing only a few hundred triplet samples; everything was fine and the finetuning was very fast.
After that, I finetuned it with the real, large dataset containing 100 million triplet samples. I found that it took a long time (about 60 minutes) before training started, and the larger the dataset, the longer the wait.
Specifically:
- It first spent 5 minutes on `Generating train split`.
- Then it spent 30 minutes on dataset mapping.
- After that, it printed: `Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.`
- Then it waited about 60 minutes before the real training started.
During those 60 minutes, the GPU was working but its utilization was relatively low (30%) and the GPU memory was not being used. What's more, no log information was printed at all during that time. Was it doing something like data preparation or tokenization? Could you tell me what it was doing, and how to avoid this long waiting time?
After the 60-minute wait, the real training started: the GPU utilization was as high as 80%, and around 70 GB of GPU memory was used on an H100. What's more, the training progress bar printed something like `x/y [69:08:34<130:13:54, 1.09it/s]`, so I knew it was training.
I also have another dataset that is 10 times larger than the 100 million triplet samples, and I worry that I would have to wait days for training to start if I use that huge dataset.
Could you tell me what it was doing during the 60-minute wait, and how to avoid this long waiting time?
Thank you very much, and I look forward to your reply.
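For context, a training setup along the lines of that tutorial looks roughly like the sketch below. The data file, split sizes, column names, loss, and evaluator are assumptions for illustration, not the exact script used in this issue:

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.evaluation import TripletEvaluator
from sentence_transformers.losses import MultipleNegativesRankingLoss

# 1. Load the base model to finetune
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# 2. Load a triplet dataset (illustrative file and column names: anchor/positive/negative)
dataset = load_dataset("json", data_files="triplets.jsonl", split="train")
dataset = dataset.train_test_split(test_size=1_000)
train_dataset, eval_dataset = dataset["train"], dataset["test"]

# 3. Loss and training arguments
loss = MultipleNegativesRankingLoss(model)
args = SentenceTransformerTrainingArguments(output_dir="models/minilm-triplets")

# 4. (Optional) Evaluate the base model before training; on a large eval split
#    this step alone can take a long time before the first training step.
dev_evaluator = TripletEvaluator(
    anchors=eval_dataset["anchor"],
    positives=eval_dataset["positive"],
    negatives=eval_dataset["negative"],
)
dev_evaluator(model)

# 5. Train
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
    evaluator=dev_evaluator,
)
trainer.train()
```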
Hello!
Apologies for the delay, I have been busy with a release.
I'm sorry to hear that you've been getting such long delays. I'd like to make something clear: this is not the expected behaviour.
Let's dive in:
- It first spent 5 minutes on `Generating train split`.
  - This originates in `datasets` here, and it means that it's downloading and preparing the training data. Not much we can do here, I reckon.
- Then it spent 30 minutes on dataset mapping.
  - I actually can't figure out where this originates, but I heavily suspect that it's also caused by `datasets`. A lot of `datasets` operations are mappings. Quick question: are you using the `prompts` argument in `SentenceTransformerTrainingArguments`? This does a `datasets` operation behind the scenes (a rough sketch of what I mean is right after this list).
- After that, it printed `Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.`
  - I've never had this myself, but I usually use WSL2 or Windows. Interestingly, when I Google this error, I see an issue that mentions hanging indefinitely (https://discuss.huggingface.co/t/dataset-transform-hangs-indefinitely-while-finetuning-the-stable-diffusion-xl/59938/3). Sadly, no resolution, but it's possible that upgrading the Linux kernel is an option here.
- Then it waited about 60 minutes to start the real training.
  - This is rather unexpected. For reference, I recently experimented with training a model with 80m training samples, and it took ~22 seconds to load the model, datasets, loss, training arguments, trainer, W&B, and to finish the first training step.
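For reference, the `prompts` argument mentioned above looks roughly as follows; the output directory, column names, and prompt strings are illustrative. When it is set, the trainer applies the prompts to the dataset columns through a `datasets` operation behind the scenes, which can show up as exactly this kind of mapping step:

```python
from sentence_transformers import SentenceTransformerTrainingArguments

# Illustrative only: per-column prompts that the trainer applies to the training
# data through a `datasets` operation before training starts.
args = SentenceTransformerTrainingArguments(
    output_dir="models/my-finetune",
    prompts={
        "anchor": "query: ",
        "positive": "document: ",
        "negative": "document: ",
    },
)
```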
The low-but-not-zero GPU utilization could mean that prior to training, you are running a very large evaluator (which doesn't log anything unless you set your logging level to INFO). A lot of the training scripts run an evaluator prior to training, because the evaluation results will be automatically included in the model card, so you can easily see the effect of training by comparing before vs. after.
The low GPU memory usage could mean that the batch size is too low, and the 60 minutes seems to indicate that the evaluator is much too large.
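To check whether a large evaluator is indeed what is running during the silent period, INFO-level logging makes its progress and results visible. A minimal sketch:

```python
import logging

# With INFO-level logging, evaluator progress and results are printed,
# so a long pre-training evaluation no longer looks like a silent hang.
logging.basicConfig(format="%(asctime)s - %(message)s", level=logging.INFO)
```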
Beyond that, before training officially starts, the `ModelCardCallback` is triggered and it will gather information about the training and evaluation datasets. This information will later be included in the model card. It should not take too long, but this may be a cause.
My recommendations:
- Verify if you're running an evaluator before the Trainer, e.g. something like:

  ```python
  # 6. (Optional) Create an evaluator & evaluate the base model
  evaluator = NanoBEIREvaluator()
  evaluator(model)
  ```
- Disable the `ModelCardCallback` by overriding the `add_model_card_callback` method with a no-op. Then you can train with `CustomSentenceTransformerTrainer` as normal:

  ```python
  from sentence_transformers import SentenceTransformerTrainer

  class CustomSentenceTransformerTrainer(SentenceTransformerTrainer):
      def add_model_card_callback(self, *args, **kwargs):
          pass
  ```
- Dump a traceback every e.g. 20 seconds - this lets you see exactly where the code is spending a long time. This would be rather useful, I think. I've used it a lot myself when things were "stuck":

  ```python
  # Make the traceback be dumped periodically - every X seconds
  import faulthandler

  faulthandler.dump_traceback_later(20, repeat=True)
  ```
- Experiment with `datasets` - I'm a bit surprised that the `datasets` operations are taking this long. It could indicate that your device has very slow CPUs, which could be a cause as well. (A rough sketch of moving this work out of the training run is right below.)
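As a rough sketch of the last recommendation, the `datasets` work can be pulled out of the training run entirely: build the dataset once with several worker processes, cache it to disk, and reload the prepared dataset in the training script. The file paths and `num_proc` value are illustrative:

```python
from datasets import load_dataset, load_from_disk

# One-off preparation: build the dataset with multiple processes and cache it,
# so later training runs skip the "Generating train split" step.
train_dataset = load_dataset("json", data_files="triplets.jsonl", split="train", num_proc=8)
train_dataset.save_to_disk("prepared/triplets")

# In the training script, load the already-prepared dataset from the cache:
train_dataset = load_from_disk("prepared/triplets")
```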
Those are my ideas at this time; perhaps they can help you get past this. I do feel like it might be as simple as "slow CPUs", though! Or perhaps something with Linux, considering I've not encountered this myself.
- Tom Aarsen
Thank you very much. After profiling, I found that it was the `TripletEvaluator` that caused the long waiting time before training started. After removing the `TripletEvaluator`, training started immediately. Thank you!
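Removing the evaluator is not the only option; running it on a small subsample of the evaluation split also keeps the pre-training evaluation cheap. A minimal sketch, assuming an `eval_dataset` with `anchor`/`positive`/`negative` columns and an already-loaded `model` (the column names and subset size are illustrative):

```python
from sentence_transformers.evaluation import TripletEvaluator

# Evaluate on only the first 1,000 triplets instead of the full evaluation split,
# so the pre-training evaluation finishes quickly.
eval_subset = eval_dataset.select(range(1_000))
dev_evaluator = TripletEvaluator(
    anchors=eval_subset["anchor"],
    positives=eval_subset["positive"],
    negatives=eval_subset["negative"],
    name="dev-subset",
)
dev_evaluator(model)
```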
I'm very happy to hear that you got it working! It got me a bit concerned 😆
- Tom Aarsen
@tomaarsen Dear Tom,
Just a quick question. Although the `TripletEvaluator` time was avoided, the `Generating train split` and dataset mapping steps were still time consuming, especially the dataset mapping when using a huge dataset.
As far as I understand, the dataset mapping is pipelined in parallel with the model training. Specifically, the CPU maps one batch of samples, the GPU trains on that batch, and at the same time the CPU maps the next batch of samples...
But what I saw was that at the beginning, it used 100% of the CPU to map the whole dataset while the GPU did nothing. After the dataset mapping was done (30 minutes for small datasets, hours for big ones), CPU utilization dropped to 10% and the GPU started training with utilization as high as 80%. Why was that? How can I parallelize the dataset mapping and the model training?
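For what it's worth, one pattern that gives the pipelined behaviour described above (the CPU preparing upcoming batches while the GPU trains) is streaming the dataset: an `IterableDataset` applies `.map()` lazily during iteration instead of materializing the whole mapped dataset up front. Whether the trainer's own internal mapping also runs lazily in this mode is not confirmed here; the file path and the mapping function are illustrative:

```python
from datasets import load_dataset

# Streaming returns an IterableDataset: records are read and any .map() is applied
# on the fly while iterating, instead of in one long up-front pass.
train_dataset = load_dataset("json", data_files="triplets.jsonl", split="train", streaming=True)

# This map is lazy; nothing runs until the data is actually iterated during training.
train_dataset = train_dataset.map(lambda example: {"anchor": example["anchor"].strip()})
```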