X-LANCE/text2sql-lgesql

Data batching time takes up most of the training time, how to improve it?

nixonjin opened this issue · 2 comments

I find that data batching takes up most of the training time. Have you tried using the DataLoader class to accelerate data batching? Or is DataLoader not suitable for this project because it may run out of memory when using multiple workers?

I printed out the detailed time breakdown. Data batching only costs about 2% of training time. Actually, the tokenization and word_to_id mapping operations are pre-done once during dataset loading, so batching should not take that much time theoretically.
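For reference, one quick way to check the split on your own machine is to time the two phases of the training loop separately. This is only a sketch: it assumes a `Batch.from_example_list` helper like the one in utils/batch.py, a `model` that returns the loss directly, and ignores CUDA asynchrony (add `torch.cuda.synchronize()` before each timestamp if you want exact GPU numbers).

```python
import time

# Assumed names: train_dataset (list of preprocessed examples), batch_size,
# device, model, optimizer, and Batch from utils/batch.py.
batch_time, step_time = 0.0, 0.0
for i in range(0, len(train_dataset), batch_size):
    t0 = time.time()
    # Build the batch exactly as the training script does.
    batch = Batch.from_example_list(train_dataset[i:i + batch_size], device, train=True)
    t1 = time.time()
    loss = model(batch)            # forward (assumed to return the loss)
    loss.backward()                # backward
    optimizer.step()
    optimizer.zero_grad()
    t2 = time.time()
    batch_time += t1 - t0
    step_time += t2 - t1

print('batching: %.1f%% of epoch time' % (100 * batch_time / (batch_time + step_time)))
```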

But you can also wrap the data batching function in a DataLoader and try more workers. I suppose you need to define the collate_fn yourself, following the operations in utils/batch.py, as in the sketch below.
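A minimal sketch of that wrapping, assuming utils/batch.py exposes something like `Batch.from_example_list(ex_list, device, train=True)` and that the training examples live in a plain Python list; adjust the names to the actual helpers in the repo.

```python
import functools

import torch
from torch.utils.data import DataLoader

from utils.batch import Batch  # assumed helper; check the actual API in utils/batch.py


def collate_examples(ex_list, device, train=True):
    # DataLoader hands collate_fn a list of dataset items; forward that list to
    # the existing batching logic so tensors are built in the worker processes
    # instead of the main training loop.
    return Batch.from_example_list(ex_list, device, train=train)


# Collate on CPU inside the workers and move tensors to the GPU in the main
# loop; passing a CUDA device into worker processes is error-prone.
train_loader = DataLoader(
    train_dataset,                 # the preprocessed example list
    batch_size=20,
    shuffle=True,
    num_workers=4,                 # each worker keeps its own copy of the dataset, so watch memory
    collate_fn=functools.partial(collate_examples, device=torch.device('cpu')),
)

for batch in train_loader:
    ...  # move batch tensors to the GPU, then forward/backward as before
```

Note that `functools.partial` of a module-level function is used instead of a closure so the collate_fn stays picklable when `num_workers > 0`.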

Thanks for your reply. There may have been some problem with my server; I tested the program on another machine and it performs as you reported.