Extremely large RAM consumption by create_pretraining_data.py
sultanovazamat opened this issue · 3 comments
Hi, everyone!
I am trying to train ALBERT from scratch on a multilingual dataset of ~40GB.
I trained the SentencePiece model without any problems, but when I launch the create_pretraining_data.py script it consumes an extremely large amount of RAM; even 1TB is not enough.
So the question is: how much memory does it require?
And could the issue be related to the presence of non-Latin languages in the dataset?
Thanks!
We ran into this issue as well. We solved it by splitting the corpus into ~100MB files and running them as separate processes through a caller script that launched one process per CPU on our system. Each process used ~4GB of memory.
We found https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor useful for managing the processes, running one right after another with a pool size equal to the number of CPUs.
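A minimal sketch of such a caller script is below. It assumes the pre-split ~100MB chunks sit in an /in directory and reuses the flag values from the command further down; the paths and flag values are illustrative, not our exact setup.

import os
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

IN_DIR = Path("/in")    # pre-split ~100MB text chunks (illustrative path)
OUT_DIR = Path("/out")  # one .tfrecord per chunk (illustrative path)

def process_chunk(chunk: Path) -> int:
    """Run create_pretraining_data on a single chunk in its own process."""
    out_file = OUT_DIR / (chunk.name + ".tfrecord")
    cmd = [
        "python", "-m", "create_pretraining_data",
        f"--input_file={chunk}",
        f"--output_file={out_file}",
        "--vocab_file=/albert/assets/30k-clean.vocab",
        "--spm_model_file=/albert/assets/30k-clean.model",
        "--max_seq_length=256",
        "--dupe_factor=1",
        "--masked_lm_prob=0.15",
        "--max_predictions_per_seq=38",
    ]
    return subprocess.run(cmd, check=True).returncode

if __name__ == "__main__":
    chunks = sorted(IN_DIR.glob("*.txt"))
    # One worker thread per CPU; each thread only waits on its subprocess,
    # so at most os.cpu_count() create_pretraining_data processes run at once.
    with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
        for _ in pool.map(process_chunk, chunks):
            pass  # check=True already raises if a chunk exits with a non-zero code

Threads are enough here because the heavy lifting happens in the child processes; the pool only throttles how many run concurrently.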
Hi! @jeisinge
Thanks for the great solution!
Could you please share the exact time spent on processing one chunk (~100MB)?
It takes us about 25 minutes for one file; obviously, this is an embarrassingly parallelizable task.
Also, we adjusted the parameters a bit because we have a very large corpus: we felt we didn't need to augment our data (it is not yet clear whether this was a good idea). The command we are running looks like:
python -m create_pretraining_data \
  --input_file=/in/part-00033-tid-2788136423935398351-765d3136-2064-43bd-831b-ed3e65a30183-5151-1-c000.txt \
  --output_file=/out/part-00033-tid-2788136423935398351-765d3136-2064-43bd-831b-ed3e65a30183-5151-1-c000.txt.tfrecord \
  --vocab_file=/albert/assets/30k-clean.vocab \
  --spm_model_file=/albert/assets/30k-clean.model \
  --max_seq_length=256 \
  --dupe_factor=1 \
  --masked_lm_prob=0.15 \
  --max_predictions_per_seq=38
The most important parameters for data size, I believe, are dupe_factor and max_seq_length.