thu-ml/RoboticsDiffusionTransformer

Dataloader thread issues

Closed this issue · 1 comment

Thanks for the awesome paper and model! I was experimenting with multi-node training, and sometimes (I can't trace the cause) I run into the following error, which comes from the LLM's tokenizer:
Error catched when processing sample from viola: The global thread pool has not been initialized.: ThreadPoolBuildError { kind: GlobalPoolAlreadyInitialized }
Have you seen this error before? I run the producer, fill the buffer, and then run the training.

I assume the producer is holding on to some of the threads that the fast (Rust) tokenizer for T5 needs.

Maybe you can disable the parallelism of the tokenizer:

import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
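As a rough sketch of where this setting could go, assuming the producer and the training job are separate Python processes that both use the tokenizer: set the variable at the top of each entry script, before the tokenizer is first used. The helper `disable_tokenizer_parallelism` and the `worker_init_fn` hook below are illustrative names, not part of the RDT codebase.

```python
import os


def disable_tokenizer_parallelism() -> None:
    """Ask the Hugging Face tokenizers library not to parallelize in Rust."""
    # Set this before the tokenizer is first used in the process.
    os.environ["TOKENIZERS_PARALLELISM"] = "false"


def worker_init_fn(worker_id: int) -> None:
    # Hypothetical hook: could be passed as DataLoader(worker_init_fn=...)
    # so that every dataloader worker process applies the same setting.
    disable_tokenizer_parallelism()


# Call once at the very top of the producer script and the training script.
disable_tokenizer_parallelism()
```

Worker processes normally inherit the parent's environment, so the `worker_init_fn` hook is probably redundant; it is only there as a safeguard in case a worker is started with a different environment.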

In any case, I would suggest you use the HDF5VLADataset rather than the TensorFlow Dataset. That way, you will not need any producer...