allenai/dolma

Issue with ring tokenizer


This line seems to throw a division-by-zero error when ring_size < len(source_paths).

It seems that len(tokenizer_ring) gets decremented here. The break only exits the inner loop, so the outer loop keeps going and divides by 0.
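To illustrate the failure mode I think I'm seeing, here is a minimal, self-contained sketch. It is not the actual dolma code: the function names, the `chunk` parameter, and the list-of-lists inputs are all hypothetical, and the guarded variant is only one possible way to avoid the error, not necessarily the right fix for the tokenizer.

```python
from typing import Iterator, List


def drain_ring_buggy(sources: List[List[int]], ring_size: int, chunk: int = 4) -> List[int]:
    """Hypothetical round-robin reader that reproduces the suspected bug."""
    pending: List[Iterator[int]] = [iter(s) for s in sources]
    # Fill the ring with at most ring_size iterators; the rest stay pending.
    ring = [pending.pop() for _ in range(min(ring_size, len(pending)))]
    out: List[int] = []
    # The outer loop keeps running while sources are pending, even if the ring is empty.
    while pending or ring:
        for i in range(chunk):
            j = i % len(ring)  # ZeroDivisionError once the ring has been emptied
            try:
                out.append(next(ring[j]))
            except StopIteration:
                ring.pop(j)  # ring shrinks, pending sources are never loaded
                break  # leaves only the inner loop
    return out


def drain_ring_guarded(sources: List[List[int]], ring_size: int, chunk: int = 4) -> List[int]:
    """Same sketch, but exhausted slots are refilled and the ring is never indexed when empty.

    Assumes ring_size >= 1.
    """
    pending: List[Iterator[int]] = [iter(s) for s in sources]
    ring = [pending.pop() for _ in range(min(ring_size, len(pending)))]
    out: List[int] = []
    while ring:
        for i in range(chunk):
            j = i % len(ring)
            try:
                out.append(next(ring[j]))
            except StopIteration:
                if pending:
                    ring[j] = pending.pop()  # reuse the slot for the next source
                else:
                    ring.pop(j)
                if not ring:
                    break  # nothing left to read from
    return out
```

With three sources and a ring of size 2, the first function raises ZeroDivisionError, while the second reads every item; with a ring of size 3 both finish normally.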

I'm not exactly sure what the right fix is. Things seem to work fine as long as ring_size * processes >= num_files. Any clarity here would be appreciated, thanks!