r-three/common-pile

Efficient Reshard Tool


Some of the preprocessing can remove all of a document's content (and some documents in some sources are blank to begin with). We should have an efficient way to remove these documents (there is an example in the dolma docs) and reshard the remaining ones so that the resulting shards are balanced.
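To make the request concrete, here is a minimal sketch of the filter-and-reshard step in plain Python. It is not the script from the dolma docs and does not use the Dolma library. It assumes the shards are gzipped JSONL files with a "text" field, and the function names, the output file naming, and the `max_bytes` target are all made up for illustration. Shards are balanced only in the sense that each one is filled up to roughly the same uncompressed size before the next one is started, so the last shard may be smaller.

```python
import gzip
import json
from pathlib import Path
from typing import Iterable, Iterator


def iter_nonempty_docs(paths: Iterable[Path]) -> Iterator[dict]:
    """Yield documents whose "text" field has non-whitespace content."""
    for path in paths:
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue  # skip blank lines in the shard itself
                doc = json.loads(line)
                # Treat a missing, null, or whitespace-only "text" as empty.
                if (doc.get("text") or "").strip():
                    yield doc


def reshard(paths: Iterable[Path], out_dir, max_bytes: int = 1_000_000_000) -> list[Path]:
    """Write non-empty docs into new shards of about max_bytes (uncompressed)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written: list[Path] = []
    out = None
    size = 0
    try:
        for doc in iter_nonempty_docs(paths):
            line = json.dumps(doc, ensure_ascii=False) + "\n"
            nbytes = len(line.encode("utf-8"))
            # Start a new shard when the current one would exceed the target.
            if out is None or (size > 0 and size + nbytes > max_bytes):
                if out is not None:
                    out.close()
                shard = out_dir / f"{len(written):05d}.jsonl.gz"  # hypothetical naming
                out = gzip.open(shard, "wt", encoding="utf-8")
                written.append(shard)
                size = 0
            out.write(line)
            size += nbytes
    finally:
        if out is not None:
            out.close()
    return written
```

This writes a filtered copy rather than modifying the input in place, and it runs in a single process; a parallel version would need a different way to decide shard boundaries.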

I took a glance at this to familiarize myself with the Dolma library. Would an efficient solution be to run dolma tag to tag documents with the char_length_v1 tag, then run dolma mix to exclude docs for which this tag value is 0, creating a filtered copy?

Or would it be preferable to do this in place? If a filtered copy is fine, I could implement this ASAP. If in-place is preferred, I can look some more at how dolma does (re)sharding.

EDIT: Ah, sorry, I just found https://github.com/allenai/dolma/blob/main/scripts/remove_empty_docs.py and am taking a look at that now.

An example where this would be useful: https://huggingface.co/datasets/blester125/foodista-dolma/tree/main/v0/documents

For the foodista data, the raw HTML is first saved in the dolma format, and then the content is parsed out with a dolma parallel processor. This results in "text" fields that are much smaller, and the resulting dolma shards aren't really worth being shards lol