allenai/dolma

A Question about the meaning of dolma_v1.6_cc_en

Issue closed · 1 comment

Hello, I found that dolma_v1.6_cc_en is split into cc_en_head, cc_en_middle, and cc_en_tail. What do these names mean?

Hi @aleien95,

The names refer to the buckets into which the CCNet pipeline sorts documents extracted from Common Crawl. The pipeline uses a KenLM statistical language model to estimate how similar each document is to Wikipedia pages. The most similar documents are placed in cc_en_head, the next most similar in cc_en_middle, and the least similar in cc_en_tail.

We retain the same layout out of convenience.
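
As a rough illustration of the bucketing idea (a sketch, not the actual CCNet code), the snippet below assumes each document already has a perplexity score from the language model, with lower perplexity meaning closer to Wikipedia. The function name `bucket_for_perplexity` and the two cutoff values are hypothetical and chosen only for the example; CCNet derives its own thresholds.

```python
# Hypothetical cutoffs for illustration only; these are NOT the
# thresholds used by CCNet or Dolma.
HEAD_MAX_PPL = 300.0
MIDDLE_MAX_PPL = 900.0


def bucket_for_perplexity(ppl: float,
                          head_max: float = HEAD_MAX_PPL,
                          middle_max: float = MIDDLE_MAX_PPL) -> str:
    """Map a perplexity score to a bucket name.

    Assumes lower perplexity means the document is more Wikipedia-like.
    """
    if ppl < 0:
        raise ValueError("perplexity must be non-negative")
    if ppl <= head_max:
        return "cc_en_head"
    if ppl <= middle_max:
        return "cc_en_middle"
    return "cc_en_tail"


if __name__ == "__main__":
    # Made-up scores, one per bucket.
    for score in (120.0, 450.0, 2000.0):
        print(score, bucket_for_perplexity(score))
```

With these made-up cutoffs, a score of 120 lands in cc_en_head, 450 in cc_en_middle, and 2000 in cc_en_tail.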

Hope this helps! Feel free to reopen this issue if you have more questions.

Best,
Luca