allenai/dolma

Blank documents in common crawl data

Closed this issue · 4 comments

jtalmi commented

Hi, I've been exploring the Common Crawl data, which I downloaded from Hugging Face, and I noticed that a lot of rows seem to have blank text.

For example, in data/common-crawl/cc_en_head/cc_en_head-0000.json.gz, I found that ~12.25% of rows had empty text fields.


Query:

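select
count(*) as all_rows
-- rows whose text is the empty string (not SQL NULL)
, sum(case when text = '' then 1 else 0 end) as blank_text
-- 1.0 * avoids integer division in dialects that would otherwise truncate
, 1.0 * sum(case when text = '' then 1 else 0 end) / count(*) as ratio
from cc_en_head_0000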

Example ID:
http://009.housedems.com/article/dems-call-restoration-stolen-wages-mi-workers
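
For anyone who wants to reproduce the check without loading the file into a database, here is a minimal Python sketch. It assumes the file is gzipped JSON lines with a `text` field per row, as the query above implies; the function name `blank_text_ratio` is made up for illustration. Unlike the query, it also counts whitespace-only text as blank.

```python
import gzip
import json


def blank_text_ratio(path):
    """Return (total_rows, blank_rows, ratio) for a gzipped JSON-lines file.

    Assumption: one JSON object per line, each with an optional "text" field.
    """
    total = 0
    blank = 0
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip empty lines in the file itself
            doc = json.loads(line)
            total += 1
            text = doc.get("text") or ""  # treat missing or null text as empty
            if not text.strip():
                blank += 1
    ratio = blank / total if total else 0.0
    return total, blank, ratio


# Example call (path taken from the report above):
# blank_text_ratio("data/common-crawl/cc_en_head/cc_en_head-0000.json.gz")
```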

jtalmi commented

Also, what is the difference between the three subfolders: cc_en_head, cc_en_middle, cc_en_tail? I couldn't find any information about this in the data sheet. I checked a file in cc_en_head and cc_en_middle and both have blanks.

> Also, what is the difference between the three subfolders: cc_en_head, cc_en_middle, cc_en_tail? I couldn't find any information about this in the data sheet. I checked a file in cc_en_head and cc_en_middle and both have blanks.

See sections 3.4 and 5.2 of the CC-net paper for a description. In short, CC-net uses a 5-gram Kneser-Ney language model trained on Wikipedia to compute the perplexity (ppl) of each document in the Common Crawl corpus. The lower the perplexity, the more similar the document is to Wikipedia, and the lowest-perplexity documents go into the head part. (Even if a document falls in the tail part, that does not necessarily mean it is bad; it only means its vocabulary distribution differs from Wikipedia's.)
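
As a rough illustration of the idea described above, the sketch below computes perplexity from per-token log-probabilities and then splits documents into head, middle, and tail by perplexity rank. It is not CC-net's actual code: the function names are hypothetical, the log-probabilities would come from the language model, and the equal-thirds cutoffs are an assumption made for the example.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from natural-log token probabilities: exp(-mean(log p))."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def assign_buckets(ppl_scores):
    """Label each score 'head', 'middle', or 'tail'.

    Assumption for this sketch: cutoffs sit at the one-third and
    two-thirds positions of the sorted scores.
    """
    if not ppl_scores:
        return []
    ranked = sorted(ppl_scores)
    n = len(ranked)
    low_cut = ranked[n // 3]
    high_cut = ranked[(2 * n) // 3]
    buckets = []
    for ppl in ppl_scores:
        if ppl < low_cut:
            buckets.append("head")    # lowest perplexity, closest to Wikipedia
        elif ppl < high_cut:
            buckets.append("middle")
        else:
            buckets.append("tail")    # highest perplexity
    return buckets
```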

soldni commented

Hi @jtalmi, thank you for letting us know about blank documents. Our current mixing code does not remove documents that, after processing, have no tokens left. We are noting it down for future release.

The overall token count shouldn't be affected, since empty documents contribute no tokens; I guess it's just annoying.
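
A filter along these lines would drop documents that have no tokens left after processing. This is only a sketch, not the Dolma implementation: `drop_empty_documents` is a hypothetical helper, documents are assumed to be dicts with a `text` field, and whitespace splitting stands in for the real tokenizer.

```python
def drop_empty_documents(docs):
    """Yield only documents whose text still contains at least one token.

    Assumption: str.split() is a stand-in for the actual tokenizer.
    """
    for doc in docs:
        text = doc.get("text") or ""  # treat missing or null text as empty
        if text.split():
            yield doc
```

Because it is a generator, it can wrap a streaming reader without holding the whole shard in memory.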

Closing this issue since this was fixed in Dolma 1.6.