Blank documents in common crawl data

Question

Closed this issue 7 months ago · 4 comments

Hi, I've been exploring the Common Crawl data, which I downloaded from Hugging Face, and I noticed there seem to be a lot of rows with blank text.

For example, in data/common-crawl/cc_en_head/cc_en_head-0000.json.gz, I found that ~12.25% of rows had empty text fields.

Query:

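-- count rows whose text field is the empty string
select
    count(*) as all_rows
  , sum(case when text = '' then 1 else 0 end) as blank_text
  -- multiply by 1.0 so engines with integer division return a fraction
  , 1.0 * sum(case when text = '' then 1 else 0 end) / count(*) as ratio
from cc_en_head_0000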

Example ID:
http://009.housedems.com/article/dems-call-restoration-stolen-wages-mi-workers
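
For anyone who wants to reproduce the count without loading the file into a database, here is a minimal Python sketch of the same check. It assumes the shard is gzipped JSON with one object per line and a `text` field (as the query above implies). The function name `blank_text_ratio` is made up for illustration. Unlike the query, it also counts whitespace-only and missing text as blank.

```python
import gzip
import json


def blank_text_ratio(path, text_field="text"):
    """Return (total_rows, blank_rows, ratio) for a gzipped JSON-lines file.

    Assumption: one JSON object per line, with the document body stored
    under `text_field`. Missing or whitespace-only text counts as blank.
    """
    total = 0
    blank = 0
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip empty lines in the file itself
            row = json.loads(line)
            total += 1
            text = row.get(text_field)
            if text is None or text.strip() == "":
                blank += 1
    ratio = blank / total if total else 0.0
    return total, blank, ratio
```

Calling `blank_text_ratio("data/common-crawl/cc_en_head/cc_en_head-0000.json.gz")` on a local copy of the shard should give numbers comparable to the query, though the ratio may be slightly higher because of the looser definition of blank.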

Answer 1 · 2023-08-22T21:10:41.000Z

Also, what is the difference between the three subfolders: cc_en_head, cc_en_middle, cc_en_tail? I couldn't find any information about this in the data sheet. I checked a file in cc_en_head and cc_en_middle and both have blanks.

Answer 2 · 2023-08-25T12:48:33.000Z

See sections 3.4 and 5.2 of the CC-net paper for a description. In short, CC-net uses a 5-gram Kneser-Ney language model trained on Wikipedia to compute perplexity (ppl) for documents in the Common Crawl corpus. The lower the perplexity, the more similar the document is to Wikipedia, and the lowest-perplexity documents go in the head part. A document in the tail part is not necessarily bad, though; it only means its vocabulary distribution differs from Wikipedia's.
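
To make the bucketing idea concrete, here is a small Python sketch. It assumes the split is into roughly equal thirds by perplexity, which is my reading of the paper and is not stated in this thread. The function names and the threshold rule are illustrative only, not the actual CC-net code.

```python
def perplexity_thresholds(perplexities):
    """Return (head_max, middle_max) cut points at roughly the 1/3 and 2/3 marks.

    Assumption: buckets are equal-sized terciles of the perplexity scores.
    """
    ranked = sorted(perplexities)
    n = len(ranked)
    return ranked[n // 3], ranked[(2 * n) // 3]


def assign_bucket(ppl, head_max, middle_max):
    """Lower perplexity means closer to Wikipedia, so it lands in 'head'."""
    if ppl < head_max:
        return "head"
    if ppl < middle_max:
        return "middle"
    return "tail"


# Toy scores, not real data.
scores = [10, 20, 30, 40, 50, 60]
head_max, middle_max = perplexity_thresholds(scores)
buckets = [assign_bucket(s, head_max, middle_max) for s in scores]
```

Note that the bucket only reflects how Wikipedia-like the text looks to the language model, so it is a similarity signal and not a direct quality label.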

Answer 3 · 2023-08-27T00:18:19.000Z

Hi @jtalmi, thank you for letting us know about blank documents. Our current mixing code does not remove documents that, after processing, have no tokens left. We are noting it down for a future release.

Overall token count shouldn't be impacted; I guess it's just annoying.
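
Until that lands, a workaround on the consumer side is to skip empty documents while reading. The sketch below is not the Dolma mixing code; `drop_empty_documents` is a hypothetical helper, and it uses "text is empty or whitespace-only" as a stand-in for "no tokens left".

```python
def drop_empty_documents(docs, text_field="text"):
    """Yield only documents whose text has non-whitespace content.

    Assumption: each document is a dict with the body under `text_field`.
    """
    for doc in docs:
        text = doc.get(text_field) or ""
        if text.strip():
            yield doc


# Toy documents, not real rows from the dataset.
sample = [
    {"id": "a", "text": "some content"},
    {"id": "b", "text": ""},
    {"id": "c", "text": "   \n"},
    {"id": "d"},
]
kept = [doc["id"] for doc in drop_empty_documents(sample)]
```

Because empty documents contribute no tokens, dropping them this way should leave the token count unchanged and only reduce the row count.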

Answer 4 · 2024-02-21T23:36:24.000Z

Closing this issue since this was fixed in Dolma 1.6.