
Chuweb21D: A Deduped English Document Collection for Web Search Tasks

The home page for the public data collections Chuweb21 and Chuweb21D. For more details on the two collections, please refer to the paper Chuweb21D: A Deduped English Document Collection for Web Search Tasks (to be released soon).

Background

Chuweb21D is an English document collection deduped from Chuweb21, which was used as the target corpus for the NTCIR-16 WWW-4 task. Our motivation for deduping Chuweb21 comes from our experience as organisers of that task: when conducting relevance assessments of the pooled documents, we witnessed a considerable number of duplicate and near-duplicate web pages, which can potentially cause problems. For deduping, we employ the Simhash strategy with two different clustering thresholds (Hamming distance $\tau \le 2$ and $\tau \le 3$) and release two versions of Chuweb21D; the smaller collection (Chuweb21D-60) will be used as the target corpus for the upcoming NTCIR-17 FairWeb-1 (Group-Fair Web Search) task. A rough sketch of the Simhash idea is given below.
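
As a minimal illustration of Simhash-based deduplication: each document is mapped to a 64-bit fingerprint, and two documents are treated as near-duplicates when the Hamming distance between their fingerprints is at most $\tau$. The tokenization and hash function below are our own simplifying assumptions, not necessarily the exact configuration used to build Chuweb21D.

import hashlib

def simhash(text, bits=64):
    # Map a document to a 64-bit fingerprint: hash each token, then take
    # the sign of the per-bit vote counts. (Illustrative tokenization/hash only.)
    v = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming_distance(a, b):
    return bin(a ^ b).count("1")

# Near-duplicate test with clustering threshold tau (2 or 3 for Chuweb21D)
tau = 3
fp_a = simhash("the quick brown fox jumps over the lazy dog")
fp_b = simhash("the quick brown fox jumped over the lazy dog")
print(hamming_distance(fp_a, fp_b) <= tau)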

The following figure summarizes the construction procedures of the Chuweb21 and Chuweb21D collections:

[Figure: construction procedures of Chuweb21 and Chuweb21D]

Basic statistics

The following table compares several popular English document collections with our Chuweb21(D) datasets.

Name             #Docs   Size     Crawl time           Format
TREC GOV2        25M     80GB     Early 2004           raw HTML
MS MARCO Docs    3.2M    22GB     Before Jan 2017      text (title; body)
TensorFlow C4    /       750GB    Apr 2019             text
ClueWeb09        1.04B   5TB      Jan 2009 ~ Feb 2009  raw HTML
ClueWeb12        733M    5.54TB   Feb 2012 ~ May 2012  raw HTML
ClueWeb22        10B     /        Before Aug 2022      raw HTML
Chuweb21         82.5M   1.7TB    Apr 2021             raw HTML
Chuweb21D-60     49.8M   1.2TB    Apr 2021             raw HTML
Chuweb21D-70     57.9M   964GB    Apr 2021             raw HTML

How to download

We have already uploaded Chuweb21 and Chuweb21D-60 to web disks (Chuweb21D-70 is coming soon!). You can download the data through the shared links listed below:

Dataset        Link (Chinese mainland)     Link (others)                                                 MD5 checksum
Chuweb21       Baidu Cloud (code: t828)    Google Drive                                                  md5.txt
Chuweb21D-60   Baidu Cloud (code: a6j2)    TeraBox (code: wtsh)                                          md5.txt
Chuweb21D-70   Baidu Cloud (code: v6xh)    Part 1: TeraBox (code: 75im); Part 2: TeraBox (code: spq8)    md5.txt

In addition to web-disk downloads, we also support hard-disk shipping (only for users in Chinese mainland) and server SCP for data delivery. Please contact us via e-mail (thuir_datamanage@126.com) if needed.

Dataset structure

Both the Chuweb21 and Chuweb21D collections are organized as follows:

data
├── CC-MAIN-XXX     // eight folders (named by time interval), each containing 640 (or 639) warc.gz files
│   ├── CC-MAIN-XXX-00000.warc.gz     // a warc.gz file containing a large number of HTML documents
│   └── ... ...
│   └── CC-MAIN-XXX-00639.warc.gz
├── md5_checksum.txt     // md5 codes for each warc.gz

8 directories, 5119 files
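
The md5_checksum.txt file can be used to verify downloaded warc.gz files. The sketch below assumes the common md5sum layout of one "<md5>  <filename>" pair per line and a local directory named data; both are assumptions rather than a guaranteed file format:

import hashlib
import os

def file_md5(path, chunk_size=1 << 20):
    # Stream the file so large warc.gz files need not fit in memory.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

data_dir = "data"  # hypothetical path to the downloaded collection
with open(os.path.join(data_dir, "md5_checksum.txt")) as f:
    for line in f:
        expected, name = line.split()  # assumed "checksum  filename" layout
        path = os.path.join(data_dir, name)
        if os.path.exists(path):
            status = "OK" if file_md5(path) == expected else "MISMATCH"
            print(name, status)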

The HTML documents are stored in gzip-compressed WARC ("warc.gz") format (a sample file of about 70MB: sample.warc.gz). Here we provide a sample Python script for reading a "warc.gz" file:

import warc  # pip install warc3-wet
import traceback

with warc.open("sample.warc.gz") as f:
    for record in f:
        try:
            url = record['WARC-Target-URI']  # URL of the crawled HTML page
            uid = record['WARC-RECORD-ID']
            uid = uid.replace("<urn:uuid:", "").replace(">", "")  # HTML document id
            content = record.payload.read()  # raw HTML content
        except Exception:
            traceback.print_exc()
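
The same loop can be extended to the full collection by walking over every warc.gz file under the CC-MAIN-XXX folders; the data path below is a placeholder for wherever the collection was downloaded:

import glob
import warc  # pip install warc3-wet

# Count the records in every warc.gz file of the collection.
for path in sorted(glob.glob("data/CC-MAIN-*/*.warc.gz")):
    with warc.open(path) as f:
        n_records = sum(1 for _ in f)
    print(path, n_records)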

Authors

E-mail: thuir_datamanage@126.com

Zhumin Chu (Tsinghua University, P.R.C.)

Tetsuya Sakai (Waseda University, Japan)

Qingyao Ai (Tsinghua University, P.R.C.)

Yiqun Liu (Tsinghua University, P.R.C.)