# NeuCLIR Collection 1
If you are registered as a participant in the TREC NeuCLIR Track 2022, you can download the collection from the link provided on this page. Note that you will need `trec2022` credentials (provided when you register for TREC) to download this file.
This repository contains the scripts for downloading and validating the documents.
Document ids are stored in `resource/{lang}/ids.*.jsonl.gz`. The packages required by the scripts are recorded in `requirements.txt`.
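Each line of these gzipped id files is a JSON record carrying a document id and its hash, so you can peek at a few entries directly from the shell:

```bash
# Inspect the first two id records for Chinese (zho)
zcat resource/zho/ids.*.jsonl.gz | head -n 2
```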
We recommend creating a new Python environment for downloading. Package versions can have unintended effects on decoding the documents from Common Crawl. Please contact us if you end up with documents whose hashes do not match.
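For example, a fresh environment with `venv` (any environment manager works; the environment name here is arbitrary):

```bash
python3 -m venv neuclir-env          # create an isolated environment
source neuclir-env/bin/activate      # activate it
pip install -r requirements.txt      # install the required package versions
```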
## Download Documents
To download the documents from Common Crawl, please use the following command. If you plan to use this collection with `ir_datasets`, please specify `~/.ir_datasets/neuclir/1` as the storage location, or make a soft link from that path to the directory where you wish to store the documents (a sketch of the soft-link approach follows the command below). The document ids and hashes are stored in `resource/{lang}/ids.*.jsonl.gz`.
```bash
python download_documents.py --storage ./data/ \
                             --zho ./resource/zho/ids.*.jsonl.gz \
                             --fas ./resource/fas/ids.*.jsonl.gz \
                             --rus ./resource/rus/ids.*.jsonl.gz \
                             --jobs 4 --check_hash
```
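For the `ir_datasets` setup mentioned above, a minimal sketch of the soft link, assuming `./data/` is the storage directory passed to the download script:

```bash
# Point ~/.ir_datasets/neuclir/1 at the actual storage directory
mkdir -p ~/.ir_datasets/neuclir
ln -s "$(realpath ./data)" ~/.ir_datasets/neuclir/1
```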
If you wish to download the documents for only one language, simply specify the id file for that language alone.
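For example, a Persian-only download:

```bash
python download_documents.py --storage ./data/ \
                             --fas ./resource/fas/ids.*.jsonl.gz \
                             --jobs 4 --check_hash
```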
In case the URLs for the Common Crawl files change in the future, the `--cc_base_url` flag provides the option to specify an alternative base URL for the files. The current default points to https://data.commoncrawl.org/.
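For example, to point the script at a mirror (the mirror URL below is a hypothetical placeholder):

```bash
python download_documents.py --storage ./data/ \
                             --zho ./resource/zho/ids.*.jsonl.gz \
                             --cc_base_url https://example-mirror.org/ \
                             --jobs 4 --check_hash
```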
A full description of the arguments can be found by executing the script with the `--help` flag.
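For example:

```bash
python download_documents.py --help
```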
## Postprocessing of the Downloaded Documents
Multiprocessing during download results in arbitrary ordering of the documents in the saved `.jsonl` files. To support full reproducibility, we provide a script that postprocesses each file to match the document order specified in the document id files.
`fix_document_order.py` reorders the documents, validates the document hashes, and verifies that all and only the specified documents are present in the resulting file. The unsorted file is renamed to `docs.jsonl.bak`; you can delete it manually. The following is a sample command:
```bash
python fix_document_order.py --raw_download_file ./data/{lang}/docs.jsonl \
                             --id_file ./resource/{lang}/ids.*.jsonl.gz \
                             --check_hash
```
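Once the reordered file validates, the backup can be removed manually:

```bash
# Substitute zho, fas, or rus for {lang}
rm ./data/{lang}/docs.jsonl.bak
```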
If the script identifies missing documents during postprocessing, please rerun the download script with the `--resume` flag to fetch them. Some documents may be missing due to temporary network failures or connections refused by the Common Crawl servers; rerunning the download script usually retrieves them. If not, please raise an issue with the document id to bring it to our attention.
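A resumed run repeats the original command with the `--resume` flag added, e.g. for Chinese:

```bash
python download_documents.py --storage ./data/ \
                             --zho ./resource/zho/ids.*.jsonl.gz \
                             --jobs 4 --check_hash --resume
```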
## Converting the Chinese Character Sets
The Chinese document collection contains both traditional and simplified Chinese characters from the original sources. We provide a script for converting the documents to either all-traditional or all-simplified characters for users who would like a unified character set. The following command creates a `docs.{traditional, simplified}.jsonl` file in the same directory as the original `docs.jsonl` file:
```bash
python convert_chinese_char.py --document_file ./data/zho/docs.jsonl \
                               --convert_to {traditional, simplified} \
                               --line_count 3179209
```
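The `--line_count` value above is the expected number of Chinese documents, so a quick sanity check is to count the lines of the output file (the exact filename depends on the `--convert_to` choice):

```bash
wc -l ./data/zho/docs.simplified.jsonl   # expect 3179209 lines
```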