CCpdf

Index of URLs to PDF files from across the internet, with accompanying scripts

Primary language: Shell. License: MIT.

Data and scripts accompanying the CCpdf paper

This repository contains data and simple scripts accompanying the "CCpdf: Building a High Quality Corpus for Visually Rich Documents from Web Crawl Data" paper.

The data presented here is a subset of the data made public by the Common Crawl organization; see https://commoncrawl.org/2022/06/may-2022-crawl-archive-now-available/

Files

  • ccpdf.tsv — metadata of CCpdf files
  • run.sh — main script for downloading CCpdf files from publicly available sources
  • download-from-crawl.sh — script for the actual downloading
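
As a rough illustration of what downloading a single file from Common Crawl can involve, the sketch below builds a byte-range request for one record of a WARC archive. It is not taken from run.sh or download-from-crawl.sh: the function names are hypothetical, and it assumes that a WARC path, a byte offset, and a record length are available for each file, which may not match the actual columns of ccpdf.tsv. The command is printed rather than executed, so nothing is fetched.

```shell
#!/usr/bin/env bash
# Sketch only. Function names and input fields (WARC path, offset, length)
# are assumptions for illustration, not the layout of ccpdf.tsv.

# Turn a record offset and length into an inclusive byte range "start-end".
byte_range() {
  local offset=$1 length=$2
  echo "${offset}-$((offset + length - 1))"
}

# Build the public URL of a WARC file from its path within the crawl.
warc_url() {
  echo "https://data.commoncrawl.org/$1"
}

# Print (do not run) the curl command that would fetch one WARC record.
fetch_command() {
  local warc_path=$1 offset=$2 length=$3
  echo "curl -s --range $(byte_range "$offset" "$length") $(warc_url "$warc_path")"
}

# Example call with a made-up path:
#   fetch_command crawl-data/example.warc.gz 100 50
```

Each WARC record fetched this way is an independent gzip member, so the output of the printed command would typically be piped through `gzip -dc` before the PDF payload is extracted.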