PDFs download

Question

Opened this issue 2 years ago · 6 comments

Hi,

thanks for sharing this project. Will the actual PDF dataset be made available as well? Or is there another way to avoid rerunning the whole pipeline?

Best,
Malte

Answer 1 · 2023-05-01T20:51:20.000Z

I think the pipeline described in the paper has not been provided (yet, 🤞 maybe the authors will release it at some point?).

For now, what is provided is the final index and a script to download the URLs that made the cut after running the pipeline on the May 2022 CC dataset.
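
To give a rough idea of what fetching PDFs from such a URL list involves, here is a minimal sketch. It is not the script shipped with the project: the function names (`fetch_bytes`, `download_pdfs`), the hashed file naming, and the `%PDF-` header check are all my own assumptions for illustration.

```python
import hashlib
import urllib.request
from pathlib import Path
from typing import Callable, Dict, Iterable


def fetch_bytes(url: str, timeout: float = 30.0) -> bytes:
    """Download the raw body of a URL using only the standard library."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_pdfs(
    urls: Iterable[str],
    out_dir: str,
    fetch: Callable[[str], bytes] = fetch_bytes,
) -> Dict[str, Path]:
    """Fetch each URL and save bodies that look like PDFs (hypothetical helper).

    Returns a mapping from URL to the path of the saved file.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    saved: Dict[str, Path] = {}
    for raw_url in urls:
        url = raw_url.strip()
        if not url:
            continue  # ignore blank lines in the URL list
        try:
            body = fetch(url)
        except Exception as exc:  # dead links are expected in old crawls
            print(f"skipped {url}: {exc}")
            continue
        # Crude sanity check (an assumption): keep only bodies that start
        # with the PDF header, so HTML error pages are not saved as PDFs.
        if not body.startswith(b"%PDF-"):
            continue
        # Name the file after a hash of the URL to avoid collisions.
        name = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16] + ".pdf"
        target = out_path / name
        target.write_bytes(body)
        saved[url] = target
    return saved
```

You would call it with something like `download_pdfs(open("urls.txt"), "pdfs")`, where `urls.txt` is a hypothetical file with one URL per line. Many URLs from a 2022 crawl will likely be gone by now, which is why failures are skipped rather than treated as errors.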

Answer 2 · 2023-05-23T13:51:10.000Z

May be of interest? https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/

Answer 3 · 2023-05-26T16:50:53.000Z

Thanks @tballison! This is definitely of interest.

@malteos @tballison would you like to join forces and build an open-source replication of the CCpdf pipeline?

Answer 4 · 2023-05-26T17:59:08.000Z

Always happy to collaborate!

Answer 5 · 2023-06-05T08:09:46.000Z

We shared all the data and code we could while remaining compliant with our company's data policy. Personally, I'm keeping my fingers crossed for your open-source replication of the pipeline (I hope the paper will be useful to you)!

I'll keep this thread open for future discussion of pipeline replication and access to PDFs from other crawls.

Answer 6 · 2023-06-08T21:15:44.000Z

Apologies for the delay. I have started work on replicating the CCpdf pipeline and results in this repo: https://github.com/SushantDaga/ThePDFCorpus

Any contribution will be greatly appreciated :)