PDFs download
Hi,
thanks for sharing this project. Will the actual PDF dataset be made available as well? Or is there another way to avoid rerunning the whole pipeline?
Best,
Malte
I think the pipeline described in the paper has not been released (yet 🤞, maybe the authors will provide it at some point?).
For now, what is provided is the final index and a script to download the URLs that made the cut after the pipeline was run on the May 2022 CC dataset.
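For anyone who only needs the files, here is a rough sketch of what such a download loop could look like. It is not the script from the repository. It assumes the URLs have already been pulled out of the index into a plain list, and the names `target_path`, `fetch_bytes`, and `download_all`, as well as the hash-based file naming, are hypothetical choices made for illustration.

```python
import hashlib
import urllib.request
from pathlib import Path


def target_path(url: str, out_dir: Path) -> Path:
    """Derive a stable local filename from the URL (hypothetical naming scheme)."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return out_dir / f"{digest}.pdf"


def fetch_bytes(url: str, timeout: float = 30.0) -> bytes:
    """Fetch the raw response body for one URL using only the standard library."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(urls, out_dir, fetch=fetch_bytes):
    """Download every URL into out_dir and return the paths that are on disk.

    Assumption: `urls` is an iterable of strings, e.g. lines exported from the index.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        url = url.strip()
        if not url:
            continue  # ignore blank lines
        path = target_path(url, out_dir)
        if path.exists():
            saved.append(path)  # already downloaded on an earlier run
            continue
        try:
            data = fetch(url)
        except (OSError, ValueError):
            continue  # unreachable host, HTTP error, timeout, or malformed URL
        if not data.startswith(b"%PDF"):
            continue  # the server returned something other than a PDF
        path.write_bytes(data)
        saved.append(path)
    return saved
```

Calling `download_all(open("urls.txt"), "pdfs")` would work on a file with one URL per line (`urls.txt` is a made-up name). URLs that fail or no longer serve a PDF are skipped silently, so the returned list can be shorter than the input.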
May be of interest? https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/
Thanks @tballison! This is definitely of interest.
@malteos @tballison want to join forces and make an open source replication of the CCpdf pipeline?
Always happy to collaborate!
We shared all the data and code we could while complying with our company's data policy. Personally, I'm keeping my fingers crossed for your open source replication of the pipeline (I hope the paper will be useful to you)!
I'll keep this thread open for future discussions on pipeline replication and access to PDFs from other crawls.
Apologies for the delay. I have started work in this repo to replicate the CC-PDF pipeline and results: https://github.com/SushantDaga/ThePDFCorpus
Any contribution will be greatly appreciated :)