mathiasleroy/pdfscraping

pdfscraping

Hackathon exploration of PDF scraping

TASKS

cataloguing existing packages and looking at privacy considerations
evaluation of different tools and approaches

a. text-based

b. image-based

investigation of LLMs

a. look at APIs in R, Python

b. identify potential LLMs that could be used

investigation of validation pipelines to check extraction quality
investigation of scraping plots (if time allows)
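
As a starting point for the validation task, one possible check is to compare extracted text against a hand-made reference transcription of the same page. The sketch below is only an illustration under assumptions: the function names (`normalize`, `extraction_score`, `flag_low_quality`), the whitespace/case normalization, and the 0.9 threshold are all hypothetical choices, not something this repo has settled on. It uses only Python's standard-library `difflib`.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Collapse whitespace runs and lowercase, so layout differences are not scored as errors."""
    return " ".join(text.split()).lower()


def extraction_score(extracted: str, reference: str) -> float:
    """Return a character-level similarity ratio in [0, 1] between extracted and reference text."""
    # autojunk=False keeps frequent characters from being ignored on long pages.
    matcher = SequenceMatcher(None, normalize(extracted), normalize(reference), autojunk=False)
    return matcher.ratio()


def flag_low_quality(pages: dict[str, tuple[str, str]], threshold: float = 0.9) -> list[str]:
    """Return the ids of pages whose score falls below the (assumed) threshold.

    `pages` maps a page id to an (extracted, reference) pair of strings.
    """
    return [
        page_id
        for page_id, (extracted, reference) in pages.items()
        if extraction_score(extracted, reference) < threshold
    ]


if __name__ == "__main__":
    sample = {
        "p1": ("Total  revenue\n2023", "total revenue 2023"),
        "p2": ("xxxx", "hello world"),
    }
    for page_id, (extracted, reference) in sample.items():
        print(page_id, round(extraction_score(extracted, reference), 3))
    print("needs review:", flag_low_quality(sample))
```

A character-level ratio like this is cheap and tool-agnostic, so the same check could be run over the output of text-based, image-based, and LLM-based extractors. It says nothing about table structure or numeric correctness, which would need separate checks.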