pdfscraping

Hackathon exploration of PDF scraping

TASKS

  1. cataloguing existing packages and looking at privacy considerations

  2. evaluation of different tools and approaches

     a. text based

     b. image based

  3. investigation of LLMs

     a. look at APIs in R, Python

     b. identify potential LLMs that could be used

  4. investigation of validation pipelines to check extraction quality

  5. investigation of scraping plots (if time allows)
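
As a possible starting point for the validation task above, here is a minimal sketch of one way to score extraction quality: compare the text a tool extracts against a hand-checked reference transcription. It uses only the Python standard library. The function names (`normalise`, `similarity`, `passes`) and the 0.95 threshold are illustrative assumptions, not something the project has agreed on.

```python
from difflib import SequenceMatcher


def normalise(text: str) -> str:
    """Collapse runs of whitespace and lowercase, so layout differences are ignored."""
    return " ".join(text.split()).lower()


def similarity(extracted: str, reference: str) -> float:
    """Return a ratio between 0 and 1; 1.0 means the normalised texts are identical."""
    return SequenceMatcher(None, normalise(extracted), normalise(reference)).ratio()


def passes(extracted: str, reference: str, threshold: float = 0.95) -> bool:
    """Hypothetical pass/fail check; the threshold is an assumed value to be tuned."""
    return similarity(extracted, reference) >= threshold


if __name__ == "__main__":
    reference = "Total revenue for 2023 was 4.2 million."
    extracted = "Total  revenue for 2023\nwas 4.2 million."
    print(round(similarity(extracted, reference), 3), passes(extracted, reference))
```

A character-level ratio like this is crude: it treats a wrong digit the same as a wrong letter, so numeric fields or tables would probably need stricter, field-by-field checks.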