./scripts/crawl_case_pdfs.r:
The Hong Kong government routinely releases details of confirmed COVID-19 cases in PDF format on its websites (e.g. the attachment from this link).
Some of this information was provided exclusively in these PDFs. Here I provide a small R script for downloading all of these PDF attachments from the news reports published over the past two years.
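The download step can be sketched in base R roughly as follows. This is a minimal illustration and not the contents of crawl_case_pdfs.r: the helper names `extract_pdf_links` and `download_pdfs` are hypothetical, and the sketch assumes each news report page links to its PDF attachments with absolute URLs.

```r
# Hypothetical helper: pull every href ending in .pdf out of raw HTML text.
extract_pdf_links <- function(html) {
  # Match href="...pdf" with single or double quotes, ignoring case.
  pattern <- "href=[\"']([^\"']+\\.pdf)[\"']"
  hits <- regmatches(html, gregexpr(pattern, html, ignore.case = TRUE))[[1]]
  # Keep only the captured URL, dropping the href= wrapper and the quotes.
  links <- sub(pattern, "\\1", hits, ignore.case = TRUE)
  unique(links)
}

# Hypothetical helper: download every PDF linked from one news report page.
# Assumption: the links are absolute URLs; relative links would first need
# to be resolved against the page URL.
download_pdfs <- function(page_url, dest_dir = "pdfs") {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  html <- paste(readLines(page_url, warn = FALSE), collapse = "\n")
  links <- extract_pdf_links(html)
  for (link in links) {
    dest <- file.path(dest_dir, basename(link))
    # mode = "wb" writes the file as binary, which PDFs need on Windows.
    if (!file.exists(dest)) download.file(link, dest, mode = "wb")
  }
  invisible(links)
}
```

Calling `download_pdfs()` once per news report URL would collect the attachments in one folder. Files that already exist are skipped, so the loop can be re-run safely.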
./scripts/extract_tables_from_PDF.r:
To facilitate downstream data wrangling with colleagues, we converted the PDF tables into Excel xlsx files using open-source OCR software. Specifically, we tried the model trained on Traditional Chinese (--lang=chinese_cht).
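A batch conversion over the downloaded files could be driven from R along these lines. The sketch is hedged throughout: the text does not name the OCR program, so `ocr_cmd = "ocr_tool"` is a placeholder, and passing the PDF path as a positional argument is an assumption. Only the `--lang=chinese_cht` flag comes from the description above. The helper names `build_ocr_calls` and `run_ocr_calls` are hypothetical.

```r
# Hypothetical helper: build one OCR command per PDF found in `pdf_dir`.
# `ocr_cmd` is a placeholder, not the tool the authors actually used.
build_ocr_calls <- function(pdf_dir, ocr_cmd = "ocr_tool", lang = "chinese_cht") {
  pdfs <- sort(list.files(pdf_dir, pattern = "\\.pdf$",
                          full.names = TRUE, ignore.case = TRUE))
  lapply(pdfs, function(pdf) {
    # Assumption: the tool takes the language flag followed by the input path.
    list(command = ocr_cmd, args = c(paste0("--lang=", lang), shQuote(pdf)))
  })
}

# Hypothetical helper: run the prepared commands one after another.
run_ocr_calls <- function(jobs) {
  for (job in jobs) system2(job$command, args = job$args)
  invisible(length(jobs))
}
```

Keeping command construction separate from execution makes it possible to inspect the list returned by `build_ocr_calls()` before anything is run.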