
Extract text from thousands of PDF documents containing Facebook ad info, as released by the House Intelligence Committee, and transform it into a structured database.

Primary language: R. License: MIT.

intelcmte_facebookads_extract

Extract text from PDF documents containing Facebook ad info and build a structured database.

Getting the raw documents

A sample of the documents is included in the pdfs directory. The entire collection of 3,000+ PDF files is available at https://intelligence.house.gov/social-media-content/social-media-advertisements.htm

To run the extraction code on the entire collection, place all of the files found at the link above into the pdfs directory on your local machine. To keep this GitHub repo small, only a sample of about 100 files is included here.
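
After copying files into the pdfs directory, it can be useful to confirm how many PDFs are actually there before starting a long extraction run. The snippet below is a minimal sketch in base R, not part of this repo's extraction code: the helper name `list_ad_pdfs` is hypothetical, and the only thing taken from this README is the `pdfs` directory name.

```r
# Hypothetical helper (not part of this repo): list the PDF files in a directory.
# Assumes the documents live in "pdfs", as described in this README.
list_ad_pdfs <- function(pdf_dir = "pdfs") {
  if (!dir.exists(pdf_dir)) {
    warning("Directory not found: ", pdf_dir)
    return(character(0))
  }
  # Match the .pdf extension case-insensitively and return full paths
  list.files(pdf_dir, pattern = "\\.pdf$", full.names = TRUE, ignore.case = TRUE)
}

# Only report a count when the directory is present
if (dir.exists("pdfs")) {
  pdf_files <- list_ad_pdfs("pdfs")
  message(length(pdf_files), " PDF files found in pdfs/")
}
```

With only the bundled sample in place, the count should be about 100; with the full collection it should be above 3,000.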