fda-mining: A Python repository from polymerchm

This Repository is an attempt to "deep" scrape the FDA CDRH website for keywords at the Summary Document Level

It uses openFDA to get 510(k) numbers, and then the CDRH website and BeautifulSoup library to download individual PDFs of the summary documents.

Once downloaded, the PDFs are converted to images and using the tesseract library, converted to searchable text

Text searches are from a keyword list using the fuzzywuzzy fuzzy search library

Current version is a single instance: slow and serial.

Next phase willl parallelize to speed things up some.

polymerchm/fda-mining