November 2016
The FDA (the U.S. Food and Drug Administration) records every application for a new medical product in its online database. These applications include a number of reports detailing the FDA's assessment of the product's safety and suitability for the U.S. market.
There were suspicions that the site was missing a vast number of these reports. However, until recent efforts, this database (Drugs@FDA) was extremely difficult to navigate, and it was hard to tell just how bad the problem was. The British Medical Journal asked me to build a web scraper to crawl through over 22,000 different products in order to gauge the quality of this database.
This project contains all of the code used to scrape the data from the (now retired) Drugs@FDA site and analyse it. I advise reading Drugs@FDA Analysis.ipynb first as it acts as a summary of the project's results.
FDA Spider --> Contains the Scrapy project used to scrape drug application data from the Drugs@FDA site. The main spider was FDASpider
masterDrugList2.csv --> The CSV file output by FDASpider, containing the raw drug application data
FDA_Data_Analysis.py --> Analysis of masterDrugList2.csv using pandas
Drugs@FDA Analysis.ipynb --> A Jupyter notebook containing a rough summary of the analysis and conclusions
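
For readers who want a feel for the kind of question the analysis asks, here is a minimal sketch of counting applications with no linked report. It is not the code from FDA_Data_Analysis.py: it uses Python's standard csv module rather than pandas so that it runs without extra installs, and the column names (application_no, drug_name, review_url), the sample rows and the helper count_missing_reports are all hypothetical. The real columns in masterDrugList2.csv may differ.

```python
import csv
import io

# Hypothetical sample rows standing in for masterDrugList2.csv.
# The real file's column names and contents may differ.
SAMPLE_CSV = """application_no,drug_name,review_url
N000001,ExampleDrugA,https://example.org/review-a
N000002,ExampleDrugB,
N000003,ExampleDrugC,
N000004,ExampleDrugD,https://example.org/review-d
"""


def count_missing_reports(csv_file, report_column="review_url"):
    """Return (missing, total): rows with a blank report column, and all rows."""
    rows = list(csv.DictReader(csv_file))
    # Treat an absent or whitespace-only value as a missing report.
    missing = sum(
        1 for row in rows if not (row.get(report_column) or "").strip()
    )
    return missing, len(rows)


missing, total = count_missing_reports(io.StringIO(SAMPLE_CSV))
print(f"{missing} of {total} applications have no linked report")
```

To try it against the real data, pass an open file handle for masterDrugList2.csv in place of the in-memory sample and set report_column to whichever column holds the report links.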