Data Processing and Analysis Code for Results Reporting Trends on ClinicalTrials.gov under FDAAA 2007
Overview
This repository contains everything you need to recreate our analysis published in The Lancet. This code can also be easily adapted for future analyses of interest using ClinicalTrials.gov data.
Data Sources
Each working day we download the full data from ClinicalTrials.gov as part of our FDAAA TrialsTracker project. The data is available in XML format that we convert to JSON strings. We store these in CSV format, delimited by the þ
character for ease of use with tools like BigQuery, however they are also able to be parsed as ndjson files. The code for that downloading and processing is located as part of our TrialsTracker "clinicaltrials-act-converter" repo. Additional code for the FDAAA TrialsTracker is located here.
Adapting the code used to identify applicable trials for the TrialsTracker, we are able to take the raw data of the entirety of ClinicalTrials.gov on a given day and convert it to CSVs with the relevant data necessary for the analysis. Due to their size, the raw "CSV" files that inform this analysis are available separately in an open OSF repository. We are happy to freely share any additional full archives of ClinicalTrials.gov from our database. Please email us at ebmdatalab@phc.ox.ac.uk and we can discuss the best way to get you the data.
Data Processing and Analysis
Raw Data Processing
Each raw data file used for this analysis is processed using the code in the Raw Data Processing
directory. This code takes one of our CSVs of JSON as an input and extracts the necessary data fields to identify ACTs/pACTs and any additional data needed for the analysis to a CSV. The processed data files for this analysis are available both in the Processed CSVs
directory in the Data directory of this repository as well as our OSF page.
STATA Analysis
Similarly, the STATA Analysis
directory contains separate processing code that extracts only the data necessary for the statistical analysis conducted in STATA along with our .do file and additional STATA output and log files.
notebooks
The notebooks
directory contains all the remaining primary analysis code and results for this project in the FDAAA Trends Noteboook - Final.ipynb
notebook.
Figures
All figures from the FDAAA Trends Notebook - Final.ipynb
notebook are available in the Figures
directory in vector (.svg and .eps) formats.
Peer Review Additions
Peer Review Additions
contains some additional statistics and analysis that were added to the paper at the request of peer reviewers.
lib
The lib
directory contains .py files with functions to import for the processing and analysis of the data including lifelines_fix.py
which cosmetically patches the lifelines
module used for the survival analysis to better display at risk counts.
Data
Files necessary for both the raw data processing and the overall analysis:
fdaaa_regulatory_snapshot.csv
is our archive of the old "is_fda_regulated" field from ClinicalTrials.gov used in our pACT identification logic. This data is taken from the 5 January 2017 archive of ClinicalTrials.gov available from the Clinical Trials Transformation Initiative.
qa.csv
is our scrape of QC data used for QC data prior to it being made available in the public XML data.
We also include a folder of the processsed CSV files
Processed CSVs
and a placeholder directory in which you can place the raw data from here
Additional files and directories in the repository are for use with Docker as described below.
How to view the notebooks and use the repository with Docker
The analysis Notebooks live in the notebooks/
folder (with an ipynb
extension). You can most easily view them on nbviewer, though looking at them in Github should also work.
The repository has also been set up to run in Docker to ensure a compatible environment. While the notebook should be able to run in the current directory without Docker (assuming the environment specified in requirements.txt
) you can follow the directions in the Developers.md
file to clone this repository and run any code of interest within a Docker container on your machine.
How to cite
You can cite our Lancet paper for the methods and results of this analysis:
DeVito NJ, Bacon S, Goldacre B. Compliance with legal requirement to report clinical trial results on ClinicalTrials.gov: a cohort study. Lancet 2020; 395: 361–9.
Static DOI: 10.1016/S0140-6736(19)33220-9
You can cite our code directly via Zenodo:
Please note, the version of the repository at this DOI on Zenodo is the version as it stood at publication of the paper. All data and analysis code remains unchanged compared to this repository, however non-analysis portions of the code may have been updated or refactored, the structure of the directory may have changed, and Docker compatibility has been added.