Authors: Xiang Zhang, Marissa Sumathipala, Marinka Zitnik
Paper on Nature Computational Science
We investigate 10,443,476 adverse event reports (involving 19,193 adverse events and 3,624 drugs) from the U.S. Food and Drug Administration (FDA) Adverse Event Reporting System (FAERS) dataset, collected from January 2013 to September 2020. To support safety surveillance of therapeutic products, the FAERS stores anonymized, manually reviewed adverse event reports received by the FDA.
We use the dataset to detect adverse events that are significantly associated with the pandemic, pinpoint clinically relevant drugs strongly connected with adverse drug events, and identify disparities in the distribution of adverse events across sex and age. To this end, we develop an approach that identifies clinically meaningful adverse events that meet three criteria: i) the reporting frequency of the adverse event changed significantly between 2019 and 2020, ii) the change cannot be explained by its trend in previous years (2013 to 2019), and iii) the adverse drug reaction is strongly associated with at least one drug and cannot be explained by drug interference.
-
Step 1
1_rawdata_download.ipynb
: Download raw FAERS data from 2013 Q1 to 2020 Q3 (31 quarters). We provide bash commands to efficiently download the quarterly dataset. SeeXML_NTS.pdf
(provided in Dataset) for a better understanding of data structure and description. -
Step 2
2_parsing_preprocessing.ipynb
: Extract key information from FAERS raw XML data, including administration (version, report_id, case_id, country, qualify, serious, serious subtype list, receivedate, receiptdate), demographic (age, gender, weight), adverse events list (each element represents an adverse drug reaction including PT_code, PT, and outcome), and drug list (each element represents a drug, the key information is drug substance)). Some reports are incomplete, we fill a fixed number for missing values (more details are introduced in script). The files are parsed in quarter and then merged. We provide the curated and processed data at Dataset. -
Step 3
3_generate_population_cohort.ipynb
: Load processed dataset and narrow down based on reporting country (only keep reports occurred in the US) and reporter's qualification (only keep reports submitted by healthcare professionals). In order to conduct population-scale analysis, we generate cohorts for different populations: all patients, male, female, young (age:1-19), adult (20-65), and elderly (>65). -
Step 4
4_disproportionality_estimation.ipynb
: This script corresponds to the disproportionality estimation as introduced in the online methods in our paper. For each population, we conduct disproportionality estimation on each adverse event to examine the association between the adverse event and pandemic (March 11–September 30, 2020) in contrast to before the pandemic (March 11–September 30, 2019). We here use disproportionality analysis in a novel way that quantifying the association between adverse events and their submission periods. We calculate the significance values using Fisher’s exact test followed by the Bonferroni correction for multiple hypothesis testing. We keep only the adverse events, which pass both the significance test (adjusted p-value < 0.05) and the ROR criterion. -
Step 5
5_AE_trajectories.ipynb
: Implement AE reporting trajectories as described in our paper. For each adverse event, we calculate its incidence proportion from 2013 to 2019 and use 2-order auto-regressive regression to model its trajectory. We then define a dedicated pandemic-adverse event association index (PAEAI) to measure whether a medication’s incidence during the pandemic conforms to its trajectory. We only feed adverse events with positive PAEAI to the next step of our approach. In this file, we also map adverse events from Preferred Terms (PTs) level to System Organ Classes (SOCs) level, based on etiology (such as infections and infestations) and manifestation sites (such as gastrointestinal disorders) following the MedDRA hierarchy. -
Step 6
6_remove_drug_interference.ipynb
: Correspond to the step of drug interference in our approach. In this file, we check two aspects of the association among drugs, adverse events, and the pandemic. First, the adverse event (such as hallucination) should be significantly associated with the therapy of at least one drug (like Pimavanserin). Second, the formed drug-adverse event pair (like Pimavanserin-hallucination) should be significantly associated with the pandemic. Note: 1) exchange the order of the above two moves will not change the results. 2)this step is kind of computational expensive as containing nest loops.
- Step 7
7_results_analysis.ipynb
: This file mainly shows the results of our analysis, including the high-level comparison of different populations, enriched adverse events in each population, difference and common SOC of populations, significant drug-adverse event pairs in each population (both overrepresentation and underrepresentation), etc.
Our scripts are run in O2 cluster which is a platform for Linux-based high-performance computing at Harvard Medical School. The source codes are tested on Linux with Python 3.7, however, our scripts have no requirement of operating systems.
- We recommend the user run scripts in a virtual environment. Build a virtual environment named patientsafety and install Jupyter Notebook if you don't have it:
virtualenv patientsafety
source patientsafety/bin/activate
pip install jupyter
The installation will only takes a few minutes.
- Download our scripts to your local server/computer. Then creat a folder (such as
Code
) and copy all scripts in:
git clone https://github.com/mims-harvard/patient-safety.git
cd patient-safety
mkdir Code
mv *.ipynb Code/
- Create a new folder named
Data
at the same level asCode
. Then create subfolders:curated
for all necessary data to run our scripts including patient safety dataset and supporting files (such as drug and adverse event ontology);pandemic
to save intermediate results; a subfolder ofpandemic
calledresults
to save final analyze results;parsed
for datasets parsed from raw data through2_parsing_preprocessing.ipynb
. You can do that in a terminal as following
mkdir Data
mkdir Data/curated
mkdir Data/pandemic/
mkdir Data/pandemic/results
mkdir Data/parsed
-
Next we need to prepare the curated dataset and mapping dictionary (for drugs and adverse events). The necessary dataset to reproduce our work are processed patient safety reports (
patient_safety.pk
), drug mapping files (AE_mapping.pk
,AE_dic.pk
), and adverse event mapping (drug_mapping.pk
). Download them from here or here, and move data to folderData/curated/
. If your data path is different, please remember to revise file path accordingly. Note: we also provide a corresponding .csv version for all the necessary data. Find them in the same link. -
Before running the codes, please make sure you have configured the required environment. The dependencies and versions are listed in
requirements.txt
. All the necessary packages can be installed using the following command
pip install -r requirements.txt
- Open Jupyter Notebook and enjoy your analysis:
cd Code
jupyter notebook
About running time, 2_parsing_preprocessing.ipynb
, 3_generate_population_cohort.ipynb
, and 6_remove_drug_interference.ipynb
will take a rather long time (like a few hours) due to large-scale dataset and nest loops; the other scripts are much faster (such as ten minutes).
Our codes (*.ipynb
) are written in Jupyter Notebook. It is very easy to convert codes to python scripts. Take 3_generate_population_cohort.ipynb
as an example, run the following command in your terminal,
jupyter nbconvert --to script 3_generate_population_cohort.ipynb
The above command will generate a 3_generate_population_cohort.py
which can be directly run in python IDEs (such as PyCharm) or in terminal.
python 3_generate_population_cohort.py
The users can easily modify 3_generate_population_cohort.ipynb to generate their interested cohorts as a function of reporting time, age, sex, weight, drug, adverse events, etc. For example, our work investigates the period from 03-11
to 09-30
from 2013 to 2020. If anyone wants to analyze a different period, just replace the start or end time: such as replace 09-30
by 12-31
to study the period from March 11 to December 31.
We provide all necessary dataset for reproducing our work, or direct download files here.
In our scripts, data in pickle format are directly used to generate results. However, for preview and broader use, we also supply all required dataset in non-language-specific format (CSV version). See instructions for use these .pk
files here.
-
patient_safety.pk: Processed FAERS dataset (We depicted the procedure of raw data download, data parsing, and preprocessing in
1_rawdata_download.ipynb
and2_parsing_preprocessing.ipynb
). Duplicated reports are already removed. This is a DataFrame where each row denotes an independent report. The rows represent the report version, report ID, case ID, country, qualify, severity, severity subtype 1, severity subtype 2, severity subtype 3, severity subtype 4, severity subtype 5, severity subtype 6, receivedate, receiptdate, age, gender, weight, adverse events list (each element represents an adverse drug reaction including PT_code, PT name, and outcome), and drug list (each element represents a drug, the key information is drug substance). Please find more details about the meaning of each column in2_parsing_preprocessing.ipynb
andXML_NTS.pdf
. -
XML_NTS.pdf: Included in raw data released by FDA. This file describes the data structure in the XML files in detail. Use of the XML file adheres to the original standards documents for E2b and the M2 implementation specifications.
-
AE_mapping.pk: This is a DataFrame stored in
pickle
format. This file contains all adverse events in MedDRA ontology. Each row denotes one adverse event. The columns denote the name and code of the adverse event in different levels: PT (preferred terms), HLT (high-level term), HLGT (high-level group term), and SOC (system organ class). In particular, the columns are PT, PT_name, HLT, HLT_name, HLGT, HLGT_name, SOC, SOC_name, and SOC_abbr (abbreviation of SOC name). -
AE_dic.pk: Extracted adverse event name (PT_name) from AE_mapping.pk and form as a dictionary, which enables fast indexing the adverse event.
-
drug_mapping.pk: Dictionary that maps drug substance from text to DrugBank identifier retrieved from DrugBank Vocabulary. The key of the dictionary is textural name of drug substance, the corresponding value is a list including two elements: DrugBank ID and a number representing the order in which the drug appeared in the FAERS dataset.
- drug_ATC_mapping.csv: Anatomical Therapeutic Chemical (ATC) ontology. The ATC categorization is an internationally accepted classification system maintained by the WHO that classifies active ingredients of drugs according to the organ or system on which they act and their therapeutic, pharmacological, and chemical properties. This file contains the broad class of ATC codes, each subclass level in a separate column, the DrugBank ID, FAERS code ID, and string name.
Note: All data about MedDRA ontology are retrieved from MedDRA through Non-Profit/Non-Commercial License and can only be used in non-profit activities!
- patient_safety.csv, AE_mapping.csv, and *drug_mapping.csv are the same files in CSV format of patient_safety.pk, AE_mapping.pk, and drug_mapping.pk. These CSV files are not used in our scripts but offer opportunities for researchers to study population-scale patient safety in various platforms and programming languages.
We share the intermediate and final results of our analysis.
The following results are organized in the same way: beta
represents the association between adverse events and the pandemic period (p-value is measured by Fisher’s exact test, adjusted by Bonferroni correction). The adverse events are sorted in descending order of PAEAI. The defined PAEAI presents the extent (higher PAEAI suggests larger deviation) that adverse event deviates from its historical trend (Methods). All identified adverse events have positive PAEAI which indicates the reporting pattern during the pandemic doesn’t follow its trajectory. The ‘E’ and ‘P’ in the ‘E\P’ column denotes this adverse event is enriched or purified, respectively.
-
Identified_adverse_events_in_all_patients.csv: Summary of adverse events identified by our approach in all patients.
-
Identified_adverse_events_in_male.csv: Summary of adverse events identified by our approach in male patients.
-
Identified_adverse_events_in_female.csv: Summary of adverse events identified by our approach in female patients.
-
Identified_adverse_events_in_young.csv: Summary of adverse events identified by our approach in young patients.
-
Identified_adverse_events_in_adults.csv: Summary of adverse events identified by our approach in adult patients.
-
Identified_adverse_events_in_elderly.csv: Summary of adverse events identified by our approach in elderly patients.
We present intermediate results during our analysis. The top 100 results (adverse events or drug-AE pairs) the lowest p-value are already reported in supplementary Tables 7-9. Here show the full table of them.
-
Associations_between_adverse_events_and_the_pandemic.csv: Evaluate the association between adverse events and the pandemic period. The
Occur 2019
andNonoccur 2019
denote the submitted reports that contain and do not contain the specific adverse event in 2019 (March 11 -- September 30), respectively. Similarly,Occur 2020
andNonoccur 2020
denote the statistics in the same period of 2020. The p-value is measured by Fisher's exact test followed by Bonferroni correction. Higherbeta
and smaller p-value indicate the adverse event has stronger association with the pandemic. -
Associations_between_adverse_events_and_the_pandemic_Top100.csv: The top 100 adverse events in Associations_between_adverse_events_and_the_pandemic.csv with the lowest p-values. Sorted in descending order of
beta
. Correspond to SI Table 7. -
Associations_between_adverse_events_and_drugs.csv: Evaluate the association between adverse events and drugs during the pandemic period. Contains the number of patients who: exposed to the drug and have the adverse event (denoted by
A
); exposed to the drug but not have the adverse event (denoted byB
); not exposed to the drug but have the adverse event (denoted byC
); not exposed to the drug and not have the adverse event (denoted byD
). All the numbers (A, B, C, and D) are measured during the pandemic period (March 11 -- September 30, 2020). We conduct patient matching and split patients into test group and control group. The size of control group is ten times of that of test group, which means that 10 *(A+B) = C+D (Methods). Highergamma
and smaller p-value indicate the adverse event has stronger association with the drug. -
Associations_between_adverse_events_and_drugs_Top100.csv: The top 100 drug-adverse event pairs of Associations_between_adverse_events_and_drugs.csv with the lowest p-values. Sorted in descending order of
gamma
. Correspond to SI Table 8. -
Associations_between_drug_adverse_event_pairs_and_the_pandemic.csv: Evaluate the association between drug-adverse event pairs and the pandemic. The
Occur 2019
andNonoccur 2019
denote the number of submitted reports that contain and do not contain the specific drug-adverse event pair in 2019 (March 11 -- September 30), respectively. Similarly,Occur 2020
andNonoccur 2020
denote the statistics in the same period of 2020. The p-value is measured by Fisher's exact test followed by Bonferroni correction. Higherdelta
and smaller p-value indicate the pair has stronger connection with the pandemic. -
Associations_between_drug_adverse_event_pairs_and_the_pandemic_Top100.csv: The top 100 drug-adverse event pairs of Associations_between_drug_adverse_event_pairs_and_the_pandemic.csv with the lowest p-values. Sorted in descending order of
delta
. Correspond to SI Table 9.
All files in dataset can be downloaded here. Download portals for separate files:
-
Associations_between_adverse_events_and_the_pandemic_Top100.csv
-
Associations_between_drug_adverse_event_pairs_and_the_pandemic.csv
-
Associations_between_drug_adverse_event_pairs_and_the_pandemic_Top100.csv
Please send any questions you might have about the approach and/or the scripts to xiang_zhang@hms.harvard.edu.
This work is licensed under the MIT License.