This project contains code for running an analysis of adverse drug events using the Multi-Item Gamma Poisson Shrinker (MGPS) model described in "Empirical Bayes Screening for Multi-Item Associations".
This analysis is designed to be small enough to run on a single machine if you do not have access to a Hadoop cluster. You will need a version of CDH3 on your local machine, along with a compatible version of Pig.
You will need Maven to compile the Pig user-defined functions (UDFs), and you may also want a copy of R and Gephi for certain phases of the analysis.
The input data for this analysis may be downloaded from the FDA's AERS website. You'll need to get the ASCII version of the data files for as many quarters as you would like to run over. For my own analysis, I used the data from 2008 through 2010.
The Pig scripts below assume that the input data is stored in three HDFS directories under the user's home directory: aers/drugs, aers/demos, and aers/reactions. All of the DRUG*.TXT files from the AERS website should go into aers/drugs, all of the DEMO*.TXT files should go into aers/demos, and all of the REAC*.TXT files should go into aers/reactions.
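As a quick check before uploading, the directory layout described above can be sketched as a small lookup. This is only an illustration: the helper name `hdfs_target` and the sample file names in the usage note are hypothetical and are not part of this project.

```python
import posixpath

# Mapping from AERS file-name prefix to the HDFS directory the Pig scripts expect.
PREFIX_TO_DIR = {
    "DRUG": "aers/drugs",
    "DEMO": "aers/demos",
    "REAC": "aers/reactions",
}


def hdfs_target(filename):
    """Return the HDFS directory for an AERS ASCII file, or None if it is not used.

    Hypothetical helper: only DRUG*.TXT, DEMO*.TXT and REAC*.TXT are recognized.
    """
    name = posixpath.basename(filename).upper()
    if not name.endswith(".TXT"):
        return None
    return PREFIX_TO_DIR.get(name[:4])
```

For example, a file named `DRUG08Q1.TXT` would map to `aers/drugs`, while any other AERS file that the scripts do not read would map to `None`.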
If you have not done so already, load the input data into HDFS:
hadoop fs -mkdir aers
hadoop fs -mkdir aers/drugs
hadoop fs -put DRUG*.TXT aers/drugs
hadoop fs -mkdir aers/demos
hadoop fs -put DEMO*.TXT aers/demos
hadoop fs -mkdir aers/reactions
hadoop fs -put REAC*.TXT aers/reactions
Run each of the following commands from the project's top-level directory, i.e., the directory that contains this README file.
mvn package # Builds the Pig UDFs
pig -f src/main/pig/step1_join_drugs_reactions.pig
pig -f src/main/pig/step2_generate_drug_reaction_counts.pig
pig -f src/main/pig/step3_generate_squashed_distribution.pig
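To give a sense of the quantities an MGPS analysis works with, here is a minimal in-memory sketch of observed and expected counts under a simple independence assumption. It is not a translation of the Pig scripts: it uses single drug-reaction pairs for brevity, ignores any stratification, and the function name and input shape are assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def observed_and_expected(reports):
    """Compute (observed, expected) counts for each drug-reaction pair.

    `reports` is a list of (drugs, reactions) pairs, one per adverse event report.
    The expected count assumes drugs and reactions occur independently:
        E = (reports with drug) * (reports with reaction) / (total reports)
    """
    total = len(reports)
    drug_counts = Counter()
    reac_counts = Counter()
    pair_counts = Counter()
    for drugs, reacs in reports:
        drugs, reacs = set(drugs), set(reacs)  # count each item once per report
        drug_counts.update(drugs)
        reac_counts.update(reacs)
        pair_counts.update(product(drugs, reacs))
    return {
        (d, r): (n, drug_counts[d] * reac_counts[r] / total)
        for (d, r), n in pair_counts.items()
    }
```

A pair whose observed count is well above its expected count is a candidate signal; the MGPS model then shrinks that ratio toward the bulk of the data so that pairs with very small counts are not overstated.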
At this point, you can optionally run the R code to solve the MGPS optimization problem. If you do not already have the BB package installed in your local copy of R, install it with install.packages("BB").
hadoop fs -getmerge aers/drugs2_reacs_stats d2r_stats.csv
Rscript src/main/R/ebgm.R d2r_stats.csv
The output from the optimization run may be plugged into the Pig script that scores the tuples, or you can simply use the default parameters already in the script:
pig -f src/main/pig/step4_apply_ebgm.pig
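For readers who want to see what the scoring step computes, the following is a minimal sketch of the EBGM score as it is usually defined for the MGPS model: the prior on the reporting ratio is a mixture of two gamma distributions, and EBGM is the exponential of the posterior mean of its logarithm. This is written from the published formulas, not from the project's Pig UDFs, and the parameter values in the usage note are illustrative placeholders rather than the defaults in step4_apply_ebgm.pig.

```python
import math


def digamma(x):
    """Digamma function for x > 0, using the recurrence and an asymptotic series."""
    result = 0.0
    while x < 6.0:  # shift x upward until the series is accurate
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def log_negbin(n, alpha, beta, e):
    """Log marginal likelihood of count n under a Gamma(alpha, beta) prior and Poisson(lambda * e)."""
    return (math.lgamma(alpha + n) - math.lgamma(alpha) - math.lgamma(n + 1)
            - n * math.log1p(beta / e) - alpha * math.log1p(e / beta))


def ebgm(n, e, alpha1, beta1, alpha2, beta2, p):
    """Empirical Bayes geometric mean for observed count n and expected count e."""
    lw1 = math.log(p) + log_negbin(n, alpha1, beta1, e)
    lw2 = math.log(1.0 - p) + log_negbin(n, alpha2, beta2, e)
    m = max(lw1, lw2)  # subtract the max to avoid underflow
    w1, w2 = math.exp(lw1 - m), math.exp(lw2 - m)
    q = w1 / (w1 + w2)  # posterior weight of the first mixture component
    expected_log = (q * (digamma(alpha1 + n) - math.log(beta1 + e))
                    + (1.0 - q) * (digamma(alpha2 + n) - math.log(beta2 + e)))
    return math.exp(expected_log)
```

With placeholder parameters such as `ebgm(1, 0.01, 0.2, 0.1, 2.0, 4.0, 1/3)`, a single report against a tiny expected count scores far below its raw ratio of 100, which is the shrinkage the model is meant to provide. A well-supported pair such as `ebgm(1000, 100.0, ...)` stays close to its raw ratio of 10.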
The final output will be in aers/scored_drugs2_reacs. To generate the GEXF file of drug-drug interactions to load into Gephi, run:
hadoop fs -getmerge aers/scored_drugs2_reacs scored_d2r.csv
./src/main/python/gephi.py scored_d2r.csv > drugs.gexf
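If you want to build a graph file from your own filtered results, here is a minimal sketch of writing a GEXF 1.2 document with the Python standard library. It does not reproduce gephi.py: the function name and the `(source, target, weight)` edge format are assumptions made for illustration, and the column layout of scored_d2r.csv is not assumed here.

```python
import xml.etree.ElementTree as ET

GEXF_NS = "http://www.gexf.net/1.2draft"


def build_gexf(edges):
    """Return a GEXF document (as a string) for an undirected weighted graph.

    `edges` is an iterable of (source_label, target_label, weight) tuples.
    """
    gexf = ET.Element("gexf", {"xmlns": GEXF_NS, "version": "1.2"})
    graph = ET.SubElement(gexf, "graph",
                          {"mode": "static", "defaultedgetype": "undirected"})
    nodes_el = ET.SubElement(graph, "nodes")
    edges_el = ET.SubElement(graph, "edges")
    node_ids = {}  # label -> node id, assigned in order of first appearance
    for edge_id, (src, dst, weight) in enumerate(edges):
        for label in (src, dst):
            if label not in node_ids:
                node_ids[label] = str(len(node_ids))
                ET.SubElement(nodes_el, "node",
                              {"id": node_ids[label], "label": label})
        ET.SubElement(edges_el, "edge", {
            "id": str(edge_id),
            "source": node_ids[src],
            "target": node_ids[dst],
            "weight": str(weight),
        })
    return ET.tostring(gexf, encoding="unicode")
```

Writing the returned string to a file with a .gexf extension should give something Gephi can open, with each edge weight available for layout and filtering.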