Alex Giarrocco | Sonia Kopel | Neal Sakash |
- Source Material
- Project Details
- Results
- Introduction
- Data Used
- Technologies Used
- Scrum Timeline
- Variable Correlations
- Hospitalization Model
- Behavior Models
- Challenges and Triumphs
- Medical Expenditure Panel Survey
- 2015 Full Year Consolidated Data File
- 2015 Person Round Plan Public Use File
Benefitfocus Deliverable:
"The team will need to explore models for computing the impact of various medical benefit plan attributes
(deductible, coinsurance, copay, etc.) on healthcare utilization and personal well-being. The team will need to
research and define what is meant by personal well-being. Using those models, the team will build an interactive
web app enabling a user to visualize the relationships between medical plan design values and healthcare
utilization and well-being. The students will have the freedom to use modeling techniques, programming languages,
and frameworks of their own choice."
With the guidance of Benefitfocus, our team set out to find a link between health plan design, patient behavior, and healthcare utilization. Because the project was limited to a single semester, we narrowed its scope to utilization and set personal well-being aside. We chose the most costly healthcare expenditure, inpatient hospitalization, as our primary target variable and succeeded in building a classification model that predicts this utilization from patient behavior with accuracy above 88%.
Healthcare spending is one of the highest consumer costs in the United States. At $3.3 trillion, it accounted for almost a fifth of the country's GDP in 2016. This share has nearly doubled from 30 years ago. Between 2005 and 2014, healthcare costs rose 15 percentage points more than the 23% increase in the consumer price index. Of all major goods and services, only child care and higher education have experienced faster price increases than healthcare. Externalities from this increased spending can contribute to slow wage growth, temporary or part-time employment, outsourcing, reduced access to care, and personal bankruptcies.
Navigating modern healthcare policy has become increasingly daunting for both employers and employees. With employers strained by rising costs, employees are now expected to take greater responsibility for choosing, managing, and paying for their coverage.
Hospitalizations have historically been the largest expense in healthcare, accounting for nearly a third of all costs in 2016. If the rate of hospitalizations were to decrease, savings from the reduced spending could be passed on to employer and employee premiums.
Benefitfocus seeks to mitigate this expenditure and has partnered with the College of Charleston to help predict inpatient hospitalizations. Our team has built a model to predict this expenditure and has found a link between a patient's behavior and their chances of being hospitalized. We further hoped to determine whether this behavioral link could be an extension of the patient's plan design.
For our predictive model we used data from the publicly available and federally administered Medical Expenditure Panel Survey (MEPS). The specific dataset came from the 2015 consolidated survey of families and individuals, their medical providers, and employers across the US. The dataset includes the specific health services used, how frequently they were used, the cost of these services, and how they were paid for. From this dataset we were able to parse out our predictor, control, and target variables relating to plan design, patient behavior, and hospitalizations. MEPS has influenced every major US healthcare policy decision since its inception in 1996.
R was our primary programming language for the project's statistical computation and modeling. We used RStudio as our IDE, dplyr to streamline data manipulation, and ggplot2 for data visualization. We used R Shiny to develop the interactive web application that displays our results. GitHub was used for team collaboration, development, and version control, together with ZenHub, an agile project management tool integrated with GitHub.
For the agile development process we broke the project into six sprints, with weekly correspondence with the team at Benefitfocus and stand-ups with our professor and advisor, Dr. Paul Anderson.
To gain a top-level understanding of the data set, we generated variable correlation plots using all of the numeric and ordered variables. It is evident from these plots that behaviors are generally correlated with one another. Because of this, some of the predictors may carry the same information, which could affect our variable importance results later in modeling.
We noticed weaker correlations among Pap smear, breast exam, and mammogram. Because men do not receive those exams, we subset the data set by gender to confirm our finding that behaviors are strongly correlated. In the women-only subset plot, the correlations for those variables are stronger.
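The dilution effect described above can be sketched with a rank correlation on toy data. This is only an illustration in standard-library Python, not the team's R code: the column values, the placeholder code 0 for "exam not applicable", and the function names are all assumptions made for the example.

```python
from statistics import mean

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; NaN if either variable is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical ordinal answers (higher = more recent exam).
# Men carry a placeholder 0 for the mammogram item (assumed coding).
rows = [
    ("F", 1, 1), ("F", 2, 2), ("F", 3, 3), ("F", 4, 4), ("F", 5, 5),
    ("M", 1, 0), ("M", 2, 0), ("M", 3, 0), ("M", 4, 0), ("M", 5, 0),
]

checkup_all = [r[1] for r in rows]
mammogram_all = [r[2] for r in rows]
checkup_women = [r[1] for r in rows if r[0] == "F"]
mammogram_women = [r[2] for r in rows if r[0] == "F"]

rho_all = spearman(checkup_all, mammogram_all)
rho_women = spearman(checkup_women, mammogram_women)
print(f"all respondents: {rho_all:.2f}, women only: {rho_women:.2f}")
```

In this toy sample the placeholder values for men pull the pooled correlation well below the women-only value, which is the same pattern that motivated subsetting by gender.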
Our best model is a random forest that predicts hospitalization using predictor variables from three groups: behaviors, controls, and plan design. It performs well compared to a baseline model that always predicts the majority class (non-hospitalization).
We then extended this approach to predict behaviors instead of hospitalization. Due to time constraints, we were not able to tune these models to the performance we wanted, although they do offer a little predictive power. In the future, it may be best to collapse the behaviors into fewer levels. Even so, some of the models did pick out plan design variables as important when predicting behavior.
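To make the baseline comparison concrete, here is a minimal sketch of how a majority-class baseline can be scored against a model's predictions. It is written in standard-library Python for illustration only; the labels and predictions are made-up toy values, not MEPS data or the team's results, and the function names are hypothetical.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Share of actual positives that the predictions catch."""
    hits = [p == positive for t, p in zip(y_true, y_pred) if t == positive]
    return sum(hits) / len(hits) if hits else float("nan")

# Toy labels: 1 = hospitalized, 0 = not hospitalized (made up).
y_true = [0] * 15 + [1] * 5
baseline_pred = [0] * 20                       # always the majority class
model_pred = [0] * 14 + [1] + [1, 1, 1, 1, 0]  # hypothetical model output

print("baseline accuracy:", majority_baseline_accuracy(y_true))
print("model accuracy:   ", accuracy(y_true, model_pred))
print("baseline recall:  ", recall(y_true, baseline_pred))
print("model recall:     ", recall(y_true, model_pred))
```

Reporting recall next to accuracy is a design choice for this sketch: a majority-class baseline can score a high accuracy on imbalanced data while never identifying a single hospitalization.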
Our team further wanted to see if certain preventive care measures could be grouped together in order to simplify our prediction. We summarized the behaviors of respondents by those who follow or do not follow CDC guidelines and divided age into three categories: less than 40, 40-60, and greater than 60. We, however, did not see a greater improvement using this generalization, since it appears someone who is already getting regular checkups is also following through with recommended preventive care.
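The age grouping described above can be sketched as follows. The three bands come from the text; treating 40 and 60 as part of the middle band, the record layout, and the `follows_guidelines` flag are assumptions for illustration, and the sketch uses standard-library Python, not the project's R code.

```python
from collections import defaultdict

def age_group(age):
    """Map an age in years to one of the three bands used in the write-up."""
    if age < 40:
        return "under 40"
    if age <= 60:
        return "40-60"
    return "over 60"

# Hypothetical respondents: (age, follows_guidelines)
respondents = [(25, False), (39, True), (40, True), (55, True), (60, False), (72, True)]

# Count guideline followers per age band.
totals = defaultdict(int)
followers = defaultdict(int)
for age, follows in respondents:
    group = age_group(age)
    totals[group] += 1
    followers[group] += int(follows)

adherence = {group: followers[group] / totals[group] for group in totals}
print(adherence)
```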
Data prep was one of our largest challenges. It took some time to understand the MEPS survey, how it was conducted, and which variables we needed to look into. Another challenge was the imbalanced nature of the dataset: there are many more observations of non-hospitalization than hospitalization among survey respondents. We needed to weight hospitalization observations more heavily for our model to achieve better generalization accuracy.
Out of 1900 variables, we were able to isolate specific features for plan design, behaviors, and hospitalizations from the MEPS dataset. We created a model for predicting hospitalizations from patient behaviors with an accuracy above 88 percent, unprecedented in the industry. We also developed an interactive R Shiny app to display our results for Benefitfocus staff.
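One common way to weight a rare class more heavily is inverse-frequency weighting, sketched below in standard-library Python. This is an assumption about how such weights could be derived, not necessarily the scheme the team used in R. The formula n / (k * count) is the "balanced" heuristic found in libraries such as scikit-learn, and the label counts are made up.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}

# Toy labels: 1 = hospitalized, 0 = not hospitalized (made-up proportions).
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)

# With these weights each class contributes the same total weight.
total_weight = Counter()
for label in labels:
    total_weight[label] += weights[label]
print(dict(total_weight))
```

Under this scheme each hospitalization observation counts for more than each non-hospitalization observation, so the two classes carry equal total weight during training.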