For our final project, we are interested in people's health outcomes after substance use. The question we are trying to answer with our model is whether substance use predicts negative mental health outcomes.
Our main data source is the Substance Abuse and Mental Health Services Administration (SAMHSA) survey published in March 2023 (data from a 2019 survey).
With 58,034 rows and 2,988 variables to pull from, we thought this would be ample data to model.
- Random forest
- Numpy
- Pycaret
- Python
- Gradient Boosting Classifier
- imblearn packages
- Jupyter notebook
- Pandas
- Matplotlib
- Seaborn
Exploratory data analysis (EDA) was the first step in our project. Our interest was particularly in marijuana-related variables, given the recent discourse around cannabis use disorder.
First, we pulled a handful of marijuana and mental health indicators that focused on respondents' experiences in the past year. We focused on binary indicators during our initial EDA and preprocessed the data for an initial model.
Using a random forest (fresh_look.ipynb):
MJEVER, the indicator for whether respondents have previously used marijuana, was shown to be the most important feature.
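What we don't know is whether this model predicted better or worse mental health outcomes for marijuana users. Feature importance measures how much a feature contributes to the model's predictions, not the direction of its effect.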
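One way to probe that open question is to pair the importances with a simple comparison of outcome rates between users and non-users. The sketch below is only an illustration on synthetic data, not the code from fresh_look.ipynb: `OTHER_FLAG` and `outcome` are hypothetical names, the 0/1 coding of MJEVER is assumed, and the outcome probabilities are invented.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the survey extract; all values are invented.
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "MJEVER": rng.integers(0, 2, n),      # assumed coding: 1 = has used marijuana
    "OTHER_FLAG": rng.integers(0, 2, n),  # hypothetical second binary indicator
})

# Hypothetical binary outcome, made more common when MJEVER == 1.
p = 0.15 + 0.20 * df["MJEVER"]
df["outcome"] = (rng.random(n) < p).astype(int)

X, y = df[["MJEVER", "OTHER_FLAG"]], df["outcome"]
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Importance shows how much each feature is used, not which way it pushes.
importances = pd.Series(model.feature_importances_, index=X.columns)

# Direction check: share of respondents with the outcome, by marijuana use.
rate_by_use = df.groupby("MJEVER")["outcome"].mean()

print(importances.sort_values(ascending=False))
print(rate_by_use)
```

If the rate for `MJEVER == 1` is higher than for `MJEVER == 0`, the feature is associated with worse outcomes in this sample; this is a descriptive comparison and does not establish causation.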
After reading the codebook more thoroughly and gaining a better understanding of the methodology, we created a data frame (shown in MainV2_copy.ipynb) that would better answer our initial question, ultimately landing on 15 variables. These variables cover more substances and predictors of health, including demographic and economic factors.
- CATAG3 = Recoded field - 5 levels of age groups
- HEALTH2 = Recoded field - Overall health
- ANYHLTI2 = Recoded field - Any health insurance
- INCOME = Recoded field - Total family income
- POVERTY3 = Recoded field - Poverty level
- TOBFLAG = Used tobacco at least once in lifetime
- MRJFLAG = Used marijuana at least once in lifetime
- PYUD5MRJ = Recoded field - Marijuana disorder
- MJYRTOT = Use of marijuana - days of past year
- ALCFLAG = Used alcohol at least once in lifetime
- COCFLAG = Used cocaine at least once in lifetime
- CRKFLAG = Used crack at least once in lifetime
- HERFLAG = Used heroin at least once in lifetime
- LSDFLAG = Used LSD at least once in lifetime
- METHAMFLAG = Used methamphetamines at least once in lifetime
- Target (y): Depression - Combination of ADSMMDEA & YODSMMDE - shows at least 5 symptoms of depression
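The combined target could be built along the following lines. This is a minimal sketch under stated assumptions, not the code from MainV2_copy.ipynb: it assumes each column is coded 1 = yes and 2 = no, and that the column not applicable to a respondent's age group is missing (NaN). The actual codes should be checked against the codebook.

```python
import numpy as np
import pandas as pd

# Toy rows; the 1 = yes / 2 = no coding and NaN for "not applicable" are assumptions.
df = pd.DataFrame({
    "ADSMMDEA": [1, 2, np.nan, np.nan, np.nan],   # adult indicator
    "YODSMMDE": [np.nan, np.nan, 1, 2, np.nan],   # youth indicator
})

# Take the adult value where present, otherwise fall back to the youth value.
combined = df["ADSMMDEA"].combine_first(df["YODSMMDE"])

# Rows with neither value cannot be labeled, so drop them before building y.
has_label = combined.notna()
labeled = df.loc[has_label].copy()
labeled["depression"] = (combined[has_label] == 1).astype(int)

print(labeled)
```

Dropping unlabeled rows, rather than treating them as "no depression", avoids inflating the negative class with respondents whose status is unknown.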
We used the PyCaret library to compare candidate models; it ranked a Gradient Boosting Classifier as the best fit for our use (pycaret_addprev_starter.ipynb).
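PyCaret automates this kind of comparison. To show the underlying idea, the sketch below does a small manual version with scikit-learn cross-validation instead of PyCaret. It is not the notebook's code: the data is synthetic, the three candidates are an arbitrary selection, and ranking by F1 is an assumption chosen because the positive class is the minority.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced data with 15 features (stand-in for the survey frame).
X, y = make_classification(
    n_samples=500, n_features=15, weights=[0.85, 0.15], random_state=0
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Mean 5-fold cross-validated F1 score for each candidate.
scores = {
    name: cross_val_score(est, X, y, cv=5, scoring="f1").mean()
    for name, est in candidates.items()
}

# Best-scoring model first.
ranking = sorted(scores, key=scores.get, reverse=True)
for name in ranking:
    print(f"{name}: {scores[name]:.3f}")
```

The ranking depends on the metric: a model that wins on accuracy can lose on F1 or recall when the classes are imbalanced.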
After initial exploratory data analysis, we built a pilot model with only binary substance use variables using scikit-learn’s Random Forest Classifier. Our first model’s target was built from the variable ADDPREV (several days or longer when felt sad/empty/depressed). After a thorough review of the Survey Codebook, we expanded the predictors to include demographic information and selected a more reliable and representative target for Major Depression.
Feature importance: * insert image *
With depression as the target, our feature importances show that marijuana use was a meaningful predictor of mental health outcomes.
Marijuana use was among the most important features in all models compared to other binary substance use variables (aside from alcohol use in youth).
We can use the variable smoker model to “catch more” people with depression.
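If "catching more" people is read as raising recall (the share of people with depression that the model flags), one common lever is the decision threshold. The sketch below uses invented labels and probabilities, not our model's output, to show the trade-off: lowering the threshold raises recall and lowers precision.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Invented labels and predicted probabilities, for illustration only.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_prob = np.array([0.9, 0.6, 0.4, 0.2, 0.7, 0.45, 0.3, 0.1, 0.1, 0.05])

def recall_precision_at(threshold):
    """Return (recall, precision) when flagging probabilities >= threshold."""
    y_pred = (y_prob >= threshold).astype(int)
    return recall_score(y_true, y_pred), precision_score(y_true, y_pred)

strict = recall_precision_at(0.5)   # default-style cutoff
lenient = recall_precision_at(0.3)  # flags more people

print("threshold 0.5 (recall, precision):", strict)
print("threshold 0.3 (recall, precision):", lenient)
```

Whether the extra false positives are acceptable depends on how the flag is used, for example as a prompt for follow-up screening rather than as a diagnosis.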