- HTML: https://www.opencasestudies.org/ocs-bp-vaping-case-study/
- GitHub: https://github.com/opencasestudies/ocs-bp-vaping-case-study/
- Bloomberg American Health Initiative: https://americanhealth.jhu.edu/open-case-studies
The purpose of the Open Case Studies project is to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given dataset, and should not be used in the context of making policy decisions without external consultation from scientific experts.
This case study is part of the OpenCaseStudies project. This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0) United States License.
To cite this case study:
Wright, Carrie and Ontiveros, Michael and Meng, Qier and Jager, Leah and Taub, Margaret and Hicks, Stephanie. (2020). https://github.com/opencasestudies/ocs-bp-vaping-case-study. Vaping Behaviors in American Youth (Version v1.0.0).
We would like to acknowledge Renee Johnson for assisting in framing the major direction of the case study and for reviewing the case study for subject matter content.
We would like to acknowledge Michael Breshock for his contributions to this case study and for developing the OCSdata package.
We would also like to acknowledge the Bloomberg American Health Initiative for funding this work.
The total reading time for this case study was calculated with koRpus: About 75 minutes
The Flesch-Kincaid Readability Index was also calculated with koRpus: Grade 10, Age 15
Vaping Behaviors in American Youth
Recent research suggests that overall tobacco use by youths (middle school and high school students) has increased in the last few years, despite a longer-term decline in previous years. This increase has been attributed to a rapid and dramatic rise in the use of e-cigarettes and other vaping products starting in 2017. This case study explores trends in tobacco product use among American youths surveyed in the National Youth Tobacco Survey (NYTS), an annual survey that asks middle school and high school students (grades 6-12) in the United States about tobacco use. Although the survey started in 1999, this case study uses data from 2015-2019, as these are the only years that asked questions about e-cigarette usage.
- How has tobacco and e-cigarette/vaping use by American youths changed since 2015?
- How does e-cigarette use compare between males and females?
- What vaping brands and flavors appear to be used the most frequently?
  This is based on the following survey questions:
  > “During the past 30 days, what brand of e-cigarettes did you usually use?”
  >
  > “What flavors of tobacco products have you used in the past 30 days?”
- Is there a relationship between e-cigarette/vaping use and other tobacco use?
Survey data from the National Youth Tobacco Survey (NYTS) for 2015, 2016, 2017, 2018, and 2019. Each year has its own codebook and Excel file. Questions differed slightly from year to year.
The skills, methods, and concepts that students will be familiar with by the end of this case study are:
Data Science Learning Objectives:
1. Import data from Excel files
2. Merge data from multiple similar but not identical data structures
3. Create effective longitudinal data visualizations
4. Write functions in R
5. Apply functions across data subsets using `purrr` and `dplyr` functionality
Statistical Learning Objectives:
- Understanding of different types of longitudinal data
- Usage of code books
- Conceptual understanding of survey weighting
- Implementing logistic regression with survey weighting
In this case study we cover data import using the Tidyverse `readxl`
package to import the Excel files for each year of the survey. We also
use the `map()` function of the Tidyverse `purrr` package to efficiently
import all the files with one command.
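As a minimal sketch of this import pattern: the yearly NYTS file names are not given here, so the two example workbooks that ship with `readxl` stand in for them. The names `file_a` and `file_b` are placeholders chosen for illustration.

```r
library(readxl)
library(purrr)

# Placeholder paths: readxl's bundled example workbooks stand in for the
# yearly survey files (an assumption made only for this illustration).
paths <- c(
  file_a = readxl_example("datasets.xlsx"),
  file_b = readxl_example("datasets.xls")
)

# map() calls read_excel() once per path and returns a named list of tibbles
survey_list <- map(paths, read_excel)

names(survey_list)
```

Keeping the imported tables in a named list makes it straightforward to apply the same cleaning step to every year later on.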
This case study goes into great detail about using codebooks to select
the survey questions of interest and to recode the numeric data, using
the `recode()` function of the `dplyr` package, to reflect the responses
of the students surveyed. As multiple questions needed to be similarly
recoded across the different survey years, we demonstrate how to write
functions and use the `purrr` package to apply these functions
efficiently to all the data for the various years.
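The idea can be sketched as follows. The column name `sex_code` and the code-to-label mapping are hypothetical and are not taken from the NYTS codebooks; the function name `recode_sex` is likewise only illustrative.

```r
library(dplyr)
library(purrr)

# Toy stand-ins for two survey years (hypothetical column and codes)
nyts_toy <- list(
  `2015` = tibble(sex_code = c(1, 2, 2, NA)),
  `2016` = tibble(sex_code = c(2, 1, 1, 2))
)

# The recoding rule lives in one function so it is written only once
recode_sex <- function(data) {
  data %>%
    mutate(sex = recode(sex_code, `1` = "male", `2` = "female"))
}

# Apply the same function to every year's table
nyts_toy <- map(nyts_toy, recode_sex)

nyts_toy[["2015"]]$sex
```

Missing numeric codes stay missing after recoding, which keeps nonresponse visible in later steps.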
We also cover how to create new variables using the `mutate()` and
`case_when()` functions of the `dplyr` package to represent specific
subgroups of surveyed students that meet various conditions. Finally, we
demonstrate how to use the `bind_rows()` function of the `dplyr` package
to combine data.
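A small sketch of these three functions working together, using made-up tables. The column names, values, and group labels below are assumptions for illustration, not the variables used in the case study.

```r
library(dplyr)

# Hypothetical yearly tables; the second has an extra column
y2015 <- tibble(year = 2015,
                ever_ecig    = c(TRUE, FALSE, TRUE),
                current_ecig = c(TRUE, FALSE, FALSE))
y2016 <- tibble(year = 2016,
                ever_ecig    = c(FALSE, TRUE),
                current_ecig = c(FALSE, TRUE),
                grade        = c(7, 11))

# bind_rows() stacks the tables and fills columns missing from a table with NA;
# case_when() assigns the first label whose condition is TRUE
combined <- bind_rows(y2015, y2016) %>%
  mutate(ecig_group = case_when(
    current_ecig ~ "current user",
    ever_ecig    ~ "tried, not current",
    TRUE         ~ "never user"
  ))

combined
```

Because `case_when()` evaluates conditions in order, the most specific condition should come first.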
This case study particularly focuses on creating effective
visualizations to compare groups over time using the Tidyverse `ggplot2`
package.
We also cover how to add confidence interval error bars to
`geom_line()` plots using `geom_segment()`.
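One way this could look is sketched below. The estimates and interval bounds are invented purely to make the plot runnable and do not come from the survey.

```r
library(ggplot2)

# Invented estimates and interval bounds, for illustration only
est <- data.frame(
  year  = 2015:2019,
  pct   = c(10, 11, 12, 20, 27),
  lower = c(8, 9, 10, 17, 24),
  upper = c(12, 13, 14, 23, 30)
)

p <- ggplot(est, aes(x = year, y = pct)) +
  geom_line() +
  geom_point() +
  # each segment is vertical: same x at both ends, from lower to upper bound
  geom_segment(aes(x = year, xend = year, y = lower, yend = upper)) +
  labs(x = "Survey year", y = "Percent (toy data)")

p
```

Drawing the intervals as segments gives full control over where each bar starts and ends, at the cost of supplying both endpoints explicitly.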
This case study covers the use of the `srvyr` package to calculate
survey-weighted means of various groups using information about the
survey design, strata, survey weights, and Primary Sampling Unit (PSU)
from the codebooks and Methodology Reports for the respective survey years.
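The following sketch shows the general shape of such a calculation on a tiny invented survey with two strata and three PSUs per stratum. All column names, weights, and responses are assumptions for illustration; the real design variables come from the NYTS codebooks.

```r
library(dplyr)
library(srvyr)

# Invented survey: 2 strata x 3 PSUs x 2 respondents, with made-up weights
toy <- tibble(
  stratum = rep(1:2, each = 6),
  psu     = rep(rep(1:3, each = 2), times = 2),
  weight  = rep(c(10, 20, 30, 5, 15, 20), each = 2),
  sex     = rep(c("female", "male"), times = 6),
  vaped   = c(1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0)
)

# nest = TRUE because PSU ids are reused across strata
design <- toy %>%
  as_survey_design(ids = psu, strata = stratum, weights = weight, nest = TRUE)

# Weighted proportion who vaped, by sex, with a confidence interval
weighted <- design %>%
  group_by(sex) %>%
  summarize(prop_vaped = survey_mean(vaped, vartype = "ci"))

weighted
```

In this toy data the unweighted proportion is 0.5 for both groups, while the weighted proportions differ (35/100 and 55/100 of the total weight), which is the kind of shift that survey weighting can produce.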
We also perform a survey-weighted logistic regression analysis comparing
vaping rates among males and females using the `svyglm()` function of
the `survey` package.
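A minimal sketch of a survey-weighted logistic regression follows, again on an invented design (two strata, three PSUs each). The variable names and values are assumptions, and the `quasibinomial()` family is one common choice for weighted binary outcomes rather than necessarily the exact specification used in the case study.

```r
library(survey)

# Invented survey data, for illustration only
toy <- data.frame(
  stratum = rep(1:2, each = 6),
  psu     = rep(rep(1:3, each = 2), times = 2),
  weight  = rep(c(10, 20, 30, 5, 15, 20), each = 2),
  sex     = rep(c("female", "male"), times = 6),
  vaped   = c(1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0)
)

# Describe the design: PSUs nested within strata, plus sampling weights
design <- svydesign(ids = ~psu, strata = ~stratum, weights = ~weight,
                    data = toy, nest = TRUE)

# Logistic regression of vaping on sex, accounting for the survey design
fit <- svyglm(vaped ~ sex, design = design, family = quasibinomial())

# Exponentiated coefficients: baseline odds and the male vs. female odds ratio
exp(coef(fit))
```

With a single binary predictor, the fitted odds ratio matches the one computed directly from the weighted proportions in each group.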
- Tidyverse
- Writing functions
- Codebooks
- Longitudinal studies
- Panel data
- Cross-sectional data
- Survey weighting
- Confidence intervals
- Introduction to Logarithms
- Logarithm Rules of logs
- Odds ratio
- Log odds
- 2x2 table
- Probability
- Likelihood function
- Maximum likelihood estimates
- Linear regression model
- Logistic regression
- Quasi-likelihood
- Binomial distribution
For more information on linear regression see this book and this case study.
For more information on survey designs see here and here.
For more information on survey analysis in R, see here and here.
If you are interested in an info-graphic summary of the 2019 findings, and links to many more resources about this topic and data set, see the FDA’s website here.
Packages used in this case study:
Package | Use in this case study |
---|---|
here | to easily load and save data |
readxl | to import the data in the Excel files |
magrittr | to use the compound assignment pipe operator `%<>%` |
stringr | to manipulate the character strings within the data |
purrr | to import the data in all the different Excel and CSV files efficiently |
dplyr | to arrange/filter/select/compare specific subsets of the data |
readr | to import the CSV file data |
tidyr | to rearrange data in wide and long formats |
ggplot2 | to make visualizations with multiple layers |
scales | to allow us to look at the colors within the viridis package |
viridis | to make plots with a color palette that is compatible with color blindness |
forcats | to allow for reordering of factors in plots |
naniar | to make a visualization of missing data |
srvyr | to use survey weights |
cowplot | to allow plots to be combined |
broom | to create nicely formatted model output |
survey | to fit survey-weighted logistic regression |
There is a `Makefile` in this folder that allows you to type `make` to
knit the case study contained in `index.Rmd` to `index.html`; it will
also knit `README.Rmd` to a markdown file (`README.md`).
Instructors can start at the Data Visualization section or at the Survey Weighting section. However, if instructors choose to start at the Survey Weighting section, then they need to comment out or delete the Summary Plot section.
This case study is intended for individuals or classes with some familiarity with regression. See this case study for an introduction to regression.
Calculate confidence intervals for the unweighted estimates and add the
appropriate error bars to the main figures. Apply survey weights to one
of the figures produced in this case study in which weighted estimates
were not produced. Include error bars in the updated figure. Does the
figure change after the application of survey weights? If so, describe
how.
Reproduce `final_plot` above for a different cohort of your choice.
Focusing on a single year of data, explore demographic factors that
contribute to tobacco use of some kind. Compare results of unweighted
and weighted analysis (for example, using the `svyglm()` function to
calculate survey-weighted logistic regression estimates).
About 107 to 117 seconds
This compilation time was measured on a PC running Windows 10. This range should only be used as an estimate, as compilation time will vary across machines and operating systems.