OpenCaseStudies

Important links

Disclaimer

The purpose of the Open Case Studies project is to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given dataset, and should not be used in the context of making policy decisions without external consultation from scientific experts.

License

This case study is part of the OpenCaseStudies project. This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0) United States License.

Citation

To cite this case study:

Wright, Carrie and Ontiveros, Michael and Jager, Leah and Taub, Margaret and Hicks, Stephanie. (2020). https://github.com/opencasestudies/ocs-bp-vaping-case-study. Vaping Behaviors in American Youth (Version v1.0.0).

Acknowledgments

We would like to acknowledge Renee Johnson for assisting in framing the major direction of the case study and for reviewing the case study for subject matter content.

We would also like to acknowledge the Bloomberg American Health Initiative for funding this work.

Title

Vaping Behaviors in American Youth

Motivation

Recent research suggests that overall tobacco use by youths (middle school and high school students) has increased in the last few years, despite a longer trend of declining use in previous years. This increase has been attributed to a rapid and dramatic increase in the use of e-cigarettes and other vaping products starting in 2017. This case study explores trends in tobacco product usage among American youths surveyed in the National Youth Tobacco Survey (NYTS), an annual survey that asks middle school and high school students (grades 6-12) in the United States of America about tobacco usage. Although the survey started in 1999, this case study uses data from 2015-2019, as these are the only years in which the survey asked about e-cigarette usage.

Motivating questions

  1. How has tobacco and e-cigarette/vaping use by American youths changed since 2015?

  2. How does e-cigarette use compare between males and females?

  3. What vaping brands and flavors appear to be used the most frequently?
    This is based on the following survey questions:
    > “During the past 30 days, what brand of e-cigarettes did you usually use?”
    > “What flavors of tobacco products have you used in the past 30 days?”

  4. Is there a relationship between e-cigarette/vaping use and other tobacco use?

Data

This case study uses survey data from the National Youth Tobacco Survey (NYTS) for 2015, 2016, 2017, 2018, and 2019. Each year has its own code book and Excel file. The questions differ slightly from year to year.

Learning Objectives

The skills, methods, and concepts that students will be familiar with by the end of this case study are:

Data Science Learning Objectives:

  1. Import data from Excel files
  2. Merge data from multiple similar but not identical data structures
  3. Create effective longitudinal data visualizations
  4. Write functions in R
  5. Apply functions across data subsets using purrr and dplyr functionality

Statistical Learning Objectives:

  1. Understanding of different types of longitudinal data
  2. Usage of code books
  3. Conceptual understanding of survey weighting
  4. Implementing logistic regression with survey weighting

Data import

In this case study we cover data import using the Tidyverse readxl package to import the Excel files for each year of the survey. We also use the map() function of the Tidyverse purrr package to efficiently import all the files with one command.
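As a minimal sketch of this import pattern (not the case study's actual code), the example below reads two workbooks with a single map() call. It uses the example files that ship with readxl as stand-ins; in the case study the paths would instead point at one NYTS Excel file per survey year, and the list names used here are assumptions.

```r
library(readxl)
library(purrr)

# Stand-in files: readxl's bundled example workbooks. The real analysis would
# list one NYTS Excel file per survey year here (those paths are not shown).
files <- c(
  file_a = readxl_example("datasets.xlsx"),
  file_b = readxl_example("datasets.xls")
)

# map() calls read_excel() once per path and returns a named list of tibbles.
sheets <- map(files, read_excel)

names(sheets)          # the names given in `files`
map_int(sheets, nrow)  # number of rows read from each file
```

Keeping the result as a named list makes it easy to apply the same wrangling function to every year later on.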

Data wrangling

This case study goes into great detail about using codebooks to select the survey questions of interest and to recode the numeric data with the recode() function of the dplyr package so that the values reflect the responses of the students surveyed. Because multiple questions needed to be recoded in the same way across the survey years, we demonstrate how to write functions and use the purrr package to apply them efficiently to the data for every year.
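The recoding workflow can be sketched roughly as follows. The toy data, the column name ever_ecig, and the 1 = "Yes" / 2 = "No" coding are all assumptions for illustration; the real codings come from each year's codebook.

```r
library(dplyr)
library(purrr)

# Hypothetical toy data: one small tibble per survey year.
# `ever_ecig` and its 1/2 coding are assumed, not taken from the NYTS codebooks.
nyts_toy <- list(
  `2015` = tibble(ever_ecig = c(1, 2, 2)),
  `2016` = tibble(ever_ecig = c(2, 1, 1, 2))
)

# A small function that recodes the numeric codes into readable labels.
recode_yes_no <- function(df) {
  df %>%
    mutate(ever_ecig = recode(ever_ecig, `1` = "Yes", `2` = "No"))
}

# Apply the same recoding to every year's data with one call.
nyts_recoded <- map(nyts_toy, recode_yes_no)

nyts_recoded[["2015"]]
```

Writing the recoding once as a function means a change to the coding only has to be made in one place.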

We also cover how to create new variables using the mutate() function and the case_when() function of the dplyr package to represent specific subgroups of surveyed students that meet various conditions.
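A minimal sketch of this step, using made-up columns (ever_ecig, ever_cig) and group labels that are assumptions rather than the case study's own variables:

```r
library(dplyr)

# Hypothetical toy data with two already-recoded survey items.
toy <- tibble(
  ever_ecig = c("Yes", "Yes", "No", "No"),
  ever_cig  = c("Yes", "No", "Yes", "No")
)

# case_when() checks the conditions in order and uses the first one that is TRUE.
toy <- toy %>%
  mutate(group = case_when(
    ever_ecig == "Yes" & ever_cig == "Yes" ~ "Both",
    ever_ecig == "Yes"                     ~ "Vaping only",
    ever_cig == "Yes"                      ~ "Cigarettes only",
    TRUE                                   ~ "Neither"
  ))

toy
```

Because the conditions are evaluated in order, the most specific condition is listed first.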

Finally, we demonstrate how to use the bind_rows() function of the dplyr package to combine the data for the different survey years.
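The sketch below shows one way bind_rows() can combine per-year data frames whose columns do not match exactly. The column names (ever_ecig, brand) are hypothetical.

```r
library(dplyr)

# Hypothetical per-year data; the 2016 data has an extra column.
by_year <- list(
  `2015` = tibble(ever_ecig = c("Yes", "No")),
  `2016` = tibble(ever_ecig = c("No", "Yes"), brand = c(NA, "Other"))
)

# .id stores each list name in a new `year` column; columns that are missing
# from a given year are filled with NA.
combined <- bind_rows(by_year, .id = "year")

combined
```

Using a named list with .id keeps track of which survey year each row came from.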

Data Visualization

This case study particularly focuses on creating effective visualizations to compare groups over time using the Tidyverse ggplot2 package.

We also cover how to add confidence interval error bars to geom_line() plots using geom_segment().
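As a rough illustration of this technique, the sketch below draws a line over time and adds a vertical segment per year spanning a lower and upper bound. The numbers are placeholders invented for the example, not NYTS estimates, and the column names are assumptions.

```r
library(ggplot2)

# Placeholder values for illustration only (NOT real survey estimates).
est <- data.frame(
  year  = 2015:2019,
  pct   = c(10, 12, 11, 15, 18),
  lower = c(9, 10, 10, 13, 16),
  upper = c(11, 14, 12, 17, 20)
)

p <- ggplot(est, aes(x = year, y = pct)) +
  geom_line() +
  geom_point() +
  # One vertical segment per year, from the lower to the upper bound.
  geom_segment(aes(x = year, xend = year, y = lower, yend = upper)) +
  labs(x = "Survey year", y = "Percent (placeholder values)")

p
```

geom_errorbar() is another common way to draw intervals; geom_segment() is used here because it is the function named in the case study.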

Analysis

This case study covers the use of the srvyr package to calculate survey-weighted means of various groups using information about the survey design, strata, survey weights, and Primary Sampling Unit (PSU) from the codebooks and Methodology Reports for the respective survey years.
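A minimal sketch of a survey-weighted mean with srvyr, on a tiny invented data set. The column names (stratum, psu, weight, current_ecig) and all values are assumptions; the real design variables are documented in the NYTS codebooks.

```r
library(dplyr)
library(srvyr)

# Invented toy data: 2 strata, 2 PSUs per stratum, 2 students per PSU.
toy <- tibble(
  stratum      = c(1, 1, 1, 1, 2, 2, 2, 2),
  psu          = c(1, 1, 2, 2, 3, 3, 4, 4),
  weight       = c(10, 10, 20, 20, 15, 15, 5, 5),
  current_ecig = c(1, 0, 0, 1, 0, 0, 1, 0)
)

# Describe the survey design: PSU ids, strata, and sampling weights.
design <- toy %>%
  as_survey_design(ids = psu, strata = stratum, weights = weight)

# Weighted proportion of current e-cigarette users with a confidence interval.
est <- design %>%
  summarize(ecig_rate = survey_mean(current_ecig, vartype = "ci"))

est
```

Adding group_by() before summarize() gives one weighted estimate per group, for example per survey year or per sex.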

We also perform a survey-weighted logistic regression analysis comparing vaping rates between males and females using the svyglm() function of the survey package.
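The sketch below fits a survey-weighted logistic regression with svyglm() on simulated data. Everything about the data (column names, weights, and the simulated difference between sexes) is invented for illustration and does not reflect NYTS results.

```r
library(survey)

set.seed(1)
n <- 200

# Simulated toy data (NOT survey results): 2 strata with 4 PSUs each.
toy <- data.frame(
  stratum = rep(1:2, each = n / 2),
  psu     = rep(1:8, each = n / 8),
  weight  = runif(n, 50, 150),
  sex     = factor(sample(c("Female", "Male"), n, replace = TRUE))
)
toy$current_ecig <- rbinom(n, 1, ifelse(toy$sex == "Male", 0.3, 0.2))

# Describe the design, then fit the weighted model.
design <- svydesign(ids = ~psu, strata = ~stratum, weights = ~weight, data = toy)

# quasibinomial() is commonly used with svyglm() to avoid warnings about
# non-integer counts that arise from the weights.
fit <- svyglm(current_ecig ~ sex, design = design, family = quasibinomial())

summary(fit)
exp(coef(fit))  # coefficients on the odds (ratio) scale
```

The exponentiated coefficient for sexMale is the estimated odds ratio for males relative to females, the reference level.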

Other notes and resources

Tidyverse
Writing functions
Codebooks
Longitudinal studies
Panel data
Cross-sectional data
Survey weighting
Confidence intervals
Introduction to Logarithms
Logarithm
Rules of logs
Odds ratio
Log odds
2x2 table
Probability
Likelihood function
Maximum likelihood estimates
Linear regression model
Logistic regression
Quasi-likelihood
Binomial distribution

For more information on linear regression see this book and this case study.

For more information on survey designs see here and here.

For more information on survey analysis in R see here and here.

If you are interested in an info-graphic summary of the 2019 findings, and links to many more resources about this topic and data set, see the FDA’s website here.

Packages used in this case study:

| Package | Use in this case study |
|---|---|
| here | to easily load and save data |
| readxl | to import the data in the Excel files |
| magrittr | to use the compound assignment pipe operator %<>% |
| stringr | to manipulate the character strings within the data |
| purrr | to import the data in all the different Excel and CSV files efficiently |
| dplyr | to arrange/filter/select/compare specific subsets of the data |
| readr | to import the CSV file data |
| tidyr | to rearrange data in wide and long formats |
| ggplot2 | to make visualizations with multiple layers |
| scales | to allow us to look at the colors within the viridis package |
| viridis | to make plots with a color palette that is compatible with color blindness |
| forcats | to allow for reordering of factors in plots |
| naniar | to make a visualization of missing data |
| srvyr | to use survey weights |
| cowplot | to allow plots to be combined |
| broom | to create nicely formatted model output |
| survey | to fit survey-weighted logistic regression |

For users

There is a Makefile in this folder. Typing make knits the case study contained in index.Rmd to index.html and also knits README.Rmd to a markdown file (README.md).

For instructors

Instructors can start at the Data Visualization section or at the Survey Weighting section. However, if instructors choose to start at the Survey Weighting section, then they need to comment out or delete the Summary Plot section.

Target audience

This case study is intended for individuals or classes with some familiarity with regression. See this case study for an introduction to regression.

Suggested homework

  1. Calculate confidence intervals for the unweighted estimates and add the appropriate error bars to the main figures.
  2. Apply survey weights to one of the figures produced in this case study for which weighted estimates were not produced. Include error bars in the updated figure. Does the figure change after the application of survey weights? If so, describe how.
  3. Reproduce final_plot from the case study for a different cohort of your choice.
  4. Focusing on a single year of data, explore demographic factors that contribute to tobacco use of some kind. Compare the results of unweighted and weighted analyses (for example, using the svyglm() function to calculate survey-weighted logistic regression estimates).