OpenCaseStudies

Important links

Static version: https://www.opencasestudies.org/ocs-bp-rural-and-urban-obesity
Interactive version: https://rsconnect.biostat.jhsph.edu/ocs-bp-rural-and-urban-obesity-interactive/
GitHub: https://github.com/opencasestudies/ocs-bp-rural-and-urban-obesity
Bloomberg American Health Initiative: https://americanhealth.jhu.edu/open-case-studies

Disclaimer

The purpose of the Open Case Studies project is to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given dataset, and should not be used in the context of making policy decisions without external consultation from scientific experts.

License

This case study is part of the OpenCaseStudies project. This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0) United States License.

Citation

To cite this case study:

Wright, Carrie and Meng, Qier and Jager, Leah and Taub, Margaret and Hicks, Stephanie. (2020). https://github.com/opencasestudies/ocs-bp-rural-and-urban-obesity. Exploring global patterns of obesity across rural and urban regions (Version v1.0.0).

Acknowledgments

We would like to acknowledge Jessica Fanzo for assisting in framing the major direction of the case study.

We would like to acknowledge Michael Breshock for his contributions to this case study and developing the OCSdata package.

We would also like to acknowledge the Bloomberg American Health Initiative for funding this work.

Reading Metrics

The total reading time for this case study was calculated with koRpus: About 70 minutes

The Flesch-Kincaid Readability Index was also calculated with koRpus: Grade 9, Age 14

Title

Exploring Global Patterns of Obesity from 1985 to 2017

Motivation

Body Mass Index (BMI) is often used as a proxy for adiposity, with BMI-based classifications defining “underweight”, “normal”, “overweight”, and “obese”. Higher BMI has been associated with increased mortality and higher rates of type 2 diabetes, cancer, heart disease, and stroke. A recent paper showed that, contrary to the widely reported view that urbanization is one of the most important drivers of the global rise in obesity, BMI is increasing at the same rate or faster in rural areas than in cities, particularly in low- and middle-income regions. There is also a gender discrepancy: women have a higher BMI in rural communities.

Here, we explore this data to understand global patterns in obesity. This analysis is important because it may indicate the need to provide better access (financial and physical access) to healthy foods in rural communities, especially in low-income countries, to address the obesity crisis.

Motivating questions

Is there a difference between rural and urban BMI estimates around the world? In particular, what does this difference look like for women?
How have BMI estimates changed from 1985 to 2017? In particular, what does this change over time look like for women?
How do different countries compare for BMI estimates? In particular, how does the United States compare to the rest of the world?

Data

The data used in this analysis comes from a supplementary table for the following article:

NCD Risk Factor Collaboration (NCD-RisC). Rising rural body-mass index is the main driver of the global obesity epidemic in adults. Nature 569, 260–264 (2019).

This article is freely available online.

While gender and sex are not actually binary, the data used in this analysis contain only groups of individuals described as men or women.

Learning Objectives

The skills, methods, and concepts that students will be familiar with by the end of this case study are:

Data science Learning Objectives:

Importing data from a PDF (pdftools)
Subsetting and filtering data (dplyr)
Working with character strings (stringr)
Reshaping data into different formats (tidyr)
Applying functions to all columns of a tibble (purrr)
Creating data visualizations (ggplot2) with labels (ggrepel)
Combining multiple plots (cowplot and patchwork)

Statistical Learning Objectives:

Familiarity with the use of Quantile-Quantile plots to assess normality
Define and understand the utility of alpha and the p value
Describe the difference between nonparametric and parametric tests
Be able to identify paired data
Implementation of a paired t-test
Interpretation of a paired t-test
Implementation of a Wilcoxon signed-rank test
Interpretation of a Wilcoxon signed-rank test
Understanding of the need for multiple testing correction

Analysis

In this case study, we will largely focus on methods for comparing two groups using parametric and nonparametric hypothesis tests. We also cover multiple testing correction and fairly advanced data visualization methods using ggplot2.

Data import

Data is imported from a PDF using pdftools to obtain data from a large table.

Data wrangling

This case study covers many wrangling techniques and largely involves using the package stringr.

Dividing data into separate lines
Removing excess white-space
Removing redundant header information
Correcting spacing issues
Dealing with NA values that are labeled in an unusual manner
Splitting the data into columns using a delimiter
Changing variable names
Sorting the data
Converting to long format
Separating a column into multiple columns
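The wrangling steps listed above can be sketched on a small scale as follows. This is only an illustration, not the case study's actual code: the text in `page_text` is invented to imitate what `pdftools::pdf_text()` might return for one page, and the country names, values, column names, and the `NA*` label are all assumptions.

```r
library(stringr)
library(dplyr)
library(tidyr)

# Invented text standing in for one page of pdftools::pdf_text() output.
page_text <- paste0(
  "Country        Sex      Rural BMI 1985   Urban BMI 1985\n",
  "  Samplestan   Men      22.0   21.7  \n",
  "  Exampleland  Women    24.5   NA*\n",
  "  Exampleland  Men      24.1   23.9\n"
)

lines <- str_split(page_text, "\n")[[1]]        # divide data into separate lines
lines <- str_trim(lines)                         # remove excess white-space
lines <- lines[lines != ""]                      # drop empty lines
lines <- lines[!str_detect(lines, "^Country")]   # remove redundant header rows
lines <- str_replace_all(lines, "NA\\*", "NA")   # fix an unusually labeled NA

# Split into columns, using runs of two or more spaces as the delimiter.
fields <- str_split(lines, "\\s{2,}", simplify = TRUE)
colnames(fields) <- c("Country", "Sex", "Rural_1985", "Urban_1985")  # new names

wide <- as_tibble(fields) %>%
  mutate(across(c(Rural_1985, Urban_1985), ~ as.numeric(na_if(.x, "NA")))) %>%
  arrange(Country, Sex)                          # sort the data

long <- wide %>%
  pivot_longer(cols = c(Rural_1985, Urban_1985),
               names_to = "Region_Year", values_to = "BMI") %>%  # long format
  separate(Region_Year, into = c("Region", "Year"), sep = "_")   # one column to two
```

Splitting on two or more spaces rather than on single spaces is a choice made for this toy input so that multi-word values would stay intact; the real table may need a different delimiter.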

Data exploration

To explore the data we use the summarize() function as well as plots to look at the distribution of the data. Quantile-Quantile plots are used to evaluate the distribution and compare it to the theoretical normal distribution.
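As a rough sketch of this kind of exploration, the example below summarizes simulated BMI values by region and draws a Q-Q plot against the normal distribution. The data are randomly generated for illustration, and the column names `Region` and `BMI` are assumptions rather than the names used in the case study.

```r
library(dplyr)
library(ggplot2)

set.seed(123)
# Simulated values only; these are not estimates from the NCD-RisC data.
bmi <- data.frame(
  Region = rep(c("Rural", "Urban"), each = 100),
  BMI = c(rnorm(100, mean = 24, sd = 2), rnorm(100, mean = 25, sd = 2))
)

# Summary statistics for each group.
bmi_summary <- bmi %>%
  group_by(Region) %>%
  summarize(n = n(), mean_BMI = mean(BMI), sd_BMI = sd(BMI))

# Q-Q plot: points close to the reference line suggest approximate normality.
qq_plot <- ggplot(bmi, aes(sample = BMI)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~Region) +
  labs(x = "Theoretical normal quantiles", y = "Sample quantiles")
```

Printing `bmi_summary` shows the per-group statistics, and printing `qq_plot` draws one panel per region.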

Statistical concepts

This case study covers fundamental concepts in statistics such as type 1 error, alpha threshold, p-values, hypothesis testing, parametric two sample mean tests, and nonparametric two sample tests, as well as the assumptions of the various included statistical tests and what to do when data is paired.
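To make these ideas concrete, here is a minimal sketch using base R on simulated paired data. Each hypothetical country contributes one rural and one urban value, which is what makes the observations paired. The numbers, the two groups (women and men), and the choice of alpha = 0.05 are assumptions for illustration only.

```r
set.seed(42)
n <- 30  # hypothetical number of countries

# Simulated paired values: urban is built from rural for the same country.
rural_women <- rnorm(n, mean = 25, sd = 2)
urban_women <- rural_women + rnorm(n, mean = 0.5, sd = 0.5)
rural_men   <- rnorm(n, mean = 24, sd = 2)
urban_men   <- rural_men + rnorm(n, mean = 0.1, sd = 0.5)

# Parametric test: paired t-test on the within-country differences.
t_women <- t.test(rural_women, urban_women, paired = TRUE)
t_men   <- t.test(rural_men, urban_men, paired = TRUE)

# Nonparametric alternative: Wilcoxon signed-rank test.
w_women <- wilcox.test(rural_women, urban_women, paired = TRUE)

# Two t-tests were run, so adjust their p-values (Bonferroni) before
# comparing against the alpha threshold.
alpha  <- 0.05
p_raw  <- c(women = t_women$p.value, men = t_men$p.value)
p_bonf <- p.adjust(p_raw, method = "bonferroni")
reject <- p_bonf < alpha
```

The Bonferroni adjustment multiplies each p-value by the number of tests (capped at 1), which keeps the chance of at least one type 1 error across all tests at or below alpha.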

Other notes and resources

BMI
Long and Wide Data Formats
Distributions
Normal Distribution
Skewed Distributions
Bimodal Distribution
ggplot2
Q-Q Plots
Student t-test
Paired Data
Welch’s t-test
Parametric and Nonparametric Methods
Variance
Balanced Study Design
Independent Observations
Transformation
Permutation/Resampling Methods
Central Limit Theorem
Mood’s Two-Sample Scale Test
Wilcoxon Signed Rank Test
Wilcoxon Rank Sum Test
Two-sample Kolmogorov-Smirnov Test
Type 1 Error
p-value
Multiple Testing
Bonferroni Method of Multiple Testing Correction

Packages used in this case study:

Package	Use in this case study
here	to easily load and save data with relative paths
pdftools	to read text from a PDF into R
stringr	to manipulate the text data
readr	to split the text data from the PDF into individual lines
dplyr	to arrange/filter/select subsets of the data
tibble	to create data objects that we can manipulate with `dplyr`/`stringr`/`tidyr`/`purrr`
magrittr	to use the `%<>%` piping operator
glue	to paste or combine character strings and data together
purrr	to perform functions on all columns of a tibble
tidyr	to convert data from ‘wide’ to ‘long’ format
ggplot2	to make visualizations with multiple layers
ggrepel	to keep labels in figures from overlapping
cowplot and patchwork	to allow plots to be combined

For users

There is a Makefile in this folder. Typing make knits the case study in index.Rmd to index.html and also knits README.Rmd to a markdown file (README.md).

For instructors

Our goal is for instructors to use this case study as the starting point for a set of lectures. We provide one R Markdown file (index.Rmd) for an instructor to use. However, we anticipate the instructor may either break this file up into smaller R Markdown files for multiple lectures or extract only a portion of the material (e.g. the Data Wrangling or Data Analysis sections) to use in the classroom. With the latter goal in mind, we save a Wrangled_data.rda object at the end of the Data Wrangling section, which is loaded at the start of the Data Exploration section.
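The save-and-reload pattern described here might look like the sketch below. The object name `wrangled`, its contents, and the use of a temporary directory are assumptions for illustration; the case study itself may store the file elsewhere.

```r
# Hypothetical wrangled data; the values are invented.
wrangled <- data.frame(
  Country = c("Exampleland", "Samplestan"),
  BMI = c(24.1, 22.0)
)

# End of the Data Wrangling section: save the object to an .rda file.
rda_path <- file.path(tempdir(), "Wrangled_data.rda")
save(wrangled, file = rda_path)

# Start of the Data Exploration section: restore it in a clean session.
rm(wrangled)
load(rda_path)  # recreates an object named `wrangled`
```

Because `load()` restores objects under their original names, a lecture that starts at the Data Exploration section can rely on the same object names as the wrangling code.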

Target audience

This case study is designed for undergraduate students who have not taken a statistics course. While we do not discuss the theoretical aspects of the statistics concepts used in this case study, the case study discusses the motivation behind them.

Suggested homework

Students can repeat a similar analysis, but evaluate the change in BMI over time using the global data available for each year between 1985 and 2017.

Estimate of RMarkdown Compilation Time:

About 31 to 41 seconds

This compilation time was measured on a PC running Windows 10. Treat this range only as an estimate, as compilation time will vary across machines and operating systems.