/DATA115_PDP2

Final corned beef vs. pickles analysis

Primary LanguageJupyter Notebook

Corned Beef vs. Pickles (in progress)

Motivation

In graduate school, I had a friend from a fairly sheltered background who was convinced that corned beef and pickles could only be eaten together. That is, no one would ever eat either separately. This was shocking to me, since I didn't think I'd ever had them together but did enjoy them both on their own. When the initial pronouncement was handed down, we spent half of the night calling people all over the country trying to discover what collection of backgrounds would lead you to this particular culinary compulsion. As you might expect, the surveying process of "call who ever is currently saved in your phone" did not return particularly generalizable results.

This issue then sat on the back burner for years until while helping a student in DATA115 track down a dataset about food consumption by age, I stumbled across the NHANES survey data collected by the CDC, which seemed to provide a potential way forward...

Data Sources

Every year, the National Center for Health Statistics performs a large survey of Americans health and consumption habits called the NHANES. They release anonymized responses to the large collection of questions and measurements that the gather data on. For this project I used the data from 2005. Specifcally, the Demographic Variables and the raw reponses to the Food Frequency Questionaire. This raw data is presented above.

Processing Steps

A significant amount of processing was necessary to actually analyze this dataset. To begin with, the raw data provided by the CDC is in a proprietary format for the SAS language, so the first step was to use the SAS viewer to export the data to a .csv to start the cleaning process. Then, I merged the demographic data with the relevant food consumption columns. In both cases, the given column labels were confusing so I replaced them with interpretable ones. There was also a great deal of excess data that I didn't need for the analysis, so I removed those columns from the dataset. There was demographic information on over 10k subjects but only about 6k filled out the food questionnaire, so I removed the non-participant rows from the data as well.

Each column value was coded according to a chart given in the underlying data source. For example, the values in the Pickles? column represented the following pickle eating frequencies.

Data ValueMeaning
1 Never
2 1-6 times per year
3 7-11 times per year
4 1 time per month
5 2-3 times per month
6 1 time per week
7 2 times per week
8 3-4 times per week
9 5-6 times per week
10 1 time per day
11 2 or more times per day
88 Blank
99 Error

Similar tables can be found at this link for the demographic columns.

Visualization

To begin with, I made a scatterplot of the various food frequency scores for the two food categories to see if there really were people who liked to eat both equally. This result shows that almost every possible category is filled, falsifying my classmate's hypothesis.

This plot obscures the number of participants who fall into each category (scatterplots are usually a poor choice for categorical data!) so we might get a better sense of the dataset using a heatmap of the contingency table, a stacked bar chart, or just looking at the actual values:

x 1 2 3 4 5 6 7 8 9 10 11 88 99
1 843 212 87 96 160 80 82 40 11 16 3 26 4
2 301 266 107 75 101 47 40 21 5 7 1 10 1
3 142 144 75 43 51 22 22 9 4 0 0 9 0
4 167 94 56 78 67 32 35 10 3 0 1 6 1
5 203 133 83 107 151 63 42 35 8 2 1 8 0
6 108 76 35 51 79 53 38 16 9 4 1 3 0
7 100 46 38 50 76 44 53 24 7 4 2 3 0
8 52 29 25 24 42 32 23 22 4 1 2 1 0
9 24 7 12 5 13 7 11 9 6 3 1 2 0
10 17 3 1 5 3 11 6 5 1 3 1 1 0
11 9 2 2 1 0 1 2 1 1 0 1 1 0
88 17 7 11 8 12 4 3 3 3 1 0 18 4
99 6 0 0 1 2 0 2 0 0 0 0 2 5

Analysis

Having answered the initial question, I started to wonder if there were demographic features that distinguished the groups who either hated or liked both types of food. I began by separating the data into two sets: one for people who never ate either product and one for people who ate both at least once per week. The plots below show the histograms of ages of people in these two categories:

Descriptions of Code and Materials

The raw data downloaded from the sources described above are uploaded in .csv form as 2005_food and 2005_demo. The processed data with better column names and only the relevant rows and columns is uploaded as processed_data and the notebook used to do the processing and make the plots is uploaded as cornedbeefpickles.ipynb.