/NHANES-Analytics

This repository contains cone for analysis of the NHANES dataset. Specifically, it contains code which will examine the unique food items in the NHANES dietary data. The food items are clustered based on nutrient similarities into new food groups. These food groups represent the result of a data-driven approach of developing food groups for use in dietary analysis studies.

Primary LanguagePythonOtherNOASSERTION

NHANES-Analytics

This repository contains code related to NHANES dietary data analysis for creating nutrient-driven food groups, as documented in:

M. Wyatt, T. Johnston, M. Papas, and M. Taufer. Development of a Scalable Method for Creating Food Groups Using the NHANES Dataset and MapReduce. In Proceedings of the ACM Bioinformatics and Computational Biology Conference (BCB), pp. 1 – 10. Seattle, WA, USA. October 2 – 4, 2016.

Requirements

  • Python 2
  • Apache Spark / Pyspark
  • numpy
  • scipy

Contents

The analysis is split into 2 parts:

Preprocessing ./src/preprocess.py contains the PySpark script for preprocessing the NHANES dietary data

Clustering ./src/cluster.py contains the PySpark script for clustering the preprocessed data

Usage

To run the code, you will need Apache Spark installed. You can run the code with the bash script located at ./src/run.sh. This will create several new files and directories in the ./data/ directory.

The data is saved with the spark command saveAsPickleFile and can be loaded with the spark command pickleFile. For example, to load the processed data into a pyspark session, do: sc.pickleFile("./data/processed").

Data

1 Year of NHANES data is in ./data/raw. This data and more years of NHANES data can be downloaded from http://www.cdc.gov/nchs/nhanes/nhanes_questionnaires.htm

The file ./data/features.txt contains a list of features which are to be extracted from the NHANES dietary data.

Additionally, a script to download all NHANES data is included at ./src/get_data.py. To run this, uncomment line 7 from ./src/run.sh. Or visit the main (and most up-to-date) repo for this script at https://github.com/mrwyattii/NHANES-Downloader