This dataset is a transformed version of the USDA Food Composition Database. The USDA database is the major source of food composition data in the United States and provides the foundation for most food composition databases in the public and private sectors. It provides detailed composition data for components such as calories, proteins, and vitamins. The transformed database contains three tables: composition, units, and conversion.
The composition table is formatted as a series of nutritional measurements:
item_id: The unique identifier for each item
desc: The name of the item being measured
component: The nutritional component being measured
value: The value of the measurement
The measurements are in different units, depending on the component being measured. The units table maps each component to its unit of measurement. The conversion table maps each unit of measurement to a decimal conversion factor that converts values in that unit to grams.
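To make the relationship between the three tables concrete, here is a minimal sketch that joins them to express every measurement in grams. It runs against an in-memory stand-in rather than usda.db. The sample rows are made up, and the column names of the units and conversion tables (component, unit, factor) are assumptions; only the composition columns are documented above. Note that desc is a reserved word in SQL, so it is quoted.

```python
import sqlite3

# In-memory stand-in for usda.db. The rows are invented, and the column
# names of units/conversion are assumptions made for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE composition (item_id INTEGER, "desc" TEXT, component TEXT, value REAL);
    CREATE TABLE units (component TEXT, unit TEXT);
    CREATE TABLE conversion (unit TEXT, factor REAL);

    INSERT INTO composition VALUES
        (1, 'Example item', 'protein', 2.0),
        (1, 'Example item', 'vitamin_c', 500.0);
    INSERT INTO units VALUES ('protein', 'g'), ('vitamin_c', 'mg');
    INSERT INTO conversion VALUES ('g', 1.0), ('mg', 0.001);
""")

# composition -> units gives the unit of each measurement;
# units -> conversion gives the factor that turns that unit into grams.
query = """
    SELECT c.item_id, c."desc", c.component, c.value * v.factor AS grams
    FROM composition AS c
    JOIN units AS u ON u.component = c.component
    JOIN conversion AS v ON v.unit = u.unit
    ORDER BY c.component
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
conn.close()
```

The same query shape should carry over to the real database once the actual column names are substituted.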
Anaconda is a Python distribution that comes with a collection of helpful, pre-installed libraries for data analysis. If you don't already have Python installed on your machine, you can follow the instructions here. If you already have Python 3.4+ or Anaconda installed, make sure the following packages are installed and up to date: jupyter, pandas, sqlite3. (sqlite3 is part of the Python standard library, so it normally needs no separate installation.)
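One quick way to confirm the environment is ready is to ask Python whether each package can be found on the import path. This is only a sketch: the helper name missing_modules is hypothetical, and it assumes each package is importable under the name listed above.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Python", sys.version.split()[0])
missing = missing_modules(["jupyter", "pandas", "sqlite3"])
print("Missing:", ", ".join(missing) if missing else "none")
```

Anything reported as missing can then be installed with your usual package manager before continuing.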
Finally, clone this repo onto your computer.
Once you've cloned the repo, run the Python script dbgen.py, which will create and populate the SQLite database, usda.db, to be used for this analysis. If you accidentally delete or modify the tables in this database, you can re-run this script to create fresh data.
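After running the script, a short check like the following can confirm that the database was created. It is a sketch that assumes usda.db sits in the current working directory; list_tables is a hypothetical helper, not part of the repo.

```python
import os
import sqlite3


def list_tables(db_path):
    """Return the sorted table names found in the SQLite file at db_path."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


db_path = "usda.db"  # assumed location: the repo folder where dbgen.py was run
if os.path.exists(db_path):
    print(list_tables(db_path))
else:
    # Checked first because sqlite3.connect would create an empty file.
    print("usda.db not found; run dbgen.py first")
```

If the script ran successfully, the output should include the composition, conversion, and units tables described earlier.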
Launch a Jupyter Notebook server in the repo folder, and open the file called notebook.ipynb in Jupyter. This file will be used for logging your analysis and saving your results.
When you're finished, send us the files notebook.ipynb and solutions.json (the latter is generated by running the final cell in the notebook). Good luck!