Pivot Tables with Pandas - Lab

Introduction

In this lab, we'll learn how to make use of our newfound knowledge of pivot tables to work with real-world data.

Objectives

In this lab you will:

  • Describe what is meant by long and wide format data
  • Use multi-hierarchical indexing to access aggregated data
  • Use pivot to create a more organized aggregated DataFrame
  • Use stack and unstack to move between different level of multi-indexing

Getting Started

In the cell below:

  • Import pandas and set the standard alias
  • Import matplotlib.pyplot and set the standard alias
  • Run the iPython magic command to display matplotlib graphs inline within the notebook

Load the data

The data for this activity is stored in a file called 'causes_of_death.tsv' which is a somewhat morbid dataset from the center for disease control. Note that the file extension .tsv indicates that this data is formatted slightly differently then the standard .csv, the difference being that it has 'tab separated values' instead of 'comma separated values'. As such, pass in the optional parameter delimiter='\t' into the pd.read_csv() function.

df = None

Now, display the head of the DataFrame to ensure everything loaded correctly.

Our data is currently in Wide format. We can tidy this up by converting it to Long format by using groupby statements to aggregate our data into a much neater, more readable format.

Groupby aggregations

Complete the following groupby statements.

  • Groupby State and Gender. Sum the values.
# Your code here
  • Groupby State, Gender, and Race. Find the average values.
# Your code here
  • Groupby Gender and Race. Find the minimum values.
# Your code here

Create a bar chart of the total number of deaths by state:

  • Sort your columns in order (ascending or descending are both acceptable).
  • Also make sure to include a title, axes labels and have your graph be an appropriate size.

NOTE: In order to do this, slice the Deaths column after the .groupby() method, but before the .sum() method. You can even chain the .plot() method on after the .sum() method and do this all on one line, excluding the labeling of the graph!

# Your code here

Inspecting our data

Let's go one step further and print the data type of each column.

In the cell below, use the .info() method of the DataFrame, and note the data type that each column is currently stored as.

Let's look at some samples from the Population column to see if the current encoding seems appropriate for the data it contains.

In the cell below, display the population values for the first 5 rows in the DataFrame.

Just to be extra sure, let's check the value counts to see how many times each unique value shows up in the dataset. We'll only look at the top 5.

In the cell below, print out the top 5 value_counts() of the population column of the DataFrame.

Clearly, this data should be stored as a numeric type, not a categorical type.

Reformat the Population column as an integer

As it stands, not all values can be reformated as integers. Most of the cells in the Population column contain integer values, but the entire column is currently encoded in string format because some cells contain the string 'Not Applicable'.

We need to remove these rows before we can cast the Population column to an integer data type.

In the cell below:

  • Slice the rows of df where the Population column is equal to 'Not Applicable'
  • Use to_drop.index to drop the offending rows from df. Be sure to set the axis=0, and inplace=True
  • Cast the Population column to an integer data type using the .astype() method, with the single parameter int64 passed in
  • Print the Population column's dtype attribute to confirm it is now stored in int64 format

NOTE: .astype() returns a copy of the column, so make sure you set the Population column equal to what this method returns--don't just call it!

# Your code here
to_drop = None

Complete the bar chart

Now that we've reformatted our data, let's create a bar chart of the mean Population by State.

# Your code here

Below we will investigate how we can combine the .pivot() method along with the .groupby() method to combine some cool stacked bar charts!

Use aggregate methods

In the cell below:

  • Group df by 'State' and 'Gender', and then slice both 'Deaths' and 'Population' from it. Chain the .agg() method to return the mean, min, max, and standard deviation of these sliced columns.

NOTE: This only requires one line of code.

By now, you've probably caught on that the code required to do this follows this pattern: ([things to group by])[columns to slice].agg([aggregates to return])

Then, display the .head() of this new DataFrame.

# Your code here
grouped = None

Note how Pandas denotes a multi-hierarchical index in the DataFrame above.

Let's inspect how a multi-hierarchical index is actually stored.

In the cell below, display the index attribute of this DataFrame.

A two-dimensional array denotes the multiple levels, with each possible combination being a row in our grouped DataFrame.

Let's reset the index, and then see how it changes.

In the cell below, call the DataFrame's .reset_index() method. Then, display the .head() of the DataFrame.

# First, reset the index. Notice the subtle difference; State and Gender are now columns rather than the index 
grouped = None

Note how the way index is displayed has changed. The index columns that made up the multi-hierarchical index before are now stored as columns of data, with each row given a more traditional numerical index.

Let's confirm this by reexamining the index attribute of grouped in the cell below.

However, look again at the displayed DataFrame -- specifically, the columns. Resetting the index has caused the DataFrame to use a multi-indexed structure for the columns.

In the cell below, examine the columns attribute of grouped to confirm this.

# Notice that this causes columns to be MultiIndexed!

Column levels

Since we're working with multi-hierarchical indices, we can examine the indices available at each level.

In the cell below, use the .get_level_values() method contained within the DataFrame's columns attribute to get the values for the outermost layer of the index.

Now, get the level values for the inner layer of the index.

Flattening the DataFrame

We can also flatten the DataFrame from a multi-hierarchical index to a more traditional one-dimensional index. We do this by creating each unique combination possible of every level of the multi-hierarchical index. Since this is a complex task, you do not need to write it -- but take some time to examine the code in the cell below and see if you can understand how it works!

# We could also flatten these:
cols0 = grouped.columns.get_level_values(0)
cols1 = grouped.columns.get_level_values(1)
grouped.columns = [col0 + '_' + col1 if col1 != '' else col0 for col0, col1 in list(zip(cols0, cols1))]
# The list comprehension above is more complicated then what we need but creates a nicer formatting and
# demonstrates using a conditional within a list comprehension.
# This simpler version works but has some tail underscores where col1 is blank:
# grouped.columns = [col0 + '_' + col1 for col0, col1 in list(zip(cols0, cols1))]
grouped.columns

Now that we've flattened the DataFrame, let's inspect a couple rows to see what it looks like.

In the cell below, inspect the .head() of the grouped DataFrame.

Using pivots

Now, we'll gain some practice using the DataFrame's built-in .pivot() method.

In the cell below, call the DataFrame's .pivot() method with the following parameters:

  • index = 'State'
  • columns = 'Gender'
  • values = 'Deaths_mean'

Then, display the .head() of our new pivot DataFrame to see what it looks like.

# Now it's time to pivot!
pivot = None

Great! We've just created a pivot table.

Let's reset the index and see how it changes our pivot table.

In the cell below, reset the index of the pivot object as we did previously. Then, display the .head() of the object to see if we can detect any changes.

# Again, notice the subtle difference of resetting the index:
pivot = None

Visualizing Data With Pivot Tables

Now, we'll make use of our newly created pivot table to quickly create some visualizations of our data.

In the cell below, call pivot.plot() with the following parameters:

  • kind = 'barh'
  • figsize = (15,8)
# Now let's make a sweet bar chart!!

Notice the Y-axis is currently just a list of numbers. That's because when we reset the index, it defaulted to assigning integers as the index for the DataFrame. Let's set the index back to 'State', and then recreate the visualization.

In the cell below:

  • Use the pivot object's .set_index() method and set the index to 'State'. Then, chain this with a .plot() call to recreate the visualization using the code we used in the cell above.

All the code in this cell should be done in a single line. Just call the methods -- do not rebind pivot to be equal to this line of code.

# Where's the states?! Notice the y-axis is just a list of numbers. 
# This is populated by the DataFrame's index.
# When we used the .reset_index() method, we created a new numbered index to name each row. 
# Let's fix that by making state the index again.

Now that we've created a visualization with the states as the y-axis, let's print out the head of the pivot object again.

# Also notice that if we call the DataFrame pivot again, state is not it's index.
# The above method returned a DataFrame with State as index and we plotted it,
# but it did not update the DataFrame itself.

Note that the index has not changed. That's because the code we wrote when we set the index to the 'State' column returns a copy of the DataFrame object with the index set to 'State' -- by default, it does not mutate original pivot object.

If we want to do that, we'll need to capture the new object returned by updating the contents of the pivot variable.

In the cell below, set the index of pivot to 'State'. Then, recreate the bar plot using this new object.

# If we wanted to more permanently change the index we would set it first and then plot:
pivot = None

Again, let's check the .head() of the DataFrame to confirm that the index structure has changed.

Finally, let's stack these bar charts to see how that looks.

In the cell below, recreate the visualization we did in the cell above, but this time, also pass in stacked=True as a parameter.

# Lastly, let's stack each of these bars for each state.
# Notice we don't have to worry about index here, because we've already set it above.

Stacking and Unstacking DataFrames

Now, let's get some practice stacking and unstacking DataFrames.

Stacking

In the cell below, let's display the head of grouped to remind ourselves of the format we left it in.

As we can see above, grouped is currently in a flattened format, with no hierarchical structure to it's indices.

In the cell below, call the grouped DataFrame's .stack() method.

As we can see, the .stack() method has stacked our DataFrame from a flattened format into one with a multi-hierarchical index! This is an easy, quick way to aggregate our data.

Unstacking

Now, we'll explore unstacking with the pivot DataFrame, which is already stacked into a pivot table.

In the cell below, set unstack pivot using the object's .unstack() method. Then, display the object to see how it has changed.

pivot = None

Note that it has unstacked the multi-hierarchical structure of the pivot DataFrame by one level. Let's call it one more time and display the results!

In the cell below, set pivot equal to pivot.unstack() again, and then print the pivot object to see how things have changed.

pivot = None

After calling unstack a second time, we can see that pivot has a flattened structure since it has been completely unstacked!

Summary

In this lab, we learned how to:

  • Use .groupby() to stack and slice data conditionally
  • Use aggregate methods in combination with groupby statements
  • Create pivot tables with pandas
  • Leverage pivot tables and groupby statements to create quick visualizations
  • stack and unstack DataFrames