In this lab, we'll learn how to make use of our newfound knowledge of pivot tables to work with real-world data.
In this lab you will:
- Describe what is meant by long and wide format data
- Use multi-hierarchical indexing to access aggregated data
- Use pivot to create a more organized aggregated DataFrame
- Use stack and unstack to move between different level of multi-indexing
In the cell below:
- Import
pandas
and set the standard alias - Import
matplotlib.pyplot
and set the standard alias - Run the iPython magic command to display matplotlib graphs inline within the notebook
The data for this activity is stored in a file called 'causes_of_death.tsv'
which is a somewhat morbid dataset from the center for disease control. Note that the file extension .tsv indicates that this data is formatted slightly differently then the standard .csv, the difference being that it has 'tab separated values' instead of 'comma separated values'. As such, pass in the optional parameter delimiter='\t'
into the pd.read_csv()
function.
df = None
Now, display the head of the DataFrame to ensure everything loaded correctly.
Our data is currently in Wide format. We can tidy this up by converting it to Long format by using groupby statements to aggregate our data into a much neater, more readable format.
Complete the following groupby statements.
- Groupby
State
andGender
. Sum the values.
# Your code here
- Groupby
State
,Gender
, andRace
. Find the average values.
# Your code here
- Groupby
Gender
andRace
. Find the minimum values.
# Your code here
Create a bar chart of the total number of deaths by state:
- Sort your columns in order (ascending or descending are both acceptable).
- Also make sure to include a title, axes labels and have your graph be an appropriate size.
NOTE: In order to do this, slice the Deaths
column after the .groupby()
method, but before the .sum()
method. You can even chain the .plot()
method on after the .sum()
method and do this all on one line, excluding the labeling of the graph!
# Your code here
Let's go one step further and print the data type of each column.
In the cell below, use the .info()
method of the DataFrame, and note the data type that each column is currently stored as.
Let's look at some samples from the Population
column to see if the current encoding seems appropriate for the data it contains.
In the cell below, display the population values for the first 5 rows in the DataFrame.
Just to be extra sure, let's check the value counts to see how many times each unique value shows up in the dataset. We'll only look at the top 5.
In the cell below, print out the top 5 value_counts()
of the population column of the DataFrame.
Clearly, this data should be stored as a numeric type, not a categorical type.
As it stands, not all values can be reformated as integers. Most of the cells in the Population
column contain integer values, but the entire column is currently encoded in string format because some cells contain the string 'Not Applicable'
.
We need to remove these rows before we can cast the Population
column to an integer data type.
In the cell below:
- Slice the rows of
df
where thePopulation
column is equal to'Not Applicable'
- Use
to_drop.index
to drop the offending rows fromdf
. Be sure to set theaxis=0
, andinplace=True
- Cast the
Population
column to an integer data type using the.astype()
method, with the single parameterint64
passed in - Print the
Population
column'sdtype
attribute to confirm it is now stored inint64
format
NOTE: .astype()
returns a copy of the column, so make sure you set the Population column equal to what this method returns--don't just call it!
# Your code here
to_drop = None
Now that we've reformatted our data, let's create a bar chart of the mean Population
by State
.
# Your code here
Below we will investigate how we can combine the .pivot()
method along with the .groupby()
method to combine some cool stacked bar charts!
In the cell below:
- Group
df
by'State'
and'Gender'
, and then slice both'Deaths'
and'Population'
from it. Chain the.agg()
method to return the mean, min, max, and standard deviation of these sliced columns.
NOTE: This only requires one line of code.
By now, you've probably caught on that the code required to do this follows this pattern: ([things to group by])[columns to slice].agg([aggregates to return])
Then, display the .head()
of this new DataFrame.
# Your code here
grouped = None
Note how Pandas denotes a multi-hierarchical index in the DataFrame above.
Let's inspect how a multi-hierarchical index is actually stored.
In the cell below, display the index
attribute of this DataFrame.
A two-dimensional array denotes the multiple levels, with each possible combination being a row in our grouped
DataFrame.
Let's reset the index, and then see how it changes.
In the cell below, call the DataFrame's .reset_index()
method. Then, display the .head()
of the DataFrame.
# First, reset the index. Notice the subtle difference; State and Gender are now columns rather than the index
grouped = None
Note how the way index is displayed has changed. The index columns that made up the multi-hierarchical index before are now stored as columns of data, with each row given a more traditional numerical index.
Let's confirm this by reexamining the index
attribute of grouped
in the cell below.
However, look again at the displayed DataFrame -- specifically, the columns. Resetting the index has caused the DataFrame to use a multi-indexed structure for the columns.
In the cell below, examine the columns
attribute of grouped
to confirm this.
# Notice that this causes columns to be MultiIndexed!
Since we're working with multi-hierarchical indices, we can examine the indices available at each level.
In the cell below, use the .get_level_values()
method contained within the DataFrame's columns
attribute to get the values for the outermost layer of the index.
Now, get the level values for the inner layer of the index.
We can also flatten the DataFrame from a multi-hierarchical index to a more traditional one-dimensional index. We do this by creating each unique combination possible of every level of the multi-hierarchical index. Since this is a complex task, you do not need to write it -- but take some time to examine the code in the cell below and see if you can understand how it works!
# We could also flatten these:
cols0 = grouped.columns.get_level_values(0)
cols1 = grouped.columns.get_level_values(1)
grouped.columns = [col0 + '_' + col1 if col1 != '' else col0 for col0, col1 in list(zip(cols0, cols1))]
# The list comprehension above is more complicated then what we need but creates a nicer formatting and
# demonstrates using a conditional within a list comprehension.
# This simpler version works but has some tail underscores where col1 is blank:
# grouped.columns = [col0 + '_' + col1 for col0, col1 in list(zip(cols0, cols1))]
grouped.columns
Now that we've flattened the DataFrame, let's inspect a couple rows to see what it looks like.
In the cell below, inspect the .head()
of the grouped
DataFrame.
Now, we'll gain some practice using the DataFrame's built-in .pivot()
method.
In the cell below, call the DataFrame's .pivot()
method with the following parameters:
- index =
'State'
- columns =
'Gender'
- values =
'Deaths_mean'
Then, display the .head()
of our new pivot
DataFrame to see what it looks like.
# Now it's time to pivot!
pivot = None
Great! We've just created a pivot table.
Let's reset the index and see how it changes our pivot table.
In the cell below, reset the index of the pivot
object as we did previously. Then, display the .head()
of the object to see if we can detect any changes.
# Again, notice the subtle difference of resetting the index:
pivot = None
Now, we'll make use of our newly created pivot table to quickly create some visualizations of our data.
In the cell below, call pivot.plot()
with the following parameters:
- kind =
'barh'
- figsize =
(15,8)
# Now let's make a sweet bar chart!!
Notice the Y-axis is currently just a list of numbers. That's because when we reset the index, it defaulted to assigning integers as the index for the DataFrame. Let's set the index back to 'State'
, and then recreate the visualization.
In the cell below:
- Use the
pivot
object's.set_index()
method and set the index to'State'
. Then, chain this with a.plot()
call to recreate the visualization using the code we used in the cell above.
All the code in this cell should be done in a single line. Just call the methods -- do not rebind pivot
to be equal to this line of code.
# Where's the states?! Notice the y-axis is just a list of numbers.
# This is populated by the DataFrame's index.
# When we used the .reset_index() method, we created a new numbered index to name each row.
# Let's fix that by making state the index again.
Now that we've created a visualization with the states as the y-axis, let's print out the head of the pivot
object again.
# Also notice that if we call the DataFrame pivot again, state is not it's index.
# The above method returned a DataFrame with State as index and we plotted it,
# but it did not update the DataFrame itself.
Note that the index has not changed. That's because the code we wrote when we set the index to the 'State'
column returns a copy of the DataFrame object with the index set to 'State'
-- by default, it does not mutate original pivot
object.
If we want to do that, we'll need to capture the new object returned by updating the contents of the pivot
variable.
In the cell below, set the index of pivot
to 'State'
. Then, recreate the bar plot using this new object.
# If we wanted to more permanently change the index we would set it first and then plot:
pivot = None
Again, let's check the .head()
of the DataFrame to confirm that the index structure has changed.
Finally, let's stack these bar charts to see how that looks.
In the cell below, recreate the visualization we did in the cell above, but this time, also pass in stacked=True
as a parameter.
# Lastly, let's stack each of these bars for each state.
# Notice we don't have to worry about index here, because we've already set it above.
Now, let's get some practice stacking and unstacking DataFrames.
In the cell below, let's display the head of grouped
to remind ourselves of the format we left it in.
As we can see above, grouped
is currently in a flattened format, with no hierarchical structure to it's indices.
In the cell below, call the grouped
DataFrame's .stack()
method.
As we can see, the .stack()
method has stacked our DataFrame from a flattened format into one with a multi-hierarchical index! This is an easy, quick way to aggregate our data.
Now, we'll explore unstacking with the pivot
DataFrame, which is already stacked into a pivot table.
In the cell below, set unstack pivot
using the object's .unstack()
method. Then, display the object to see how it has changed.
pivot = None
Note that it has unstacked the multi-hierarchical structure of the pivot
DataFrame by one level. Let's call it one more time and display the results!
In the cell below, set pivot
equal to pivot.unstack()
again, and then print the pivot
object to see how things have changed.
pivot = None
After calling unstack a second time, we can see that pivot
has a flattened structure since it has been completely unstacked!
In this lab, we learned how to:
- Use
.groupby()
to stack and slice data conditionally - Use aggregate methods in combination with groupby statements
- Create pivot tables with pandas
- Leverage pivot tables and groupby statements to create quick visualizations
stack
andunstack
DataFrames