In this lab, let's get some hands-on practice with data cleanup using Pandas.
You will be able to:
- Manipulate columns in DataFrames (df.rename, df.drop)
- Manipulate the index in DataFrames (df.reindex, df.drop, df.rename)
- Manipulate column datatypes
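As a quick warm-up before loading the real data, here is a minimal sketch of the methods named in the objectives, run on a tiny made-up DataFrame. The column names and values are hypothetical, and `set_index` is used only as a convenient way to get a labeled index to reorder.

```python
import pandas as pd

# Toy data; column names and values are made up for illustration.
toy = pd.DataFrame(
    {"Station": ["59 ST", "5 AV"], "Entries": ["100", "250"], "Unused": [0, 0]}
)

# Columns: rename one, drop another (both calls return new DataFrames).
toy = toy.rename(columns={"Station": "station"})
toy = toy.drop(columns=["Unused"])

# Datatypes: convert the string counts to integers.
toy["Entries"] = toy["Entries"].astype(int)

# Index: promote a column to the index, then reorder rows by label.
toy = toy.set_index("station")
toy = toy.reindex(["5 AV", "59 ST"])

print(toy)
print(toy.dtypes)
```

Each of these calls returns a new object rather than modifying `toy` in place, which is why the result is reassigned every time.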
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv('turnstile_180901.txt')
print(len(df))
df.head()
197625
| | C/A | UNIT | SCP | STATION | LINENAME | DIVISION | DATE | TIME | DESC | ENTRIES | EXITS |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 08/25/2018 | 00:00:00 | REGULAR | 6736067 | 2283184 |
1 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 08/25/2018 | 04:00:00 | REGULAR | 6736087 | 2283188 |
2 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 08/25/2018 | 08:00:00 | REGULAR | 6736105 | 2283229 |
3 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 08/25/2018 | 12:00:00 | REGULAR | 6736180 | 2283314 |
4 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 08/25/2018 | 16:00:00 | REGULAR | 6736349 | 2283384 |
#Your code here
#Your code here
# Your code here
Create another column, 'Num_Lines', that counts how many lines pass through a station. Then sort your DataFrame by this column in descending order.
Hint: According to the data dictionary, LINENAME represents all train lines that can be boarded at a given station. Normally lines are represented by one character. For example, LINENAME 456NQR represents trains 4, 5, 6, N, Q, and R.
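If you get stuck, here is one possible approach sketched on a toy DataFrame. It assumes the one-character-per-line convention from the hint holds for every row, so the length of the LINENAME string equals the number of lines. Only the first row comes from the preview above; "STATION B" and "STATION C" are hypothetical.

```python
import pandas as pd

# First row is taken from the preview; the other two are hypothetical.
toy = pd.DataFrame(
    {
        "STATION": ["59 ST", "STATION B", "STATION C"],
        "LINENAME": ["NQR456W", "NQRW", "1237ACENQRSW"],
    }
)

# Assumption: one character per line, so string length == number of lines.
toy["Num_Lines"] = toy["LINENAME"].str.len()

# Sort so the stations served by the most lines come first.
toy = toy.sort_values(by="Num_Lines", ascending=False)
print(toy)
```

If any LINENAME used multi-character line names, the length would overcount, so it is worth scanning the unique values before relying on this.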
# Your code here
def clean(col_name):
    cleaned = #Your code here; whatever you want to do to col_name. Hint: think back to str methods.
    return cleaned
# This is a list comprehension. It applies your clean function to every item in the list.
# We then reassign that to df.columns
# You shouldn't have to change anything here.
# Your function above should work appropriately here.
df.columns = [clean(col) for col in df.columns]
# Checking the output, we can see the results.
df.columns
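To see the pattern in isolation, here is a small sketch with a hypothetical `clean_example` function applied to made-up column names. Stripping whitespace is just one assumption about what "cleaning" could mean here; your own `clean` may do something different.

```python
import pandas as pd


def clean_example(col_name):
    # Hypothetical cleanup: remove leading and trailing whitespace.
    return col_name.strip()


# Made-up column names with stray whitespace, for illustration only.
toy = pd.DataFrame(columns=["C/A", " UNIT", "EXITS   "])

# Same list-comprehension pattern as above, applied to the toy columns.
toy.columns = [clean_example(col) for col in toy.columns]
print(list(toy.columns))
```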
# Your code here
#Your code here
What is misleading about the day of week and weekend/weekday charts you just plotted?
# Your answer here
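One thing worth checking while you think about this: the ENTRIES values in the preview at the top of the lab increase from row to row, which is consistent with a running counter rather than a per-interval count. The sketch below assumes that reading and uses the five previewed values; the lowercase column names and the `entries_per_interval` column are hypothetical.

```python
import pandas as pd

# The five ENTRIES readings from the preview, all for one turnstile.
toy = pd.DataFrame(
    {
        "scp": ["02-00-00"] * 5,
        "time": ["00:00:00", "04:00:00", "08:00:00", "12:00:00", "16:00:00"],
        "entries": [6736067, 6736087, 6736105, 6736180, 6736349],
    }
)

# Summing raw readings mostly reflects where the counter started.
raw_sum = toy["entries"].sum()

# Assumption: the counter is cumulative, so traffic is the change between readings.
toy["entries_per_interval"] = toy.groupby("scp")["entries"].diff()
interval_sum = toy["entries_per_interval"].sum()

print(raw_sum)
print(interval_sum)
```

Under this assumption, the two totals differ by several orders of magnitude, which hints at why a plot built from summed raw readings can mislead.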
# Your code here
Great! You practiced your data cleanup skills using Pandas.