In this lab, we'll look at a dataset which contains information on World Cup matches. Let's use the Pandas commands learned in the previous lesson to learn more about our data!
You will be able to:
- Use pandas methods and attributes to access information about a dataset
- Index pandas dataframes with .loc, .iloc, and column names
- Use a boolean mask to index pandas series and dataframes
Load the file 'WorldCupMatches.csv'
as a DataFrame in Pandas.
# Import pandas using the standard alias
# Import 'WorldCupMatches.csv' as a DataFrame
df = None
Use the correct method to look at the first 7 rows of the dataset.
# Print the first 7 rows of df
Look at the last 3 rows of the data set.
# Print the last 3 rows of df
Get a concise summary of the data using .info()
.
# Print a concise summary of df
Obtain a tuple representing the number of rows and number of columns
# Print the number of rows and columns in df
Use the appropriate attribute to get the column names
# Print the column names of df
When looking at the DataFrame's .head()
, you might have noticed that the games are structured chronologically in the DataFrame.
Use the right selection method to print all the information from the 3rd to the 5th game.
# Print rows 3 through 5
Now, print all the info from game 5-9, but we're only interested in printing out the, "Home Team Name" and the, "Away Team Name."
# Print rows 5 through 9 and columns 'Home Team Name' and 'Away Team Name'
Next, we'd like the information on all the games played in Group 3 for the 1950 World Cup.
# Print all info for games played in 1950 for Group 3
Let's repeat the command above, but now we only want to print out the attendance column for the Group 3 games.
You can combine conditions like this:
df[(condition1) | (condition2)]
-> Returns rows where either condition is true
df[(condition1) & (condition2)]
-> Returns rows where both conditions are true
# Print the 'Attendance' column for games played in 1950 for Group 3
Throughout the entire history of the World Cup, how many home games were played by the Netherlands?
# Number of home games played by the Netherlands
How many games were played by the Netherlands in total?
# Number of games played by the Netherlands in total
Next, let's try and figure out how many games the USA played in the 2014 World Cup.
# Number of games the USA played in the 2014 world cup
Now, let's try to find out how many countries participated in the 1986 World Cup.
Hint 1: as a first step, create a new dataset that only contains games in that year.
Hint 2: You can use .unique()
to make sure you don't end up with duplicate country names.
# Number of countries participated in the 1986 world cup
In World Cup history, how matches had more than 5 goals in total?
# Number of matches that had more than 5 goals in total
With the information you currently have in your df
, create a new column, "Half-time Goals."
# Create a new column 'Half-time Goals' in df
Run the code below. You'll notice that for Korea, there are records for both North-Korea (Korea DPR) and South-Korea (Korea Republic).
# Print all records containing the string 'Korea'
df.loc[df['Home Team Name'].str.contains('Korea'), 'Home Team Name']
Imagine that, for some reason, we simply want Korea listed as one entry, so we want to replace every "Home Team Name" and "Away Team Name" entry that contains "Korea" to simply "Korea". In the same way, we want to change the columns "Home Team Initials" and "Away Team Initials" to NSK (North & South Korea) instead of "KOR" and "PRK".
# Update the 'Home Team Name' and 'Home Team Initials' columns
Make sure to verify your answer!
# Check the updated columns
In this lab, you learned how to access data within Pandas!