Import
- pandas under the alias pd
- datetime under the alias dt
- matplotlib.pyplot under the alias plt

Run
- %matplotlib inline

Read in the csv FoodServiceData_23_0 in the data folder and assign it to the variable food.
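A minimal sketch of these setup steps, assuming the file lives at data/FoodServiceData_23_0.csv:

# Imports under the requested aliases
import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt
%matplotlib inline

# Read the csv (path assumed) into the variable food
food = pd.read_csv('data/FoodServiceData_23_0.csv')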
The first question to ask yourself of a dataset: "what is this dataset treating as an observation?"
Think of an "observation" as an "event" or a "subject". For example, an observation could be a:
- specific subject, like an individual person, with features about that person's characteristics or behaviors: medical data like blood pressure or test results, econ / sociological data like yearly income or crime rate of the neighborhood in which they live, behavioral data like what products they purchased
- aggregated subject, like in the Boston housing dataset, where each row was a suburb/town. Features can be aggregated statistics about things within the region - like crime rate or median house value - or they can be about the specific region itself, such as distance to Hahvahd Yahd
- event, where each row isn't tied to a specific identity but instead tied to a specific action that occurred. Often, these types of datasets will have a number of features that act as keys that distinguish events from each other, as well as features containing data about the event. For example, a store with multiple locations might have a dataset of "transactions", where the key features for each row are Store, Time and Transaction ID, with other features Item Purchased, Payment Method, Coupons Used, etc. Notice that the same type of data - purchasing items - can be organized as either features of a "person" or an "event".
Figuring out which "observation" makes a row is an important part of figuring out how to analyze a dataset.
Take a look at the first five rows. How does this dataset appear to be organized? What is an "observation"? What are the features?
#Your code here
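For example:

# First five rows of the dataset
food.head()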
Which columns have nulls in them?
#Your code here
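One quick check, for instance:

# Null count per column
food.isna().sum()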
First, use a method to drop a specific column. Then, for the other two, use a method that will drop all columns that are completely null.
Check that only those columns were dropped.
#Your code here
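A sketch, with 'SomeColumn' standing in for whichever column gets dropped explicitly (the real name depends on the dataset):

# Drop one named column (hypothetical name)
food = food.drop(columns=['SomeColumn'])
# Drop any columns that are entirely null
food = food.dropna(axis=1, how='all')
# Confirm which columns remain
food.columns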
Drop all rows with nulls for Score. Make sure you print out how many rows there are pre-drop, how many you dropped, and how many there are after dropping!
#Your code here
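One way to do the bookkeeping, assuming the column is named Score:

rows_before = len(food)
food = food.dropna(subset=['Score'])
rows_after = len(food)
print('Rows pre-drop:', rows_before)
print('Rows dropped:', rows_before - rows_after)
print('Rows post-drop:', rows_after)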
Do all the nulls of Score also have nulls for Grade? Vice versa?
#Your code here
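A sketch of the check (run it before dropping the null-Score rows, or on a fresh copy of the data):

score_null = food['Score'].isna()
grade_null = food['Grade'].isna()
# Score null -> Grade null?
print(grade_null[score_null].all())
# Grade null -> Score null?
print(score_null[grade_null].all())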
How does Grade map onto Score? Let's find the rows that have both Grade and Score values, group by Grade, and see the min, max and mean of Score for each Grade.
#Your code here
#Your code here
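One possible approach:

# Rows with both Grade and Score, then summary stats of Score per Grade
both = food.dropna(subset=['Grade', 'Score'])
both.groupby('Grade')['Score'].agg(['min', 'max', 'mean'])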
Let's familiarize ourselves with the levels of the categories for the features that are object types
#Your code here
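For example, a loop over the object-typed columns:

# Level counts for each object-typed feature
for col in food.select_dtypes(include='object').columns:
    print(col)
    print(food[col].value_counts())
    print('---')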
Test to see if they're identical
#Your code here
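The prompt doesn't name the columns being compared; as a sketch with hypothetical columns colA and colB:

# True only if the two columns match element-wise
food['colA'].equals(food['colB'])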
#Your code here
What type is it?
#Your code here
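Assuming the inspection-date column is named InspectionDate (hypothetical):

# Check the dtype of the date column
food['InspectionDate'].dtype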
Convert the column to a datetime object
#Your code here
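Continuing with the hypothetical InspectionDate name:

# Parse strings into datetimes
food['InspectionDate'] = pd.to_datetime(food['InspectionDate'])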
Create a column that shows the day of inspection
#Your code here
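"Day" could mean day of the week or day of the month; a sketch using day of the week:

# New column with the weekday name of each inspection
food['InspectionDay'] = food['InspectionDate'].dt.day_name()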
#Your code here
Graph the average inspection score by date.
- Give it a title "Average Inspection Score by Date"
- Label the axes "Date" and "Avg Inspection Score"
#Your code here
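A sketch of the plot, under the same column-name assumptions as above:

# Average score per date, plotted as a line
avg_by_date = food.groupby('InspectionDate')['Score'].mean()
plt.plot(avg_by_date.index, avg_by_date.values)
plt.title('Average Inspection Score by Date')
plt.xlabel('Date')
plt.ylabel('Avg Inspection Score')
plt.show()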
Re-set the scale of the y-axis so it starts at 75 and ends at 100. Re-graph.
#Your code here
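The same plot with the y-axis pinned to [75, 100]:

plt.plot(avg_by_date.index, avg_by_date.values)
plt.ylim(75, 100)
plt.title('Average Inspection Score by Date')
plt.xlabel('Date')
plt.ylabel('Avg Inspection Score')
plt.show()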
Let's see how Score breaks down by TypeDescription.
Create two columns: one whose value is the mean Score of that row's TypeDescription value, and one whose value is the std of Score.
- Groupby TypeDescription and calc the mean and std of Score
- Merge with food on the TypeDescription value
#Your code here
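One way to build and attach the two columns:

# Mean and std of Score per TypeDescription
stats = food.groupby('TypeDescription')['Score'].agg(['mean', 'std']).reset_index()
stats.columns = ['TypeDescription', 'TypeDescription_Mean', 'TypeDescription_Std']

# Merge the group stats back onto food
food = food.merge(stats, on='TypeDescription')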
Calculate a new column that's the difference between an inspection's Score and its TypeDescription_Mean, in units of TypeDescription_Std.
#Your code here
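For instance, with a hypothetical name for the new column:

# Distance from the group mean, in units of the group std
food['Score_Std_Units'] = (food['Score'] - food['TypeDescription_Mean']) / food['TypeDescription_Std']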
Find the values of EstablishmentName for the 20 inspections whose Score most exceeds its TypeDescription_Mean.
#Your code here
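One possible approach:

# 20 inspections with the largest positive deviation from their group mean
food.nlargest(20, 'Score_Std_Units')['EstablishmentName']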
# SQL Connection and Querying
import sqlite3
# Data manipulation
import pandas as pd
# API Connection
import requests
# Visualization
import matplotlib.pyplot as plt
Open a connection to chinook.db
# Your code here
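For example, assuming chinook.db sits in the working directory:

# Open the SQLite connection
conn = sqlite3.connect('chinook.db')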
Select all columns and rows from the genres table
# Your code here
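For example:

# Pull the whole genres table into a DataFrame
pd.read_sql('SELECT * FROM genres;', conn)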
- Select the City column from the customers table
- Select the Name column from the genres table, aliased as "Genre"
- Create a column that counts the number of purchases made from each city for Blues music
- Sort the results in descending order
- Return the top ten cities
# Your code here
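A sketch using the standard Chinook schema, joining customers to invoices, invoice_items, tracks, and genres:

blues_query = """
SELECT c.City,
       g.Name AS Genre,
       COUNT(*) AS PurchaseCount
FROM customers c
JOIN invoices i ON c.CustomerId = i.CustomerId
JOIN invoice_items ii ON i.InvoiceId = ii.InvoiceId
JOIN tracks t ON ii.TrackId = t.TrackId
JOIN genres g ON t.GenreId = g.GenreId
WHERE g.Name = 'Blues'
GROUP BY c.City
ORDER BY PurchaseCount DESC
LIMIT 10;
"""
pd.read_sql(blues_query, conn)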
- Select the FirstName column from the customers table
- Select the LastName column from the customers table
- Select the Email column from the customers table
- Create a new column that is the multiplication of the UnitPrice and Quantity columns from the invoice_items table. Alias this column as Total.
- Use GROUP BY to return the sum total for each customer
- Sort in descending order
- Return the top 20 highest spending customers
# Your code here
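A sketch under the same schema assumptions:

spend_query = """
SELECT c.FirstName,
       c.LastName,
       c.Email,
       SUM(ii.UnitPrice * ii.Quantity) AS Total
FROM customers c
JOIN invoices i ON c.CustomerId = i.CustomerId
JOIN invoice_items ii ON i.InvoiceId = ii.InvoiceId
GROUP BY c.CustomerId
ORDER BY Total DESC
LIMIT 20;
"""
pd.read_sql(spend_query, conn)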
For this review, we will take a look at three separate APIs and work through the process of writing requests based on each API's documentation.
This API provides public holiday information for more than 90 countries.
The API's Documentation can be found here
Write a request to return all available countries
# Your code here
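The description matches the Nager.Date API; a sketch assuming that service and its v3 endpoint:

# All countries the API covers
countries = requests.get('https://date.nager.at/api/v3/AvailableCountries').json()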
Convert the results of our request to a DataFrame
# Your code here
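Continuing the sketch:

countries_df = pd.DataFrame(countries)
countries_df.head()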
What is the key for the United States?
# Your code here
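Assuming each record carries name and countryCode fields, as Nager.Date's responses do:

# Country code for the United States (expected: 'US')
countries_df.loc[countries_df['name'] == 'United States', 'countryCode']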
Make a request to the API that returns the public holidays for the United States
# Your code here
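A sketch, with the year as an assumed parameter:

# Public holidays for the US in a chosen year
us = requests.get('https://date.nager.at/api/v3/PublicHolidays/2023/US').json()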
Convert us to a DataFrame
# Your code here
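Continuing the sketch:

us_df = pd.DataFrame(us)
us_df.head()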
Documentation for this API can be found here
Submit a request to the iTunes API that returns data on Harry Potter Audio Books
# Your code here
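A sketch using the iTunes Search API's search endpoint:

# Audiobook results for "harry potter"
harry_potter = requests.get(
    'https://itunes.apple.com/search',
    params={'term': 'harry potter', 'entity': 'audiobook'}
).json()
harry_potter['resultCount']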
Using the data from the Harry Potter Audio Books request, collect the artistId for each entry and use those IDs to make a single https://itunes.apple.com/lookup?id={}&entity=audiobook&sort=recent request.
To do this:
- Every id should be added to a string
- Each id should be followed by a comma, i.e. id1,id2,id3,id4
- The final id should not be followed by a comma
- No id should be added to the string more than once.
# Your code here
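One way to build the string, assuming the search response was stored as harry_potter (as in the sketch above):

# Collect unique artistIds in order of first appearance
artist_ids = []
for result in harry_potter['results']:
    artist_id = result.get('artistId')
    if artist_id is not None and artist_id not in artist_ids:
        artist_ids.append(artist_id)

# Comma-separated, no trailing comma, no repeats
ARTIST_IDS = ','.join(str(i) for i in artist_ids)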
# Run this cell!
REQUEST = 'https://itunes.apple.com/lookup?id={}&entity=audiobook&sort=recent'.format(ARTIST_IDS)
req = requests.get(REQUEST).json()
number_of_results = req['resultCount']
print('Number of results:', number_of_results)
Number of results: 123