This morning we will be querying data from the Chinook Database and collecting new information with the iTunes API!
# SQL Connection and Querying
import sqlite3
# Data manipulation
import pandas as pd
# API Connection
import requests
#for tests
from test_background import test_obj_dict, run_test_dict, pkl_dump, run_test
The database in stores within the data
folder of this repo as chinook.db
In the cell below:
- Open up a connection to
chinook.db
and store that connection asconn
# Your code here
conn = None
Let's create a function called sql_df
that returns a dataframe of a SQL query.
In the cell below:
- Define a function called
sql_df
that takes in two parametersquery
, andconnection
. - Return a dataframe from the function using pd.read_sql
# Your code here
Above is the schema for the Chinook Database.
For this warmup, we will be focusing on the tracks
, albums
, and artists
tables.
We'll start with something simple for our first SQL query.
In the cell below:
- Write a SQL query to collect all columns from the
tracks
table LIMIT
the query to only return two records from the table.- Run the query through the
sql_df
function and save the results as the variablefirst_query
.
#Your code here
QUERY = None
first_query = sql_df(None)
Run the cell below to see if your query returned the correct data!
run_test(first_query, 'first_query')
Ok ok
Let's do a more complex query.
In the cell below:
- Write a sql query that selects
- The
Name
column from thetracks
table- Alias this column as
Song
- Alias this column as
- The
Title
column from thealbums
table- Alias this column as
Album
- Alias this column as
- The
Name
column from theartists
table- Alias this column as
Artist
- Alias this column as
- Use the
WHERE
command to only return results whereArtist = 'U2'
LIMIT
the results to 15 observations.
- The
Hint: This will require you to first join the tracks
and albums
tables, and then join the artists
table.
We'll save the results of this query to the variable df
# Your code here
QUERY = None
df = sql_df(QUERY, conn)
df.head()
Run the cell below to see if you returned the correct results!
run_test(df, 'query_to_df')
Song | Album | Artist |
---|---|---|
Name of song |
Name of album |
Name of artist |
Let's collect the release date of each song using the iTunes Search API and add it as a column to our dataframe
To do this, we will use three paramaters. term
, entity
, and limit
.
term
The iTunes API interprets this paramater the same way it iterprets a search term you would manually type when searching for songs on iTunes
entity
This parameter is a filter for our search. We can filter our search to only return albums, music videos, etc. For this warmup we will be filtering our search to only return songs
limit
The limit parameter determines how many search results are returned. The minimum is 50, the maximum is 200. For this warmup, we will set this to 200
The iTunes API only allows 20 requests per minute.
We could search each song individually with
req_string = 'https://itunes.apple.com/search?term={}&entity=song&limit=200'.format(song)
requests.get(req_string).json()
But what if our code malfunctions? We risk sending too many requests, at which point the iTunes API will refuse our requests! We don't want that.
Fortunately for us, we can collect the data we need with a single request.
Instead of using the name of the song as our search term, we will use the name of the artist and keep our entity parameter as song
so the results return individual songs created by the artist.
In the cell below:
- Replace
None
with the variable needed to search the iTunes API for songs by the artistU2
req_string = 'https://itunes.apple.com/search?term={}&entity=song&limit=200'.format(None)
Run the cell below to see if your req_string
variable is correct!
run_test(req_string, 'req_string')
Now that we have our req_string, we can send our request to the API using the requests
library.
req = requests.get(req_string).json()
The data returned from the API is formatted as a json
which for most intents and purposes is just a dictionary.
The information we want from this json is found with the 'results'
key.
api_data = req['results']
type(api_data)
Ok ok! Now that we have our data from the api, Here's what we need to do:
- Create an empty list called
dates
. - Loop over the rows of our dataframe
- Collect the song name and artist name from the dataframe row.
- Loop over each result in
api_data
. - Check if:
- The song in our dataframe row is equal to
result['trackName']
- The artist in our dataframe row is equal to
result['artistName']
- The song in our dataframe row is equal to
- If the song and artist match:
- Set the variable
release_date
toresult['releaseDate']
- Append
release_date
to thedates
list break
out of the nested for loop.
- Set the variable
- If the song and artist do not match:
- Set
release_date
toNone
- Set
- If the nested for loop completes and
release_date
still equalsNone
:- Append
None
to thedates
list.
- Append
- Add a
release_date
column todf
using thedates
list.
dates = []
# Your code here
Let's take a look at the results
# Run this cell as is
df
Run the cell below to check if you collected the correct release dates!
run_test(df, 'release_date')
It looks like there are several None
's in our release_date
column. And there is a reason for this.
The word the
is capitalized in our Chinook database but is lowercased by the iTunes API.
Run the cell below to see why this is a problem.
'Even Better Than The Real Thing' == 'Even Better Than the Real Thing'
Yikes! So how do we solve this?
We can solve this by lowering
all strings when we compare them so no matter what, the strings are cased exactly the same.
'Even Better Than The Real Thing'.lower() == 'Even Better Than the Real Thing'.lower()
Copy and paste your code from above in the cell below, except this time lower all string values.
# Your code here
# Run this cell as is
df
Run the cell below to see if you were successful!
run_test(df, 'string_manip')