We've seen how the Yelp API works and how to create basic visualizations using Folium. It's time to put those skills to work in order to create a working map! Taking things a step further, you'll also independently explore how to perform pagination in order to retrieve a full results set from the Yelp API.
You will be able to:
- Practice using functions to organize your code
- Use pagination to retrieve all results from an API query
- Practice parsing data returned from an API query
- Practice interpreting visualizations of a dataset
- Create maps using Folium
Photo by Jordan Madrid on Unsplash
You've now worked with some API calls, but we have yet to see how to retrieve a more complete dataset in a programmatic manner. In this lab, you will write a query of businesses on Yelp, then use pagination to retrieve all possible results for that query. Then you will create a summary of your findings, including a Folium map of the geographic locations of those businesses.
Returning to the Yelp API, the documentation also provides details about the API limits. These often include the number of requests a user is allowed to make within a specified time period and the maximum number of results that can be returned. In this case, we are told that each request returns at most 50 results and defaults to 20. Furthermore, any search is limited to a total of 1000 results. To retrieve all 1000 of these results, we would have to page through the results piece by piece, retrieving 50 at a time. Processes such as these are often referred to as pagination.
Also, be mindful of the API rate limits. You can only make 5000 requests per day, and it is also possible to make requests too quickly. Start prototyping small before running a loop that could be faulty. You can also use time.sleep(n) to add delays. For more details see https://www.yelp.com/developers/documentation/v3/rate_limiting.
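As a rough illustration of pacing calls with time.sleep, the sketch below pauses between consecutive iterations of a loop. The helper name paced_calls and the 0.1 second delay are assumptions made for this example only; no real request is sent, and the timestamp stands in for wherever an API call would go.

```python
import time


def paced_calls(n_calls, delay_seconds):
    """Run n_calls placeholder 'requests', pausing between consecutive calls."""
    timestamps = []
    for i in range(n_calls):
        # Sleep before every call except the first one
        if i > 0:
            time.sleep(delay_seconds)
        # Placeholder for the real work (e.g. an API request)
        timestamps.append(time.monotonic())
    return timestamps


stamps = paced_calls(3, 0.1)
gaps = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
print(gaps)
```

Each gap should come out at roughly the requested delay, which is the effect you want between real requests.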
In this lab, you will define a search and then paginate over the results to retrieve all of the results. You'll then parse these responses as a list of dictionaries (for further exploration) and create a map using Folium to visualize the results geographically.
Start by filling in your API key to make the initial request to the business search API. Investigate the structure of the response you get back and start figuring out how you will extract the relevant information.
Using loops and functions, collect the maximum number of results for your query from the API.
Interpret visualizations related to the price range, average rating, and number of reviews for all query results.
Using latitude and longitude data, plot the query results on an interactive map.
Start by making an initial request to the Yelp API. Your search must include at least 2 parameters: term and location. For example, you might search for pizza restaurants in NYC. The term and location are up to you, but make the request below.
Use the requests library (documentation here).
You'll also need an API key from Yelp. If you haven't done this already, go to the Yelp Manage App page and create a new app (after making an account if you haven't already).
# Replace None with appropriate code
# Import the requests library
None
# Get this from the "Manage App" page. Make sure you set it
# back to None before pushing this to GitHub, since otherwise
# your credentials will be compromised
api_key = None
# These can be whatever you want! But the solution uses "pizza"
# and "New York NY" if you want to compare your work directly
term = None
location = None
# Set up params for request
url = "https://api.yelp.com/v3/businesses/search"
headers = {
    "Authorization": "Bearer {}".format(api_key)
}
url_params = {
    "term": term.replace(" ", "+"),
    "location": location.replace(" ", "+")
}
# Make the request using requests.get, passing in
# url, headers=headers, and params=url_params
response = None
# Confirm we got a 200 response
response
# Run this cell without changes
# Get the response body in JSON format
response_json = response.json()
# View the keys
response_json.keys()
Now, retrieve the value associated with the 'businesses' key, and inspect its contents.
# Replace None with appropriate code
# Retrieve the value from response_json
businesses = None
# View the first 2 records
businesses[:2]
Write a function prepare_data that takes in a list of dictionaries like businesses and returns a copy that has been prepared for analysis:
- The coordinates key-value pair has been converted into two separate key-value pairs, latitude and longitude
- All other key-value pairs except for name, review_count, rating, and price have been dropped
- All dictionaries missing one of the relevant keys or containing null values have been dropped
In other words, the final keys for each dictionary should be name, review_count, rating, price, latitude, and longitude.
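To make the target shape concrete, here is a minimal sketch of flattening a single record. The sample_business dictionary is invented for illustration and only mimics the structure described above. This is one possible approach and not necessarily how the solution does it.

```python
# A made-up record shaped like one entry of the businesses list
sample_business = {
    "name": "Example Pizza",
    "review_count": 120,
    "rating": 4.5,
    "price": "$$",
    "coordinates": {"latitude": 40.7, "longitude": -74.0},
    "phone": "+15555550100",
}

wanted_keys = ["name", "review_count", "rating", "price"]

# dict.get returns None when a key is absent, so missing fields stay visible
flattened = {key: sample_business.get(key) for key in wanted_keys}

# Pull the nested coordinates up to the top level
coordinates = sample_business.get("coordinates") or {}
flattened["latitude"] = coordinates.get("latitude")
flattened["longitude"] = coordinates.get("longitude")

print(flattened)
```

Because absent keys come back as None, a later check can decide whether to keep or drop the record.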
Complete the function in the cell below:
# Replace None with appropriate code
def prepare_data(data_list):
    """
    This function takes in a list of dictionaries and prepares it
    for analysis
    """
    # Make a new list to hold results
    results = []
    for business_data in data_list:
        # Make a new dictionary to hold prepared data for this business
        prepared_data = {}
        # Extract name, review_count, rating, and price key-value pairs
        # from business_data and add to prepared_data
        # If a key is not present in business_data, add it to prepared_data
        # with an associated value of None
        None
        # Parse and add latitude and longitude columns
        None
        # Add to list if all values are present
        if all(prepared_data.values()):
            results.append(prepared_data)
    return results
# Test out function
prepared_businesses = prepare_data(businesses)
prepared_businesses[:5]
Check that your function created the correct keys:
# Run this cell without changes
assert sorted(list(prepared_businesses[0].keys())) == ['latitude', 'longitude', 'name', 'price', 'rating', 'review_count']
The output of the following code will differ depending on your query, but we expect there to be 20 businesses in the original list, and potentially fewer in the prepared list (if any of them were missing data):
# Run this cell without changes
print("Original:", len(businesses))
print("Prepared:", len(prepared_businesses))
Great! We will reuse this function once we have retrieved the full dataset.
Now that you are able to extract information from one page of the response, let's figure out how to request as many pages as possible.
Depending on the number of total results for your query, you will either retrieve all of the results, or just the first 1000 (if there are more than 1000 total).
We can find the total number of results using the "total" key:
# Run this cell without changes
response_json["total"]
(This is specific to the implementation of the Yelp API. Some APIs will just tell you that there are more pages, or will tell you the number of pages total, rather than the total number of results. If you're not sure, always check the documentation.)
In the cell below, assign the variable total to either the value shown above (if it is less than 1000), or 1000.
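One way to express "the reported value, but never more than 1000" is with the built-in min. The helper name cap_total and the sample inputs below are assumptions made for illustration.

```python
def cap_total(reported_total, api_cap=1000):
    """Return the number of results that can actually be retrieved."""
    return min(reported_total, api_cap)


print(cap_total(240))    # below the cap, so unchanged
print(cap_total(12000))  # above the cap, so limited to 1000
```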
# Replace None with appropriate code
total = None
The documentation states in the parameters section:
Name: limit, Type: int, Description: Optional. Number of business results to return. By default, it will return 20. Maximum is 50.
Name: offset, Type: int, Description: Optional. Offset the list of returned business results by this amount.
So, to get the most results with the fewest API calls we want to set a limit of 50 every time. If, say, we wanted to get 210 total results, that would mean:
- Offset of 0 (first 50 records)
- Offset of 50 (second 50 records)
- Offset of 100 (third 50 records)
- Offset of 150 (fourth 50 records)
- Offset of 200 (final 10 records)
In the cell below, create a function get_offsets that takes in a total and returns a list of offsets for that total. You can assume that there is a limit of 50 every time.
Hint: you can use range (documentation here) to do this in one line of code. Just make sure the returned result is a list.
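As a quick illustration of the hint, range accepts a step as its third argument. The sketch below reproduces the 210-result example from above. The variable names are chosen for this example only.

```python
# Start at 0, stop before 210, step by the page size of 50
page_starts = list(range(0, 210, 50))
print(page_starts)  # [0, 50, 100, 150, 200]

# Number of records each page would hold for 210 total results
page_sizes = [min(50, 210 - start) for start in page_starts]
print(page_sizes)  # [50, 50, 50, 50, 10]
```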
# Replace None with appropriate code
def get_offsets(total):
    """
    Get a list of offsets needed to get all pages
    of data up until the total
    """
    None
Check that your function works below:
# Run this cell without changes
assert get_offsets(200) == [0, 50, 100, 150]
assert get_offsets(210) == [0, 50, 100, 150, 200]
Recall that the following variable has already been declared for you:
# Run this cell without changes
url_params
We'll go ahead and also specify that the limit should be 50 every time:
# Run this cell without changes
url_params["limit"] = 50
In order to modify the offset, you'll need to add it to url_params with the key "offset" and whatever value is needed.
In the cell below, write code that:
- Creates an empty list for the full prepared dataset
- Loops over all of the offsets from get_offsets and makes an API call each time with the specified offset
- Calls prepare_data to get a cleaned version of the result of each API call
- Extends the full prepared dataset list with each query's prepared dataset
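The same loop pattern can be tried offline before spending real API calls. In the sketch below, fetch_page is a hypothetical stand-in that slices a list of fake records instead of contacting Yelp, so it only demonstrates the offset-and-extend logic.

```python
# Stand-in data source: 120 fake records instead of a live API
FAKE_RESULTS = [{"id": i} for i in range(120)]


def fetch_page(offset, limit=50):
    """Hypothetical replacement for an API call; returns one page of records."""
    return FAKE_RESULTS[offset:offset + limit]


all_records = []
for offset in range(0, len(FAKE_RESULTS), 50):
    page = fetch_page(offset)
    # extend keeps the list flat; append would nest each page as its own list
    all_records.extend(page)

print(len(all_records))  # 120
```

Once this pattern behaves as expected, the stand-in can be swapped for the real request and cleaning steps.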
# Replace None with appropriate code
# Create an empty list for the full prepared dataset
full_dataset = None
for offset in get_offsets(total):
    # Add or update the "offset" key-value pair in url_params
    None
    # Make the query and get the response
    response = requests.get(url, headers=headers, params=url_params)
    # Get the response body in JSON format
    response_json = None
    # Get the list of businesses from the response_json
    businesses = None
    # Call the prepare_data function to get a list of processed data
    prepared_businesses = None
    # Extend full_dataset with this list (don't append, or you'll get
    # a list of lists instead of a flat list)
    None
# Check the length of the full dataset. It will be up to `total`,
# potentially less if there were missing values
len(full_dataset)
This code may take up to a few minutes to run.
If you get an error trying to get the response body in JSON format, try adding time.sleep(1) right after the requests.get line, so your code will sleep for 1 second between each API call.
Take the businesses from the previous question and do an initial exploratory analysis. We have provided some plots for you to interpret:
# Run this cell without changes
from collections import Counter
import matplotlib.pyplot as plt
%matplotlib inline
fig, (ax1, ax2, ax3) = plt.subplots(ncols=3, figsize=(16, 5))
# Plot distribution of number of reviews
all_review_counts = [x["review_count"] for x in full_dataset]
ax1.hist(all_review_counts)
ax1.set_title("Review Count Distribution")
ax1.set_xlabel("Number of Reviews")
ax1.set_ylabel("Number of Businesses")
# Plot rating distribution
all_ratings = [x["rating"] for x in full_dataset]
rating_counter = Counter(all_ratings)
rating_keys = sorted(rating_counter.keys())
ax2.bar(rating_keys, [rating_counter[key] for key in rating_keys])
ax2.set_title("Rating Distribution")
ax2.set_xlabel("Rating")
ax2.set_ylabel("Number of Businesses")
# Plot price distribution
all_prices = [x["price"].replace("$", r"\$") for x in full_dataset]
price_counter = Counter(all_prices)
price_keys = sorted(price_counter.keys())
ax3.bar(price_keys, [price_counter[key] for key in price_keys])
ax3.set_title("Price Distribution")
ax3.set_xlabel("Price Category")
ax3.set_ylabel("Number of Businesses");
Describe the distributions displayed above and interpret them in the context of your query. (Your answer may differ from the solution branch depending on your query.)
# Replace None with appropriate text
"""
None
"""
In the cell below, we also plot the rating distributions by price. In this setup, a price of one dollar sign is "lower price" and everything else is "higher price".
# Run this cell without changes
higher_price = []
lower_price = []
for row in full_dataset:
    if row["price"] == "$":
        lower_price.append(row["rating"])
    else:
        higher_price.append(row["rating"])
fig, ax = plt.subplots()
ax.hist([higher_price, lower_price], label=["higher price", "lower price"], density=True)
ax.legend();
Is a higher price associated with a higher rating? (No need for any additional math/statistics, just interpret what you see in the plot.)
# Replace None with appropriate text
"""
None
"""
Finally, let's look at ratings vs. review counts:
# Run this cell without changes
fig, ax = plt.subplots(figsize=(16,5))
ax.scatter(all_review_counts, all_ratings, alpha=0.2)
ax.set_xlabel("Number of Reviews")
ax.set_ylabel("Rating")
# "zoom in" to a subset of review counts
ax.set_xlim(left=0, right=1000);
Is a higher number of reviews associated with a higher rating?
# Replace None with appropriate text
"""
None
"""
Make a map using Folium of the businesses you retrieved. Be sure to also add popups to the markers giving some basic information such as name, rating and price.
You can center the map around the latitude and longitude of the first item in full_dataset.
# Replace None with appropriate code
# Import the library
None
# Set up center latitude and longitude
center_lat = None
center_long = None
# Initialize map with center lat and long
yelp_map = None
# Adjust this limit to see more or fewer businesses
limit = 100
for business in full_dataset[:limit]:
    # Extract information about business
    lat = None
    long = None
    name = None
    rating = None
    price = None
    details = "{}\nPrice: {} Rating:{}".format(name, price, rating)
    # Create popup with relevant details
    popup = None
    # Create marker with relevant lat/long and popup
    marker = None
    marker.add_to(yelp_map)
yelp_map
Nice work! In this lab, you made multiple API calls to Yelp to paginate through a results set, performed some basic exploratory analysis, and created an interactive map of the results using Folium. Well done!