This repository analyzes scraped data on price competition among Amazon, Home Depot, Lowe's, and Walmart in digital markets. The data covers power drills.
For more background information on this market, see the following articles:
The main dataset file is `prices.csv`, located in the `data` folder. There is also `entry.csv`, which contains brick-and-mortar store locations for Lowe's and Home Depot; for more information on the entry dataset, see this Kaggle page. The `.sql` files in this folder can be used once `prices.csv` is loaded into a SQL database. The IPython notebooks in this repository are named after the dataset they use: notebooks that start with `price` use `prices.csv`, and those that start with `entry` use `entry.csv`.
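As a hedged sketch of how the `.sql` files might be used, the snippet below loads price rows into a SQLite database; the table name, column names, and inline sample row are illustrative assumptions, not the repository's schema.

```python
# Hypothetical sketch: load prices.csv into a SQLite database so the
# repository's .sql files could be run against it. Table name, columns,
# and the inline sample are assumptions, not the actual schema.
import csv
import io
import sqlite3

# Stand-in for open("data/prices.csv"); a tiny inline sample keeps the
# sketch self-contained.
sample = io.StringIO("product,retailer,price\nDrillX,amazon,99.99\n")
rows = list(csv.DictReader(sample))

conn = sqlite3.connect(":memory:")  # use "prices.db" for a file on disk
conn.execute("CREATE TABLE prices (product TEXT, retailer TEXT, price REAL)")
conn.executemany(
    "INSERT INTO prices VALUES (:product, :retailer, :price)", rows
)
count = conn.execute("SELECT COUNT(*) FROM prices").fetchone()[0]
print(count)  # one sample row inserted
```

With the real file, replace the `io.StringIO` stand-in with `open("data/prices.csv", newline="")`.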
The raw scraped data is in `prices.csv`. Because the data comes directly from the retailers, it is possible to distinguish whether the manufacturer or the retailer initiated a price cut. The scraper collected the following information:
- prices, and whether or not there was a sale (trade promotion)
- shipping prices, shipping times, and delivery methods
- the product's rank in the search results (sorted by best seller), plus ad information: an `ads` value of 1 means the listing was sponsored, 0 means it was not promoted
- inventory - up to 3 flexible columns record the quantity information available on each website. These variables have different meanings across the various retailers:
  - Amazon
    - quantity 1 - number of sellers
    - quantity 2 - rank in Amazon department
    - quantity 3 - limited stock (number left)
  - Walmart
    - quantity 1 - quantity in store
    - quantity 2 - whether there should be a special offer
    - quantity 3 - urgent quantity
  - Home Depot
    - quantity 1 - in-store stock
    - quantity 2 - limited quantity (number left)
  - Lowe's
    - quantity 1 - stock in store
    - quantity 2 - price below retail (so the number is not shared)
- store locations
- whether or not the product is in stock
- the date, in epoch time; this is the date the page was visited and the data was scraped
- the actual weight of the product, as a proxy for size
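Since the date variable is stored in epoch time, converting it to a calendar date and an ISO week (the panel's time unit, described below) takes one standard-library call. The timestamp here is an example value, not taken from the dataset.

```python
# Minimal sketch: convert an epoch-time `date` value to a calendar date
# and an ISO week number. The timestamp is an illustrative example.
from datetime import datetime, timezone

epoch_seconds = 1546300800  # example scrape timestamp (2019-01-01 UTC)
scraped = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
iso_year, iso_week, _ = scraped.isocalendar()
print(scraped.date(), iso_week)  # 2019-01-01 falls in ISO week 1
```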
The main file for preprocessing is `clean_prices.py`. It creates more consistent information for each of the sellers: `calc_rank` makes rank more interpretable, `calc_inven` tries to make inventory comparable across sellers, and `calc_ship` tries to convert the shipping information into a price and a date.
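To illustrate the idea behind making inventory comparable across sellers, here is a hedged sketch that maps each retailer's quantity columns (as documented above) onto a single units-in-stock figure. The function name and fallback logic are illustrative assumptions, not the repository's `calc_inven` implementation.

```python
# Illustrative sketch of normalizing inventory across retailers.
# Column meanings follow the README; the function itself is a
# hypothetical stand-in, not the repository's calc_inven.
def normalize_inventory(retailer, q1=None, q2=None, q3=None):
    """Return a best-effort units-in-stock number, or None if unknown."""
    if retailer == "amazon":
        return q3                 # quantity 3 = limited stock (number left)
    if retailer == "walmart":
        return q1                 # quantity 1 = quantity in store
    if retailer == "home_depot":
        # prefer the explicit "number left" when present (assumption)
        return q2 if q2 is not None else q1
    if retailer == "lowes":
        return q1                 # quantity 1 = stock in store
    return None

print(normalize_inventory("amazon", q1=5, q2=120, q3=3))  # → 3
```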
Within `clean_prices.py`, the function `create_panel` tries to create a balanced panel from the observations. The unit of observation is a product sold on a specific platform in a given week. The main variables in the panel are:
`['weight', 'reviews', 'rating', 'calc_rank', 'calc_inven', 'calc_promo', 'calc_ship']`
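Balancing a panel means every (product, platform) pair gets a row for every week, with gaps marked as missing. The sketch below shows that idea with toy data and plain dictionaries; the actual `create_panel` presumably does the equivalent reindexing over the variables listed above.

```python
# Illustrative sketch of balancing a panel: ensure every
# (product, platform) unit appears in every week, filling gaps
# with None. Toy data; not the repository's create_panel.
from itertools import product as cross

observed = {
    ("DrillX", "amazon", 1): {"calc_rank": 4, "calc_inven": 12},
    ("DrillX", "amazon", 2): {"calc_rank": 6, "calc_inven": 9},
    ("DrillX", "walmart", 1): {"calc_rank": 2, "calc_inven": 30},
}
units = {(p, s) for (p, s, _) in observed}   # (product, platform) pairs
weeks = {w for (_, _, w) in observed}

panel = {
    (p, s, w): observed.get((p, s, w))       # None marks a missing week
    for (p, s), w in cross(sorted(units), sorted(weeks))
}
print(len(panel), panel[("DrillX", "walmart", 2)])  # → 4 None
```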