WholeFoods-Datascraping-Project-Deployment

This data is scraped from the Whole Foods featured on-sale web page. The project features EDA, a comparison of Amazon Prime vs. non-Prime member discounts on sale products, and an app deployment that shows live insights.


Live Whole Foods 'On-Sale' Product Insights and Recommendation System Web Application

All scraped items are 'on-sale/discounted' only; if an item is not on sale for either regular customers or Prime members, it will not appear in the queried dataset.

The point of this app is to help Whole Foods shoppers make better purchasing decisions at their local store, improving their shopping experience and saving them money with specifics that are not shown on the website.

How this app's data is collected:

  • A user inputs their zipcode
  • A script scrapes unstructured product data from each category on the Whole Foods website for the store matching the user's inputted zipcode, then structures all of the data in a DataFrame (similar to an Excel spreadsheet)
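
Under the hood, that structuring step might look like the sketch below. The HTML snippet, class names, and regex are hypothetical stand-ins (the real page markup differs and changes over time), but the shape of the resulting DataFrame matches the dataset described later in this README.

```python
import re
import pandas as pd

# Hypothetical sample of the kind of unstructured markup a category page returns.
SAMPLE_HTML = """
<div class="product-tile">
  <span class="brand">Acme Foods</span>
  <span class="name">Dark Chocolate Bar</span>
  <span class="regular-price">$4.99</span>
  <span class="sale-price">$3.99</span>
</div>
<div class="product-tile">
  <span class="brand">Good Pasta Co</span>
  <span class="name">Organic Penne</span>
  <span class="regular-price">$2.49</span>
  <span class="sale-price">$1.99</span>
</div>
"""

# One pattern per product tile; named groups become DataFrame columns.
TILE_RE = re.compile(
    r'<span class="brand">(?P<company>[^<]+)</span>\s*'
    r'<span class="name">(?P<product>[^<]+)</span>\s*'
    r'<span class="regular-price">\$(?P<regular>[\d.]+)</span>\s*'
    r'<span class="sale-price">\$(?P<sale>[\d.]+)</span>'
)

def parse_products(html: str, category: str) -> pd.DataFrame:
    """Turn unstructured product tiles into one structured row per product."""
    rows = [m.groupdict() | {"category": category} for m in TILE_RE.finditer(html)]
    df = pd.DataFrame(rows)
    df[["regular", "sale"]] = df[["regular", "sale"]].astype(float)
    return df

df = parse_products(SAMPLE_HTML, "snacks")
print(df.shape)  # → (2, 5)
```

A real scraper would fetch one such page per category for the zipcode's store and concatenate the per-category frames.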

What this app does:

  • It shows many graphs of the queried 'on-sale/discounted' items from the user's store or other users' stores, to help understand how much of each product/category is on sale

  • It generates a shopping cart of 'on-sale/discounted' items based on user keyword input ('chocolate, pasta'...) and a selected optimization parameter ('random', 'price', 'discount')

  • It recommends products to the user based on Instacart customer data, using a collaborative filtering approach and the user's generated shopping cart
  • For more information on the intuition behind the recommendation system, see the blog post
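
The cart-generation step could be sketched as below: filter on-sale items by each keyword, then pick one match per keyword according to the chosen optimization parameter. The function and sample data are a hypothetical simplification of the app's logic; column names follow the dataset table later in this README.

```python
import pandas as pd

def build_cart(df: pd.DataFrame, keywords: str, optimize: str = "price") -> pd.DataFrame:
    """Pick one on-sale item per comma-separated keyword."""
    picks = []
    for kw in [k.strip().lower() for k in keywords.split(",")]:
        matches = df[df["product"].str.lower().str.contains(kw, regex=False)]
        if matches.empty:
            continue
        if optimize == "price":        # cheapest sale price
            picks.append(matches.nsmallest(1, "sale"))
        elif optimize == "discount":   # steepest discount
            picks.append(matches.nlargest(1, "sale_discount"))
        else:                          # 'random' pick among matches
            picks.append(matches.sample(1))
    return pd.concat(picks, ignore_index=True) if picks else pd.DataFrame(columns=df.columns)

deals = pd.DataFrame({
    "product": ["Dark Chocolate Bar", "Milk Chocolate", "Organic Penne Pasta"],
    "sale": [3.99, 2.99, 1.99],
    "sale_discount": [20.0, 40.0, 15.0],
})
cart = build_cart(deals, "chocolate, pasta", optimize="discount")
# cart contains Milk Chocolate (40% off beats 20%) and Organic Penne Pasta
```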

Extra app features:

  • Search the queried dataset based on keywords for anything specific
  • Download the queried dataset as a CSV

Dataset information:

  • Any user-queried data gets wrangled/cleaned/manipulated to handle edge cases such as website element changes, product title mismatches, and other errors that can arise when scraping product information. It is then structured with the following columns (any column marked with a * is a created feature):
feature                    dtype description
---------------------------------------------------------------------------
company                   object [product company name]
product                   object [product title]
regular                  float64 [regular product price]
sale                     float64 [on-sale product price] 
prime                    float64 [on-sale product price for prime members]
category                  object [Whole Foods category]
sale_discount            float64 [sale discount percentage] *
prime_discount           float64 [prime discount percentage] *
prime_sale_difference    float64 [prime discount - sale discount] *
discount_bins             object [discount bins, e.g. 0% Off to 10% Off] *
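
The starred columns could be derived from the three scraped price columns as sketched below; the exact bin edges and labels here are illustrative, not the project's actual choices.

```python
import pandas as pd

def add_discount_features(df: pd.DataFrame) -> pd.DataFrame:
    """Compute the created (*) feature columns from regular/sale/prime prices."""
    df = df.copy()
    df["sale_discount"] = (1 - df["sale"] / df["regular"]) * 100
    df["prime_discount"] = (1 - df["prime"] / df["regular"]) * 100
    # positive when Prime members get a deeper discount than the regular sale
    df["prime_sale_difference"] = df["prime_discount"] - df["sale_discount"]
    df["discount_bins"] = pd.cut(
        df["sale_discount"],
        bins=[0, 10, 20, 30, 40, 100],   # illustrative edges
        labels=["0-10% Off", "10-20% Off", "20-30% Off", "30-40% Off", "40%+ Off"],
    )
    return df

prices = pd.DataFrame({"regular": [4.00, 10.00], "sale": [3.00, 9.00], "prime": [2.00, 8.00]})
out = add_discount_features(prices)
# first row: sale_discount 25.0, prime_discount 50.0, bin "20-30% Off"
```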

Recommendation system using collaborative filtering:

  • Recommendations are driven by parsing products into categories

    • rule-based data parsing/cleaning/lemmatization
    • word embedding (parsing) using a spaCy pre-trained model
    • designing the taxonomy (categories) from scratch to have a unique signature (1,400 items per dataset on average --> 99 categories)
    • all of which is automated and preprocessed using a transformer built with scikit-learn's BaseEstimator & TransformerMixin
  • Instacart's public datasets (3M customer orders plus related tables) are joined and transformed to match the taxonomy design of the signature above (99 categories)
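
The preprocessing transformer could be sketched as below: a scikit-learn compatible class built on BaseEstimator and TransformerMixin. The real pipeline uses rule-based cleaning plus spaCy word embeddings; here a hypothetical keyword-to-category map stands in for the learned 99-category taxonomy so the sketch stays self-contained.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

# Hypothetical slice of the taxonomy; the real project derives ~99 categories.
CATEGORY_KEYWORDS = {
    "chocolate": "chocolate & candy",
    "pasta": "pasta & grains",
    "yogurt": "dairy",
}

class ProductCategorizer(BaseEstimator, TransformerMixin):
    """Maps raw product titles onto taxonomy categories inside a sklearn pipeline."""

    def fit(self, X, y=None):
        return self  # stateless in this sketch: nothing to learn

    def transform(self, X):
        X = X.copy()
        X["taxonomy_category"] = X["product"].str.lower().apply(self._categorize)
        return X

    @staticmethod
    def _categorize(title: str) -> str:
        for kw, cat in CATEGORY_KEYWORDS.items():
            if kw in title:
                return cat
        return "other"

items = pd.DataFrame({"product": ["Dark Chocolate Bar", "Organic Penne Pasta", "Olive Oil"]})
tagged = ProductCategorizer().fit_transform(items)
```

Because it implements fit/transform, the categorizer can be chained with other steps in a `sklearn.pipeline.Pipeline`, which is what makes the preprocessing "automated".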

  • The Apriori algorithm is applied to the designed dataset

  • Recommendations on the app are provided to the user based on association rules of Instacart customer data as well as the input of the user

  • Recommendations are based on a random selection of a category within the top 10 confidence values (confidence measures the percentage of times an item in category B is purchased, given that an item in category A was purchased). This reduces bias by avoiding selecting a category solely on the highest confidence.

[image: association-rule recommendation example]
*If a user were to have chocolate in their shopping cart, there is an equal chance that a product within the top 10 confidence values in each item_B category is recommended
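
The selection rule above can be illustrated with a toy example: compute confidence P(B | A) from co-purchase counts, then choose uniformly at random among the top-k categories instead of always taking the single highest-confidence one. The counts below are made up for the example.

```python
import random

# (category_A, category_B) -> (# orders containing both, # orders containing A)
co_counts = {
    ("chocolate", "pasta"): (30, 100),
    ("chocolate", "coffee"): (55, 100),
    ("chocolate", "nut butters"): (45, 100),
    ("chocolate", "ice cream"): (60, 100),
}

def top_k_candidates(rules, antecedent, k=10):
    """Rank consequent categories by confidence and keep the top k."""
    confidences = {
        b: both / a_total
        for (a, b), (both, a_total) in rules.items()
        if a == antecedent
    }
    ranked = sorted(confidences, key=confidences.get, reverse=True)
    return ranked[:k]

candidates = top_k_candidates(co_counts, "chocolate", k=3)
# candidates: ice cream (0.60), coffee (0.55), nut butters (0.45)
recommended = random.choice(candidates)  # equal chance among the top k
```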

This project is deployed via Streamlit, which uses a Debian-based Linux image on their cloud. A big thanks to them for making their platform easy for data scientists like myself to use.