This project scrapes and stores grocery product information from ALDI's website at regular intervals, including the full inventory and the items in the supermarket chain's "Aisle of Shame" or "finds" category. It was inspired by Parija Kavilanz's CNN story.
The project — and its documentation — are still a work in progress.
The project relies on two scripts: `fetch_all_products.py`, which captures the full product catalog using the supermarket's API, and `fetch_aisle_products.py`, which pulls weekly updates from the "finds" pages. Both run weekly, and snapshots of past files are stored on AWS S3.
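For orientation, here is a minimal sketch of the paginated-fetch pattern a catalog script like this can use. The endpoint URL, parameter names, and response shape below are illustrative assumptions, not ALDI's actual API; see `fetch_all_products.py` for the real details.

```python
import requests

# Hypothetical endpoint and parameters -- placeholders for illustration only.
API_URL = "https://api.example-aldi.test/products"
PAGE_SIZE = 60

def fetch_all_products():
    """Page through a product-search API until no results remain."""
    products, offset = [], 0
    while True:
        resp = requests.get(
            API_URL,
            params={"limit": PAGE_SIZE, "offset": offset},
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json().get("data", [])  # assumed response key
        if not batch:
            break
        products.extend(batch)
        offset += PAGE_SIZE
    return products
```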
Prerequisites:
- Python 3.8+ installed on your machine.
- Necessary Python packages: `requests`, `pandas`, `tqdm` for full catalog collection; `beautifulsoup4`, `boto3` for weekly specials.
Setup:

```bash
pip install requests pandas tqdm beautifulsoup4 boto3
```
Execution:

```bash
python fetch_all_products.py    # For full catalog collection
python fetch_aisle_products.py  # For weekly "finds" updates
```
These scripts can be executed automatically via GitHub Actions, ensuring regular data updates without manual intervention.
The data is stored in CSV and JSON files, with configuration to upload them to AWS S3 or another specified service. The outputs include the following fields:
| Field Name | Description | Example |
|---|---|---|
| `sku` | Stock Keeping Unit identifier | "0000000000000005" |
| `name` | Product name | "Original Kettle Chips, 8 oz" |
| `brand_name` | Product brand name | "Clancy's" |
| `price_unit` | Product price unit | "oz" |
| `price` | Product price | "$7.99" |
| `description` | Description of the product | "Complete a tasty lunch with a handful ..." |
| `categories` | Product categories (parent, children) | "Snacks, Chips, Crackers & Popcorn" |
| `country_origin` | Product origin country | "USA and Imported" |
| `snap_eligible` | Supplemental Nutrition Assistance Program eligible | "True" |
| `formatted_price` | Formatted price with currency symbol | "$2.15" |
| `slug` | Slug for product URL | "clancy-s-original-kettle-chips-8-oz" |
| `image_url` | URL for product image | "https://dm.cms.aldi.cx/is/image/...{slug}" |
| `warning_code` | Health warnings in CA | "ca_prop_65" |
| `warning_desc` | Description of health warning | "Consuming this product can expose you to ..." |
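As a hedged example of working with this output (the filename `all_products.csv` is an assumption; check the script for the actual output path), the catalog can be loaded and filtered with pandas:

```python
import pandas as pd

# "all_products.csv" is an assumed output filename; adjust to the real path.
# Reading sku as str preserves leading zeros like "0000000000000005".
df = pd.read_csv("all_products.csv", dtype={"sku": str})

# snap_eligible may round-trip through CSV as the string "True"/"False",
# so compare against its string form rather than a bool.
snap_snacks = df[
    (df["snap_eligible"].astype(str) == "True")
    & df["categories"].str.contains("Snacks", na=False)
]
print(snap_snacks[["sku", "name", "brand_name", "formatted_price"]].head())
```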
Weekly finds output (via `fetch_aisle_products.py`), available as JSON and CSV:

| Field Name | Description | Example |
|---|---|---|
| `week_date` | Date range for which the products are listed | "04/17/24 - 04/23/24" |
| `category` | Category of the product | "Home Goods" |
| `brand` | Brand of the product | "Huntington Home" |
| `description` | Description of the product | "Luxury 2 Wick Candle" |
| `price` | Price of the product | "$4.99*" |
| `image` | URL of the product image | "https://www.aldi.us/.../candle.jpg" |
| `link` | URL to the product detail page | "https://www.aldi.us/.../candle/" |
| `week_start` | Start date of the week for the product listing | "04/17/24" |
| `week_end` | End date of the week for the product listing | "04/23/24" |
| `price_clean` | Numeric price of the product | 4.99 |
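The `price_clean` field is the numeric form of the display `price`. The helper below sketches the kind of normalization involved; it is illustrative, not the script's actual code:

```python
import re
from typing import Optional

def clean_price(price: str) -> Optional[float]:
    """Pull the numeric value out of a display price like "$4.99*"."""
    match = re.search(r"\d+(?:\.\d+)?", price)
    return float(match.group()) if match else None

assert clean_price("$4.99*") == 4.99  # example row from the table above
```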
The GitHub Actions workflows (`fetch_all_products.yml` for the full catalog and `fetch_aisle_products.yml` for the weekly finds) automate the execution of the scraping scripts on a regular schedule (overnight on Sundays).
These workflows include steps to set up Python, install dependencies, execute the scripts, and handle data storage:
- Set up Python: Installs Python and the required packages.
- Run scraper: Executes the Python scripts to fetch data and generate the output files.
- Upload data: The CSV files are uploaded to an AWS S3 bucket or another configured storage service (a sketch of this step follows below).
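A minimal sketch of that upload step, assuming a placeholder bucket name and key layout (the real values come from the workflow's configuration and repository secrets):

```python
from datetime import date

import boto3

# Bucket name, key scheme, and filename are placeholders for illustration.
s3 = boto3.client("s3")
key = f"snapshots/{date.today():%Y-%m-%d}/all_products.csv"
s3.upload_file("all_products.csv", "your-s3-bucket", key)
```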
Contributions to this project are welcome, especially in the following areas:
- Code enhancements: Suggestions for improving the scraping efficiency or expanding data collection features.
- Documentation: Updates to README or additional guidelines for users.
- Feature requests: Ideas for new features, additional data points to collect, or ways to visualize what's been collected.
This project is released under the MIT License. See the LICENSE file for more details.