Command line scraper for orders and item info from webshop(s)
This project was separated from its parent project, Homelab Organizer (having outgrown it), a web-based Django tool for keeping track of items and tools in your homelab.
Output from finished scrapers consists of one JSON file and one ZIP file in a subfolder of the output/ folder.
The JSON file will follow the JSON schema defined in output/schema.json. Any extra data available for the order or items will be added as keys to this file. All paths are relative to the root of the accompanying ZIP file.
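For illustration only, here is a hypothetical sketch of the shape such a file could take; the key names below are assumptions, and output/schema.json is the authoritative definition:

```json
{
  "shop": "adafruit",
  "orders": [
    {
      "id": "1234-5678",
      "items": [
        {
          "name": "Example item",
          "thumbnail": "items/1234-5678/thumb.jpg"
        }
      ]
    }
  ]
}
```

Note that a path value like the thumbnail above would point into the accompanying ZIP file.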
- Adafruit
- Complete. Does not require login. Requires minimal manual work (download) before starting.
- Aliexpress
- Complete.
- Amazon
- Complete. Tested on .com, .co.uk, .co.jp and .de.
- Polyalkemi.no
- Complete. Not much testing done.
- Komplett.no
- Complete.
- eBay
- Complete. Only tested on 31 orders, 35 items.
- Kjell.com
- Complete.
- Digikey
- Complete. Only tested on .no; needs modifications to support others. Does not require login. Requires manual work before and during item page parsing (Digikey is not Selenium friendly).
- NTR
Python 3.10 or later. Should support both Linux and Windows.
Requires Firefox installed (not from snap, see the instructions below), and a profile set up for the scraping.
Run
firefox -p
Create a new profile; in the examples below, the profile is named selenium and your username is `awesomeuser`.
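Some of the scrapers below ask you to log in to a shop manually with this profile before scraping. To open a browser window using the named profile, you can use Firefox's -P flag:

```sh
firefox -P selenium
```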
Find the path to the profile; you can start looking in these paths:
- Windows:
C:\Users\awesomeuser\AppData\Roaming\Mozilla\Firefox\Profiles\SOMETHINGRANDOM.selenium1
- Linux / Mac:
/home/awesomeuser/.mozilla/firefox/SOMETHINGRANDOM.selenium1
Add this path to the WS_FF_PROFILE_PATH_WINDOWS/LINUX/DARWIN config variable in .env.
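For example, with the profile paths above, the relevant lines in .env could look like this (illustrative values, use the path you actually found):

```sh
WS_FF_PROFILE_PATH_LINUX=/home/awesomeuser/.mozilla/firefox/SOMETHINGRANDOM.selenium1
WS_FF_PROFILE_PATH_WINDOWS=C:\Users\awesomeuser\AppData\Roaming\Mozilla\Firefox\Profiles\SOMETHINGRANDOM.selenium1
```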
Firefox installed as a snap on Ubuntu is not supported.
To change to an apt install on e.g. Ubuntu 22.04, read this article from omg!ubuntu.
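In short, the switch described in that article looks roughly like this (a condensed, untested sketch; follow the article for the full steps and explanations):

```sh
# Remove the snap and add the Mozilla team PPA
sudo snap remove firefox
sudo add-apt-repository ppa:mozillateam/ppa
# Pin the PPA so apt does not pull the snap transition package back in
printf 'Package: *\nPin: release o=LP-PPA-mozillateam\nPin-Priority: 1001\n' | sudo tee /etc/apt/preferences.d/mozilla-firefox
sudo apt update && sudo apt install firefox
```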
Tested on three orders, 28 items.
- Log in to https://www.adafruit.com/
- Click "Account" -> "My Account"
- Click "Order History" (https://www.adafruit.com/order_history)
- Click "Export Products CSV" and save "products_history.csv"
- Click "Export Orders CSV" and save "order_history.csv"
Run the command to see where you should put the files.
python scrape.py adafruit
python scrape.py adafruit --to-std-json
Tested on 229 orders, 409 items.
Scrapes the order list, item info and details to JSON, and saves a PDF copy of the Aliexpress item snapshot.
Aliexpress has a really annoying CAPTCHA that fails even if you complete it manually, as long as the browser is automated.
To bypass this, open your selenium Firefox profile and log in to Aliexpress manually before each scraping session.
Try not to resize or move the automated browser window while scraping. You will be prompted if you need to interact, e.g. to accept a CAPTCHA. If you happen to watch and see that a page is "stuck" and not loading, you can try a quick F5.
If you want to download the orders in batches, you can start with e.g. WB_ALI_ORDERS_MAX=10 and then increment it by 10 for each run. Remember to use --use-cached-orderlist so you do not have to scrape the order list every time.
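For example (a sketch assuming a POSIX shell and that the variable is also read from the environment; otherwise set it in .env between runs):

```sh
# First run: scrape the 10 newest orders
WB_ALI_ORDERS_MAX=10 python scrape.py aliexpress --use-cached-orderlist
# Later runs: raise the cap by 10 each time
WB_ALI_ORDERS_MAX=20 python scrape.py aliexpress --use-cached-orderlist
```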
python scrape.py aliexpress --use-cached-orderlist
python scrape.py aliexpress --to-std-json
Only tested on two orders.
This scraper supports the arguments
--skip-item-thumb
--skip-item-pdf
--skip-order-pdf
for scraping and export. They will skip storing the item thumbnail, the item PDF print, and the order invoice while scraping and exporting.
It also supports the option
--include-negative-orders
for export. It will include negative orders (returns) in the export. See the sketch below for a combined example.
python scrape.py polyalkemi
python scrape.py polyalkemi --to-std-json
Tested on 53 orders, 191 items.
Currently only supports the Norwegian shop front. (Swedish testers welcome!)
python scrape.py kjell
python scrape.py kjell --to-std-json
Tested on TLDs (orders/items):
- .de: 59/210
- .com: 12/15
- .co.uk: 8/11
- .co.jp: 2/2
# Scrape this year and archived orders on amazon.de
python scrape.py amazon --tld de --use-cached-orderlist
# Scrape orders from 2021 and 2023 on amazon.es
python scrape.py amazon --use-cached-orderlist --year 2021,2023 --not-archived --tld es
# Scrape all orders on amazon.co.jp from 2022 onwards, including archived orders
python scrape.py amazon --use-cached-orderlist --start-year 2022 --tld co.jp
# See help for details
python scrape.py --help
# Export scraped data for amazon.de
python scrape.py amazon --tld de --to-std-json
Tested on 80 orders, 155 items.
Make sure to log in to Komplett using the selenium Firefox profile BEFORE you start scraping.
Komplett has a weird scrape detector that makes Firefox give strange transport/TLS errors. If this happens, the script should tell you to clear all the profile data (clearing only the komplett.no data is not enough) and log in to Komplett.no again. You should then be able to continue from where you left off.
python scrape.py komplett
python scrape.py komplett --to-std-json
python scrape.py digikey
python scrape.py digikey --to-std-json
python scrape.py jula
python scrape.py jula --to-std-json
Terminal:
cd /some/folder
git clone https://gitlab.com/Kagee/webshop-order-scraper.git
cd webshop-order-scraper
python3 -m venv ./venv
source ./venv/bin/activate
python ./update.py
cp example.env .env
nano .env # Edit .env to your liking
For PDF files in A4:
- Printers and scanners
- Microsoft print to PDF
- Manage
- Printer properties
- Preferences
- Advanced...
- Paper Size: A4
CMD:
cd /some/folder
git clone https://gitlab.com/Kagee/webshop-order-scraper.git # or Github Desktop/other
cd webshop-order-scraper
# Create a python virtual environment
python -m venv venv
venv\Scripts\activate
python ./update.py
copy example.env .env
notepad .env # Edit .env to your liking
This simple script will output stats per shop based on output files.
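As an illustration only, a minimal sketch of that kind of stats pass (the output/ layout assumed here, one subfolder per shop, follows the description at the top of this README; the "orders"/"items" keys are assumptions, check output/schema.json for the real names):

```sh
# Illustrative only: per-shop order/item counts with jq
for f in output/*/*.json; do
  jq -r '"\(input_filename): \(.orders | length) orders, \([.orders[].items | length] | add // 0) items"' "$f"
done
```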
For steadfast bug fixing, having orders that totally scramble my scraping, and coming up with those excellent ideas when I have been struggling with a bug for an hour.
- Why am I using Firefox and not Chrome/other?
  - Printing efficiently to PDF is much easier in Firefox. Chrome also does not appear to produce actual (selectable) text in PDFs after printing, as Firefox does.
- Why am I not using webdriver.print_page to get a PDF?
  - In testing it created redonkulously large PDFs. We are talking 40-60 MB, where printing via the Mozilla/Microsoft printers created sub-10 MB PDFs.