For scraping a few static pages, use the Requests library with BeautifulSoup. To build a web spider or crawler, use Scrapy. But to scrape sites with complex logins, cookies, tokens, and data rendered by JavaScript, you will need a browser driver like Selenium.
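The first case can be sketched in a few lines, assuming `requests` and `beautifulsoup4` are installed; the URL and item name in the usage comment are placeholders:

```python
def page_lists_item(url: str, name: str) -> bool:
    """Fetch a page and report whether any link text mentions the item."""
    # Imports are local so the surrounding script can still load without these
    # third-party packages installed (pip install requests beautifulsoup4).
    import requests
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(requests.get(url, timeout=15).text, "html.parser")
    return any(name.lower() in tag.get_text().lower() for tag in soup.find_all("a"))

# e.g. page_lists_item("https://example.com/shop", "Space Cookie")
```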
To set up the environment:
pipenv shell
pipenv install -r requirements.txt
I wanted to be notified when Monkish Brewing made their annual drop of the phenomenal double milkshake IPA, Space Cookie. But I don't do IG so I needed another way to get the scoop.
Push notifications to my phone and computer are handled via the wonderful and free ntfy.sh service with the topic name MonkishSpaceCookie.
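Publishing to that topic is a single HTTP POST to ntfy.sh. A minimal stdlib sketch (the title and message text here are illustrative):

```python
from urllib import request

NTFY_BASE = "https://ntfy.sh"

def build_notification(topic: str, message: str, title: str) -> request.Request:
    # ntfy's publish API: POST the message body to https://ntfy.sh/<topic>,
    # with optional metadata such as a title passed as request headers.
    return request.Request(
        f"{NTFY_BASE}/{topic}",
        data=message.encode("utf-8"),
        headers={"Title": title},
        method="POST",
    )

def notify(topic: str, message: str, title: str = "Beer alert") -> int:
    with request.urlopen(build_notification(topic, message, title), timeout=10) as resp:
        return resp.status  # 200 on success

# notify("MonkishSpaceCookie", "Space Cookie is in the shop!")
```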
A cronjob runs once per day to invoke the intermediate space_cookie.sh, which runs space_cookie.py after initializing its virtual environment (and .env) to scrape their website in search of 'Space Cookie' among the available items. My crontab file for running daily at 11:15am, with stdout and stderr logged to crontab_log.txt:
# ┌───────────── minute (0 - 59)
# │ ┌───────────── hour (0 - 23)
# │ │ ┌───────────── day of month (1 - 31)
# │ │ │ ┌───────────── month (1 - 12)
# │ │ │ │ ┌───────────── day of week (0 - 6) (Sunday = 0)
# │ │ │ │ │
15 11 * * * /path/to/scraping/space_cookie.sh >> /path/to/scraping/crontab_log.txt 2>&1
# An empty line is required at the end of this file for a valid cron file!
A simple adaptation of Selenium's hello_selenium.py with some random URLs to use as a starting template.
Use Selenium to open Firefox, go to DuckDuckGo, and run a search for 'python web scraping': a simple test of filling out a form field and submitting data.
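A sketch of what that test might look like with Selenium 4 (assumes Firefox and geckodriver are installed; DuckDuckGo's search input is named `q`):

```python
QUERY = "python web scraping"

def search_duckduckgo(query: str = QUERY) -> str:
    """Open Firefox, search DuckDuckGo, and return the result page's title."""
    # Local imports so the file can be inspected without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys

    driver = webdriver.Firefox()
    try:
        driver.get("https://duckduckgo.com")
        box = driver.find_element(By.NAME, "q")  # the search field
        box.send_keys(query + Keys.RETURN)       # fill the field and submit
        return driver.title
    finally:
        driver.quit()
```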