Script to check whether a website has changed by comparing the previous sha256 hash with the current one.
- Websites to be monitored are listed in a personalised `.csv` file. An example of the initial configuration can be found in the file `example_websites.csv`. Here is the specification of the file (an example is sketched just below this list):
  - The first row must be the following: `,hash,filter,last_change_date`;
  - In each row, insert the URL of the website to be monitored;
  - Insert two commas `,,` after the website URL, as the script puts the most recent hash after the first comma;
  - If you want to monitor the whole page, skip this point. To monitor a single portion, put here the `id` or the `class` of the element of the webpage you want to monitor (this is necessary as some webpages change some elements each time they are refreshed, an image for example). To do so, open the webpage, press `CTRL+SHIFT+C` (or `CMD+SHIFT+C` on macOS) and click with the mouse on the desired element. Then check that the `class` or the `id` is unique by searching for it in the webpage's HTML code (press `CTRL+U`, or `CMD+U` on macOS);
  - Insert another comma `,`. Here the script will insert the last date it checked the webpage. This is for you, little human!
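
  A hypothetical example following this spec (the URLs and the `main-content` filter are placeholders of mine; `example_websites.csv` in the repo is the canonical reference):

  ```
  ,hash,filter,last_change_date
  https://example.com,,,
  https://example.org,,main-content,
  #https://example.net,,,
  ```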
- If you want the output to be sent to Telegram, ask the `@BotFather` bot to create a new bot for you. Get your `<telegram-token>` from the chat with `@BotFather`, add your bot to a group and access `https://api.telegram.org/bot<telegram-token>/getUpdates` to get your `<chat-id>` (a `curl` sketch follows this list);
- If you want to temporarily prevent a website from being checked, add a single `#` at the very beginning of its line, just before the website URL.
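
If you prefer the terminal to the browser, the same `getUpdates` endpoint can be queried with `curl` (a sketch; send any message in the group first, so that at least one update exists):

```sh
# replace <telegram-token> with the token you got from @BotFather
curl -s "https://api.telegram.org/bot<telegram-token>/getUpdates"
# your <chat-id> is in the JSON response, under result[].message.chat.id
```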
A Docker image is available on the GitHub Container Registry (see the `docker pull` command below). I've tested the image on Linux Mint 20, macOS 10.15 and Windows 10.

For Windows 10 users, the path must be specified as if it were a Linux path: for example, `/c/Users/BigWhale/Desktop` (not `C:\Users\BigWhale\Desktop`). Linux users probably know this already, but you may need to run the commands preceded by `sudo`.
Assuming you have a standard Docker installation, you can run the image with these commands:

```sh
# get the image from the GitHub Registry
# please, check for the most recent image!
docker pull ghcr.io/robin-castellani/website-monitor/website-monitor:0.2

# run it without Telegram, printing the output to the terminal (-t);
# assuming you have the <website-file.csv> in <path>, the -v flag
# is needed to map your local path to a path in the container
docker run -t \
  -v <path>:<path> \
  ghcr.io/robin-castellani/website-monitor/website-monitor:0.2 \
  <path>/<website-file.csv>

# to run it with Telegram, without printing the output to the terminal
# and without binding the terminal to the container (--detach)
docker run --detach \
  -v <path>:<path> \
  ghcr.io/robin-castellani/website-monitor/website-monitor:0.2 \
  --token <telegram-token> --chat-id <chat-id> \
  <path>/<website-file.csv>

# I suggest adding --name <container-name> when running the container;
# this way you can inspect its logs with
#   docker container logs <container-name>
# also, the --rm option removes the container once it has run
```
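
For example, combining those two last suggestions (the container name `website-monitor` here is just a placeholder of mine):

```sh
# named container, removed automatically once it has run
docker run --rm --name website-monitor -t \
  -v <path>:<path> \
  ghcr.io/robin-castellani/website-monitor/website-monitor:0.2 \
  <path>/<website-file.csv>
```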
- Install Python 3.7.7 (or Python 3.8.5) and every library in `requirements.txt` with `pip install -r requirements.txt`;
- Open a terminal window (or PowerShell in Windows) in the repository folder (maybe you have to `cd` to your directory) and type:

  ```sh
  python main.py <path>/<website-file.csv>
  ```

  Of course, replace `<path>/<website-file.csv>` with the path and the name of the file with the list of websites to monitor.

If you want to receive the results on Telegram, run

```sh
python main.py \
  --token <telegram-token> --chat-id <chat-id> \
  <path>/<website-file.csv>
```

and replace `<telegram-token>` and `<chat-id>` with the ones from the Telegram setup step above.
Three more options are available:

- `--repeat-every` (`-r`) repeats the check every `X` hours of your choice;
- `--max-repetition` (`-m`), only together with `--repeat-every`, lets you limit the maximum number of checks to perform;
- `--verbose` (`-v`) lets you set the verbosity of the output to the CLI.
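
For instance, a scheduled run could look like this (a sketch: the exact argument forms, e.g. whether `--verbose` takes a level, are my assumption from the descriptions above):

```sh
# check every 6 hours, at most 10 times, with verbose output
python main.py --repeat-every 6 --max-repetition 10 --verbose \
  <path>/<website-file.csv>
```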
Some websites use JavaScript to create the webpage inside your browser. For now, the website monitor script doesn't deal with browsers: it only gets the raw HTML, without running any JavaScript. Therefore, select carefully which websites you monitor! By the way, I'm working on this feature with Selenium.
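
The core idea, reduced to a one-liner (a sketch of the approach, not the script's actual code):

```sh
# whatever the server returns is all that gets hashed; anything a browser
# would render later via JavaScript never reaches the hash
curl -s https://example.com | sha256sum
```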
Also, this kind of scraping relies on the immutability of the website being scraped; if the website changes for any reason, it is quite likely that you will have to review your scraping parameters (see the configuration above). An alternative, which in the real world is hardly ever offered by website hosts, is to use an API exposed by the website, which ensures a far more stable experience in getting data. Maybe big recruiting companies have this kind of API!
For advanced users, it is possible to run the test suite, which is located in the `Test` folder. Anyway, it runs at every push to the repository through a GitHub Action.
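
Locally, assuming the suite is `pytest`-based (an assumption of mine; check the GitHub Action workflow for the exact command):

```sh
# run every test found in the Test folder
pytest Test
```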
All the material in this repo is available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.