This is a Python console application for checking the status of all the URLs of a static Jekyll site.
The primary use-cases are:
- Identifying dead links (the resource may have moved or disappeared after a post was written)
- Identifying which links can be upgraded from HTTP to HTTPS (the host may have acquired HTTPS support after a post was written)
These checks come in handy for general blog maintenance.
Run the https_sentry.py
script with arguments as described:
Usage: https_sentry <files directory> <optional: config file>
If no config file is specified https_sentry will look
for "config.yaml" in the current working directory.
- Multi-threading for file reads and link searches, network requests, and file writes
- URL cache to prevent duplicate network requests
- Flexible configuration
- Python 3
- PyYAML
This is a sample config.yaml
file:
# Protocols ###################################################################
# The protocol of the links to search for
protocol: http
# The protocol to use instead for the found links
upgrade_protocol: https
# Upgrading ###################################################################
# Whether to perform the protocol substitution
upgrade: False
# Whether to write the changes to disk for the successfully reached links
# Setting this to false will ensure a 'dry run'
upgrade_save: False
# Output ######################################################################
print_only_errors: False
# HTTP config #################################################################
user_agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0
# HEAD will make the server send only the header (saving time and bandwidth),
# while a response to GET will also include the body
method: HEAD
# URL regex ###################################################################
# The regex used to match URLs, without the protocol prefix (no http(s)://)
# https_sentry will prepend the appropriate prefix during the crawl
# This URL-matching regex was obtained from https://stackoverflow.com/a/3809435
url_regex: (www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,4}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)
# Threads #####################################################################
# Reads files and finds the links
crawler_threads: 4
# Checks the links - this is usually the bottleneck
net_checker_threads: 16
# Prints output and saves the files (if upgrade_save is True)
printer_threads: 4
- Set
protocol
tohttp
- Set
upgrade
toFalse
- Set
protocol
tohttps
- Set
upgrade
toFalse
- Set
protocol
tohttp
- Set
upgrade_protocol
tohttps
- Set
upgrade
toTrue
- Set
protocol
tohttp
- Set
upgrade_protocol
tohttps
- Set
upgrade
toTrue
- Set
upgrade_save
toTrue
- Better docstrings
- Handle incomplete config file
- Better config file validation
- Automated tests