raviqqe/muffet

Add support for warn versus error or ignore

spkane opened this issue · 8 comments

It would be nice to be able to categorize HTTP error codes and URL patterns so that some are reported but do not trigger an error.

That way you can still see less critical errors that you might want to fix, or at least be aware of.

At the moment, because things can only be ignored (or cause a failure), you may be forced to ignore a pattern entirely, which also makes you blind to any real failures that crop up later for that URL.

Ideally, this would build on the feature in #291, but it could be done initially with just the pattern arguments.

To build on this a bit, it would be really flexible if I could set this globally or per pattern. For example, www.unix.com always responds with a 403, so I would like to ignore, or only warn on, that 403 while still erroring on a 404 for that particular site. LinkedIn reports 999 on public profiles for whatever reason, so that is another useful example.

--exclude {name=www.unix.com/man-page/linux, ignore=403, warn=308}
--exclude {name=linkedin.com/in/, ignore=999}

*** INFO: [2023-03-09 17:48:22] Start checking: "https://example.com"
https://example.com/journal/unix-programming/
	403 (following redirect https://www.unix.com/man-page/linux/5/init/)	http://www.unix.com/man-page/linux/5/init/
*** ERROR: [2023-03-09 17:48:47] Something went wrong - see the errors above...
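
For now, the closest I can get with the existing flags is a two-pass run: one strict pass that gates the build, and a second pass that only reports. Just a rough sketch; the URL and patterns below are placeholders:

URL="https://example.com"                                  # placeholder
NOISY=(--exclude www.unix.com --exclude linkedin.com/in/)  # hosts to only warn on

# Pass 1: hard gate - the noisy hosts are excluded, so any other broken link fails the build.
muffet "${NOISY[@]}" "${URL}" || exit 1

# Pass 2: warn only - check everything but never fail; errors from the noisy
# hosts (403, 999, ...) still show up in the log.
muffet "${URL}" || echo "warning: excluded hosts reported errors (see above)"

The obvious downside is crawling the site twice, which is exactly why a first-class warn level would be nicer.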

What kinds of status codes do you want to mark as warnings? For example, is warning on 308 meant to help reduce redirects for SEO?

As an example, one might want to know about a redirect so that it can eventually be fixed, without it throwing an error and thereby breaking the deployment of a website change.

What is the size of your website? For example, how many pages and links does it have roughly?

It is not huge, but we do have a lot of long technical blog articles that tend to link out to other sites, whose links and general behavior are more likely to change or become invalid over time.

I could see value in being able to pass this information in via a config file when there are a lot of rules, in addition to simply supplying a few options on the command line when the rules are very simple.
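
Until something like that exists, the rules can at least live in a file and be expanded into --exclude flags by a small wrapper; a rough sketch (excludes.txt is just a hypothetical file with one pattern per line, # for comments):

ARGS=()
while IFS= read -r pattern; do
    # skip blank lines and comments
    [[ -z "${pattern}" || "${pattern}" == \#* ]] && continue
    ARGS+=(--exclude "${pattern}")
done < excludes.txt

muffet "${ARGS[@]}" --ignore-fragments "https://example.com"

That keeps the CI job definition short, but it still cannot distinguish warn from error, which is the point of this issue.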

I want to bump this issue/idea. I have a Hugo site with about 4,500 links that I check via GitLab CI. Basically every time I add a new blog post, the CI tests break and I need to update my exclude list. Currently, the script looks like the one below, with the ... standing in for many more --exclude lines.

#!/bin/bash

LOCAL_HOST="http://localhost:1313/links/"
MAX_WAIT_TIME=60 # ~30 sec total (60 x 0.5 s sleeps)
OPTIONS="--exclude 'reddit.com' \
         --exclude 'anaconda.org' \
         --exclude 'arxiv.org' \
         --exclude 'docker.com' \
         --exclude 'stackoverflow.com' \
         --exclude 'linuxize.com' \
         --exclude 'cyberciti.biz' \
         --exclude 'gitlab.yourgitlab.com' \
         --exclude 'openai.com' \
         --exclude '^*.webm$' \
         ...
         --ignore-fragments \
         --max-response-body-size 100000000 \
         --junit > rspec.xml"

for i in $(seq 0 ${MAX_WAIT_TIME}); do # poll for up to ~30 sec
    sleep 0.5
    IS_SERVER_RUNNING=$(curl -LI ${LOCAL_HOST} -o /dev/null -w '%{http_code}' -s)
    if [[ "${IS_SERVER_RUNNING}" == "200" ]]; then
        eval muffet "${OPTIONS}" ${LOCAL_HOST} && exit 0 || exit 1
    fi
done

echo "error: time out $((${MAX_WAIT_TIME}/2)) sec" && exit 1