Htmlproofer failing in CI runs
SethTisue opened this issue · 15 comments
at e.g. https://github.com/scala/scala-lang/runs/6236625743 we see:
htmlproofer 3.10.2 | Error: HTML-Proofer found 3 failures!
- ./_site/blog/2017/08/28/gsoc-connecting-contributors-with-projects.html
* External link https://developer.github.com/v3/ failed: 403 No error
* External link https://developer.github.com/v4/ failed: 403 No error
- ./_site/blog/2018/06/04/scalac-profiling.html
* External link https://docs.github.com/en/authentication/connecting-to-github-with-ssh failed: 403 No error
Error: Process completed with exit code 1.
I've updated two of those links in #1375, because they changed. Not sure why we're getting a 403 Forbidden error though. Maybe the IP address the build is running from is blocked?
it's strange, I don't know what to make of it
after #1375, remaining failures are:
- ./_site/blog/2017/08/28/gsoc-connecting-contributors-with-projects.html
* External link https://docs.github.com/en/graphql failed: 403 No error
* External link https://docs.github.com/en/rest failed: 403 No error
htmlproofer 3.10.2 | Error: HTML-Proofer found 3 failures!
- ./_site/blog/2018/06/04/scalac-profiling.html
* External link https://docs.github.com/en/authentication/connecting-to-github-with-ssh failed: 403 No error
How odd. I tried to curl https://docs.github.com/en/rest and I also got a 403.
I was able to get a 200 by adding an Accept-Encoding header to the request that explicitly specified at least one compression algorithm. For example, it liked Accept-Encoding: gzip, identity but not Accept-Encoding: identity or Accept-Encoding: *
Not sure what to make of that. Maybe the server has been configured to only send compressed responses?
I don't know how htmlproofer works or what request headers it sends.
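For anyone who wants to repeat the curl experiment above from a script, here is a minimal Ruby sketch using only the standard library. The helper names `build_probe` and `probe_status` are made up for illustration, and Net::HTTP sends its own User-Agent and Accept headers, so the status codes it sees may differ from what curl or htmlproofer get.

```ruby
require 'net/http'
require 'uri'

# Accept-Encoding values tried in the comment above.
ENCODINGS = ['gzip, identity', 'identity', '*'].freeze

# Hypothetical helper: build a GET request with an explicit Accept-Encoding
# header. Supplying the header ourselves replaces the default value that
# Net::HTTP would otherwise add.
def build_probe(url, accept_encoding)
  Net::HTTP::Get.new(URI(url), { 'Accept-Encoding' => accept_encoding })
end

# Hypothetical helper: send one probe and return the HTTP status code as an
# Integer. Needs network access.
def probe_status(url, accept_encoding)
  uri = URI(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(build_probe(url, accept_encoding)).code.to_i
  end
end

# Usage (left as a comment because it hits the network):
#   ENCODINGS.each do |enc|
#     puts "#{enc.inspect} -> #{probe_status('https://docs.github.com/en/rest', enc)}"
#   end
```

Only the status code is read, so the sketch never needs to decompress a response body.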
latest run: https://github.com/scala/scala-lang/runs/6250840445
and... there is a massive amount of 403s :-(
not sure what to make of that. do we revert #1376 (and #1378), since they seem to have made matters worse?
or perhaps it's just because there were several runs in close succession and so we're getting rate-limited? normally the cron job only runs once/day
let's see how the next cron run does
I think it made things worse. :(
Still tons of 403s at https://github.com/scala/scala-lang/runs/6282395653 :-/
I ran the check on a single directory locally:
With the recent Accept-Encoding "fix":
$ bundle exec htmlproofer ./_site/blog/2017/08/28/ --external_only --only-4xx --http-status-ignore "400,401,429" --empty-alt-ignore --allow-hash-href --url-ignore "/trends.google.com/,/pgp.mit.edu/,/www.oracle.com/,/scalafiddle.io/" --typhoeus-config='{"headers":{"Accept-Encoding":"gzip, deflate"}}'
Running ["LinkCheck", "ImageCheck", "ScriptCheck"] on ["./_site/blog/2017/08/28/"] on *.html...
Checking 37 external links...
Ran on 1 file!
- ./_site/blog/2017/08/28/gsoc-connecting-contributors-with-projects.html
* External link https://index.scala-lang.org failed: 403 No error
* External link https://index.scala-lang.org/ failed: 403 No error
* External link https://index.scala-lang.org/search?q=&contributingSearch=true failed: 403 No error
HTML-Proofer found 3 failures!
and without:
$ bundle exec htmlproofer ./_site/blog/2017/08/28/ --external_only --only-4xx --http-status-ignore "400,401,429" --empty-alt-ignore --allow-hash-href --url-ignore "/trends.google.com/,/pgp.mit.edu/,/www.oracle.com/,/scalafiddle.io/"
Running ["LinkCheck", "ImageCheck", "ScriptCheck"] on ["./_site/blog/2017/08/28/"] on *.html...
Checking 37 external links...
Ran on 1 file!
- ./_site/blog/2017/08/28/gsoc-connecting-contributors-with-projects.html
* External link https://docs.github.com/en/graphql failed: 403 No error
* External link https://docs.github.com/en/rest failed: 403 No error
HTML-Proofer found 2 failures!
Which at least reproduces what we're seeing in CI on the GitHub Actions runner.
But why do some sites fail without the Accept-Encoding header and others fail with it? With the header set, curl seems to work fine here on all of them. 🤷 I guess if I'm in the mood for a puzzle later I'll take a look.
Ah, so the problem seems to be that specifying --typhoeus-config on the command line discards all of the html-proofer default Typhoeus configuration. Mildly annoying that it doesn't just perform a dictionary update. So all the defaults need to be re-specified on the command line (as appropriate, of course). The defaults appear to be these:
TYPHOEUS_DEFAULTS = {
  followlocation: true,
  headers: {
    'User-Agent' => "Mozilla/5.0 (compatible; HTML Proofer/#{HTMLProofer::VERSION}; +https://github.com/gjtorikian/html-proofer)",
    'Accept' => 'application/xml,application/xhtml+xml,text/html;q=0.9, text/plain;q=0.8,image/png,*/*;q=0.5'
  },
  connecttimeout: 10,
  timeout: 30
}
I tried a full htmlproofer run locally with all these settings + the Accept-Encoding header and it succeeded.
$ bundle exec htmlproofer ./_site/ --external_only --only-4xx --http-status-ignore "400,401,429" --empty-alt-ignore --allow-hash-href --url-ignore "/trends.google.com/,/pgp.mit.edu/,/www.oracle.com/,/scalafiddle.io/" --typhoeus-config='{"headers":{"Accept-Encoding":"gzip, deflate", "Accept":"application/xml,application/xhtml+xml,text/html;q=0.9, text/plain;q=0.8,image/png,*/*;q=0.5", "User-Agent":"Mozilla/5.0 (compatible; HTML Proofer/#{HTMLProofer::VERSION}; +https://github.com/gjtorikian/html-proofer)"}, "followlocation":"true", "connecttimeout":"10", "timeout":"30"}'
Running ["LinkCheck", "ImageCheck", "ScriptCheck"] on ["./_site/"] on *.html...
Checking 4974 external links...
Ran on 367 files!
HTML-Proofer finished successfully.
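To illustrate the difference between the "replace" behaviour described above and the "dictionary update" one might have expected, here is a small Ruby sketch. It is not html-proofer's code: `deep_merge` is a hypothetical helper, `PROOFER_VERSION` is a stand-in for `HTMLProofer::VERSION` (set to the 3.10.2 shown in the CI log), and the defaults are copied from the snippet quoted above.

```ruby
require 'json'

# Stand-in for HTMLProofer::VERSION (assumption: 3.10.2, as in the CI log).
PROOFER_VERSION = '3.10.2'

# Defaults as quoted in the thread.
TYPHOEUS_DEFAULTS = {
  followlocation: true,
  headers: {
    'User-Agent' => "Mozilla/5.0 (compatible; HTML Proofer/#{PROOFER_VERSION}; +https://github.com/gjtorikian/html-proofer)",
    'Accept' => 'application/xml,application/xhtml+xml,text/html;q=0.9, text/plain;q=0.8,image/png,*/*;q=0.5'
  },
  connecttimeout: 10,
  timeout: 30
}.freeze

# Hypothetical helper: merge override into base, recursing into nested hashes
# so that adding one header does not drop the others.
def deep_merge(base, override)
  base.merge(override) do |_key, old, new|
    old.is_a?(Hash) && new.is_a?(Hash) ? deep_merge(old, new) : new
  end
end

# The one setting we actually want to add.
OVERRIDE = { headers: { 'Accept-Encoding' => 'gzip, deflate' } }.freeze

# Replace semantics (what the thread observed): only the override survives.
replaced = OVERRIDE

# Merge semantics: defaults are kept and the new header is added.
merged = deep_merge(TYPHOEUS_DEFAULTS, OVERRIDE)

puts "replaced keeps timeout? #{replaced.key?(:timeout)}"
puts "merged keeps timeout?   #{merged.key?(:timeout)}"

# JSON form of the merged settings, with the version interpolated by Ruby and
# the booleans and numbers kept as real JSON values rather than strings.
puts JSON.generate(merged)
```

Generating the JSON this way also avoids typing the full defaults by hand when building the --typhoeus-config argument.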
merged... we'll see what happens in the next scheduled run...
- ./_site/2019/12/18/road-to-scala-3.html
* External link https://docs.scala-lang.org/scala3/reference/metaprogramming.html failed: 404 No error
- ./_site/2020/11/06/explicit-term-inference-in-scala-3.html
* External link https://docs.scala-lang.org/scala3/reference/contextual.html failed: 404 No error
- ./_site/community/index.html
* External link https://groups.google.com/g/scala-announce failed: 403 No error
* External link https://groups.google.com/g/scala-tools failed: 403 No error
- ./_site/2019/12/18/road-to-scala-3.html
- External link https://docs.scala-lang.org/scala3/reference/metaprogramming.html failed: 404 No error
- ./_site/2020/11/06/explicit-term-inference-in-scala-3.html
- External link https://docs.scala-lang.org/scala3/reference/contextual.html failed: 404 No error
Those are due to the new reference documentation (https://docs.scala-lang.org/scala3/reference). They would be fixed by scala/scala3#15118
Last night's run succeeded: https://github.com/scala/scala-lang/runs/6330776484
thanks all for the group effort here!