aim42/htmlSanityCheck

Bug: Amazon flaky results for unknown URLs


Amazon seems to respond differently to unknown URLs depending on various request parameters.
Currently I run into test errors with the test case BrokenHttpLinksCheckerSpec:bad amazon link is identified as problem.
It seems to work in GitHub Actions but fails on my local machine, both when running the single test from the IDE (IntelliJ) and in a full gradlew test run.

I could track it down to the following behaviour:

  • When executed locally, Amazon returns status 200 together with a captcha challenge. The test case expects a 503, which the HSC checker reports as a finding, so with a 200 no finding is reported and the test fails.
  • When executed in GitHub Actions it seems to work as expected, returning a 503 (unfortunately we do not yet have any logging of the results available).

Locally I could further change the behaviour of Amazon by setting the User-Agent header of the request.
This can be reproduced with curl alone:

  • curl -X HEAD -v https://www.amazon.com/dp/4242424242 uses curl's default User-Agent (curl/8.4.0 in my case) and returns a 503 (the same holds true for GET requests)
  • Using curl with the default HSC User-Agent header "Mozilla/5.0 (X11; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0": curl -H "User-Agent: Mozilla/5.0 (X11; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0" -X GET -v https://www.amazon.com/dp/4242424242 returns a status 200 and a captcha request
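
For anyone who wants to repeat the comparison outside of curl, here is a minimal Python sketch of the same experiment. It is not HSC code: the helper names `build_request` and `fetch_status` are made up for illustration, and the status codes Amazon actually returns may differ by machine, network, and time.

```python
import urllib.error
import urllib.request

# URL and User-Agent strings are taken from the curl commands in this issue.
URL = "https://www.amazon.com/dp/4242424242"
BROWSER_UA = "Mozilla/5.0 (X11; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0"
CURL_UA = "curl/8.4.0"


def build_request(url: str, user_agent: str, method: str = "GET") -> urllib.request.Request:
    """Build (but do not send) a request carrying the given User-Agent header."""
    return urllib.request.Request(url, method=method, headers={"User-Agent": user_agent})


def fetch_status(request: urllib.request.Request, timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code, including 4xx/5xx."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises for 4xx/5xx responses; the status code is still available.
        return error.code
```

Calling `fetch_status(build_request(URL, CURL_UA))` and `fetch_status(build_request(URL, BROWSER_UA))` one after the other should show whether the two User-Agent values are treated differently from a given machine.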

Cf. bug-316.zip

Perhaps this is similar to the behaviour we see in #219?

I suggest setting the User-Agent header to something HSC-specific (e.g., hsc/version).
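
As a rough illustration of that suggestion, the following Python sketch builds such a header value and attaches it to a request. The function names, the `hsc/<version>` format check, and the version `2.0.0` used in the usage note are assumptions for illustration only, not existing HSC code. Whether Amazon answers an HSC-specific User-Agent with a 503 has not been verified here.

```python
import re
import urllib.request


def hsc_user_agent(version: str) -> str:
    """Return a product token such as 'hsc/<version>' (format is an assumption)."""
    version = version.strip()
    # Keep the token free of spaces and separators so it stays a single product token.
    if not re.fullmatch(r"[A-Za-z0-9.+_-]+", version):
        raise ValueError(f"not a valid version token: {version!r}")
    return f"hsc/{version}"


def request_with_hsc_user_agent(url: str, version: str, method: str = "HEAD") -> urllib.request.Request:
    """Build (but do not send) a request that identifies itself as HSC."""
    return urllib.request.Request(
        url,
        method=method,
        headers={"User-Agent": hsc_user_agent(version)},
    )
```

For example, `request_with_hsc_user_agent("https://www.amazon.com/dp/4242424242", "2.0.0")` yields a HEAD request whose User-Agent is `hsc/2.0.0`.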

For whatever reason the problem mostly occurs locally (but occasionally also during GitHub Actions builds).