JustinBeckwith/linkinator

Relative urls and redirects issue


For example, the URL https://www.sberbank.ru/ru/person/seizure redirects to https://www.sberbank.ru/seizure, and the page it lands on contains relative URLs such as ./1142.

If we crawl /seizure directly, all of these URLs are OK. But when the scan starts at /ru/person/seizure, every relative URL is incorrectly resolved against the pre-redirect URL (for example /ru/person/seizure/1142) and is marked as broken.
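To illustrate the mismatch, here is a minimal sketch that resolves the same relative href against both URLs with the standard WHATWG `URL` constructor. `resolveLink` is a hypothetical helper, not linkinator's code, and the exact paths produced depend on whether the base URL ends with a slash. The point is only that the two bases give different results.

```javascript
// URLs taken from the example above.
const requestedUrl = 'https://www.sberbank.ru/ru/person/seizure'; // before the redirect
const finalUrl = 'https://www.sberbank.ru/seizure'; // after the redirect
const href = './1142';

// Hypothetical helper: resolve a link found in the page against a base URL.
function resolveLink(link, baseUrl) {
  return new URL(link, baseUrl).href;
}

const fromRequested = resolveLink(href, requestedUrl);
const fromFinal = resolveLink(href, finalUrl);

console.log(fromRequested); // https://www.sberbank.ru/ru/person/1142
console.log(fromFinal); // https://www.sberbank.ru/1142
```

Only the second result matches what a browser would request, because the browser uses the post-redirect URL as the document base.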

Also, I think the <base href="..."> tag is not taken into account when URLs are built.
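A rough sketch of how a `<base href>` value could be folded into link resolution. `effectiveBase` is a hypothetical name, the example.com URLs are made up, and extracting the href from the HTML is assumed to happen elsewhere. This is not how linkinator currently works.

```javascript
// Hypothetical helper: pick the base URL that links on a page should be
// resolved against. baseHref is the value of <base href="...">, if any.
function effectiveBase(pageUrl, baseHref) {
  if (!baseHref) {
    return pageUrl; // no <base> tag: the page URL is the base
  }
  try {
    // <base href> may itself be relative, so resolve it against the page URL.
    return new URL(baseHref, pageUrl).href;
  } catch (err) {
    return pageUrl; // unparsable href: fall back to the page URL
  }
}

const withBase = effectiveBase('https://example.com/docs/page', '/assets/');
const withoutBase = effectiveBase('https://example.com/docs/page', undefined);
const resolved = new URL('img.png', withBase).href;

console.log(withBase); // https://example.com/assets/
console.log(withoutBase); // https://example.com/docs/page
console.log(resolved); // https://example.com/assets/img.png
```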

This cannot be done without changes in gaxios (referenced PR). If the real page URL were available in the response, this bug could be fixed by changing opts.url to res.request.responseURL in index.js:149.
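A sketch of what that change might look like, assuming the response exposes `res.request.responseURL` once the gaxios change lands. The objects below are plain mocks rather than real gaxios responses, and `pageUrlFor` is a hypothetical helper.

```javascript
// Hypothetical helper: prefer the post-redirect URL when the response
// carries one, otherwise fall back to the URL that was requested.
function pageUrlFor(opts, res) {
  const responseUrl = res && res.request && res.request.responseURL;
  return responseUrl || opts.url;
}

// Mock objects for illustration only.
const opts = { url: 'https://www.sberbank.ru/ru/person/seizure' };
const redirectedRes = { request: { responseURL: 'https://www.sberbank.ru/seizure' } };
const plainRes = { request: {} };

const afterRedirect = pageUrlFor(opts, redirectedRes);
const noRedirectInfo = pageUrlFor(opts, plainRes);

console.log(afterRedirect); // https://www.sberbank.ru/seizure
console.log(noRedirectInfo); // https://www.sberbank.ru/ru/person/seizure
```

Falling back to `opts.url` keeps the current behaviour whenever the final URL is not available.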

This could also become a separate feature: the crawler's JSON result could include information about which page links are redirects. There are many cases where that would be useful:

  • HTTP links to sites that have fully upgraded to HTTPS
  • links without www.
  • redirects that now lead to a different page than before
  • and others
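As a sketch of what such redirect information might look like, the function below compares the requested and final URLs and labels the redirect. The field names (`redirected`, `finalUrl`, `kind`) and the labels are invented for illustration and are not part of linkinator's output format.

```javascript
// Hypothetical helper: describe how a link was redirected.
function describeRedirect(requestedUrl, finalUrl) {
  const from = new URL(requestedUrl);
  const to = new URL(finalUrl);
  if (from.href === to.href) {
    return { redirected: false, url: from.href };
  }

  const samePath = from.pathname === to.pathname && from.search === to.search;
  let kind = 'other'; // e.g. the link now leads to a different page

  if (samePath && from.host === to.host &&
      from.protocol === 'http:' && to.protocol === 'https:') {
    kind = 'https-upgrade'; // same page, http -> https
  } else if (samePath && from.protocol === to.protocol &&
      (to.host === 'www.' + from.host || from.host === 'www.' + to.host)) {
    kind = 'www-change'; // same page, www. prefix added or removed
  }

  return { redirected: true, url: from.href, finalUrl: to.href, kind };
}

const upgrade = describeRedirect('http://example.com/a', 'https://example.com/a');
const www = describeRedirect('https://example.com/', 'https://www.example.com/');
const moved = describeRedirect(
  'https://www.sberbank.ru/ru/person/seizure',
  'https://www.sberbank.ru/seizure'
);

console.log(upgrade.kind); // https-upgrade
console.log(www.kind); // www-change
console.log(moved.kind); // other
```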