HTML base element
Thyra opened this issue · 1 comment
Thyra commented
I found another thing that has to be considered when crawling a website: the HTML base element. It changes the base URL that relative hrefs are resolved against.
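To illustrate the effect, here is a minimal Python sketch using the standard library's `urljoin`. The URLs are hypothetical examples, not taken from the crawler or from any real site.

```python
from urllib.parse import urljoin

# Hypothetical page and base values, for illustration only.
page_url = "https://example.com/blog/post.html"
base_href = "https://example.com/assets/"  # value of <base href="...">
href = "img/logo.png"                      # a relative link on the page

# Without a <base> element, the link resolves against the page URL.
without_base = urljoin(page_url, href)

# With a <base> element, the link resolves against its href instead.
with_base = urljoin(base_href, href)

print(without_base)  # https://example.com/blog/img/logo.png
print(with_base)     # https://example.com/assets/img/logo.png
```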
vezaynk commented
This is an interesting case of which I was not aware.
This line currently uses the parent URL to resolve relative URLs. A simple regex that attempts to extract the base URL should be easy enough.
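A rough sketch of what such a regex could look like, written in Python for illustration rather than in the crawler's own code. The pattern and the `extract_base_href` helper are assumptions of this sketch, and a regex will miss cases that a real HTML parser handles (for example a `<base>` tag inside a comment, or an attribute such as `data-href`).

```python
import re
from typing import Optional

# Matches the href of the first <base> tag, whether the value is
# double-quoted, single-quoted, or unquoted. Hypothetical pattern.
BASE_HREF_RE = re.compile(
    r"""<base\b[^>]*?\bhref\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))""",
    re.IGNORECASE,
)

def extract_base_href(html: str) -> Optional[str]:
    """Return the href of the first <base> tag, or None if there is none."""
    match = BASE_HREF_RE.search(html)
    if match is None:
        return None
    # Exactly one of the three capture groups holds the value.
    return next(g for g in match.groups() if g is not None)
```

For example, `extract_base_href('<head><base href="https://example.com/dir/"></head>')` returns `"https://example.com/dir/"`, and a document without a `<base>` tag yields `None`.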
But like with all things that seem easy, we get a bunch of edge cases!
"Absolute and relative URLs are allowed."
I can't fathom why someone would use a relative URL for this. I will probably handle the absolute case first and open a new issue for the relative one after.
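For what it's worth, the relative case may not need much extra work: one way to cover both is to resolve the base href against the page URL first, then resolve links against the result. An absolute base href passes through that first step unchanged. The sketch below is in Python with a hypothetical `resolve_link` helper and made-up URLs; it is not the crawler's actual code.

```python
from typing import Optional
from urllib.parse import urljoin

def resolve_link(page_url: str, base_href: Optional[str], href: str) -> str:
    """Resolve href, honouring an optional <base href> value.

    Hypothetical helper: base_href may be absolute, relative, or None.
    """
    # A relative base href is itself resolved against the page URL;
    # an absolute one is returned unchanged by urljoin.
    effective_base = urljoin(page_url, base_href) if base_href else page_url
    return urljoin(effective_base, href)

page = "https://example.com/blog/post.html"
print(resolve_link(page, "https://cdn.example.com/files/", "a.html"))
# https://cdn.example.com/files/a.html
print(resolve_link(page, "../docs/", "a.html"))
# https://example.com/docs/a.html
print(resolve_link(page, None, "a.html"))
# https://example.com/blog/a.html
```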