vezaynk/Sitemap-Generator-Crawler

HTML base element

Thyra opened this issue · 1 comment

Thyra commented

I found another thing that has to be considered when crawling a website: the HTML base element. It changes the base address against which relative hrefs are resolved.

This is an interesting case of which I was not aware.

This line currently uses the parent URL to resolve relative URLs. A simple regex that attempts to extract the base URL should be easy enough.
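A rough sketch of that idea, written in Python for illustration rather than as the project's actual code. The names `BASE_HREF_RE`, `extract_base_href`, and `resolve_link` are hypothetical, and the regex is an assumption that only handles a quoted `href` on the first `<base>` tag; a real HTML parser would be more robust.

```python
import re
from urllib.parse import urljoin

# Hypothetical pattern: captures the quoted href value of the first <base> tag.
# It does not handle unquoted attribute values or <base> tags inside comments.
BASE_HREF_RE = re.compile(
    r'<base\b[^>]*?\bhref\s*=\s*["\']([^"\']*)["\']',
    re.IGNORECASE,
)


def extract_base_href(html):
    """Return the href of the first <base> tag, or None if there is none."""
    match = BASE_HREF_RE.search(html)
    return match.group(1) if match else None


def resolve_link(page_url, html, href):
    """Resolve href against the <base> href if present, else the page URL."""
    base = extract_base_href(html) or page_url
    return urljoin(base, href)
```

With a page at `https://example.com/a/b.html` that declares `<base href="https://example.com/docs/">`, a link to `page.html` would resolve under `/docs/` instead of `/a/`.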

But like with all things that seem easy, we get a bunch of edge cases!

"Absolute and relative URLs are allowed."

I can't fathom why someone would use a relative URL for this. I will probably handle the absolute case first and open a new issue for the relative one afterwards.
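For reference, the relative case can probably be covered by resolving in two steps: first the base href against the page's own URL, then the link against the result. A minimal Python sketch of that approach, assuming the base href has already been extracted; `effective_base` and `resolve` are hypothetical names, not part of the crawler.

```python
from urllib.parse import urljoin


def effective_base(page_url, base_href):
    """Return the URL that relative links should be resolved against.

    An absolute base_href replaces the page URL outright; a relative one
    is itself resolved against the page URL first.
    """
    if not base_href:
        return page_url
    return urljoin(page_url, base_href)


def resolve(page_url, base_href, href):
    """Resolve href, honouring an optional (absolute or relative) base href."""
    return urljoin(effective_base(page_url, base_href), href)
```

Because `urljoin` returns an absolute second argument unchanged, the same two-step call handles both the absolute and the relative case without a separate branch.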