decrypto-org/spider

No relative links captured

Closed this issue · 1 comment

Our current approach, which uses only the onion regex, does not match any relative URIs. I should add another regex that specifically matches URLs of the form href="/...". This should allow us to find local URLs. We should also pass the base URL to the extraction function, so that a complete URL can be built from each relative URI.
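A minimal sketch of what this could look like, not the project's actual code. The names `HREF_REGEX` and `extractLinks` are hypothetical, and the regex is an assumption about what "href-type" matching would cover. The resolution step relies on the standard WHATWG `URL` constructor available in Node.js, which resolves a relative reference against a base and leaves absolute URLs unchanged.

```javascript
// Hypothetical helper for illustration; name, signature, and regex are assumptions.
// Matches href="..." or href='...' and captures the attribute value in group 2.
const HREF_REGEX = /href\s*=\s*(["'])(.*?)\1/gi;

function extractLinks(html, baseUrl) {
  const links = [];
  for (const match of html.matchAll(HREF_REGEX)) {
    const raw = match[2].trim();
    // Skip empty values and fragment-only links, which point back at the same page.
    if (raw === "" || raw.startsWith("#")) continue;
    try {
      // Resolve relative references (e.g. "/about", "contact.html") against baseUrl.
      const resolved = new URL(raw, baseUrl);
      // Keep only HTTP(S) links; mailto:, javascript:, etc. are dropped.
      if (resolved.protocol === "http:" || resolved.protocol === "https:") {
        links.push(resolved.href);
      }
    } catch {
      // Values that cannot be parsed as a URL are ignored.
    }
  }
  return links;
}
```

Called with the URL of the page the HTML was fetched from, this returns absolute URLs for both root-relative (`/about`) and path-relative (`contact.html`) links. A regex over raw HTML will also match `href` attributes on tags other than `<a>`, so this is only a rough starting point.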

You're right. In addition, this will require parsing the <base /> HTML tag using cheerio. I don't know whether this tag is widely used across the darknet.
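A rough sketch of how the `<base>` tag could feed into the base URL, under stated assumptions: `effectiveBaseUrl` and `BASE_TAG_REGEX` are hypothetical names, and a regex stands in for the HTML parser here only to keep the example dependency-free. If a `<base href>` is present, relative links should be resolved against it instead of the page URL, and the `href` of `<base>` may itself be relative to the page URL.

```javascript
// Hypothetical helper for illustration; a real implementation would likely
// read the attribute through an HTML parser rather than a regex.
// Captures the href value of the first <base ...> tag in group 2.
const BASE_TAG_REGEX = /<base\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1/i;

function effectiveBaseUrl(html, pageUrl) {
  const match = BASE_TAG_REGEX.exec(html);
  // No <base href>: relative links resolve against the page's own URL.
  if (!match) return pageUrl;
  try {
    // The base href may be relative, so resolve it against the page URL first.
    return new URL(match[2].trim(), pageUrl).href;
  } catch {
    // An unparsable base href is ignored and the page URL is used instead.
    return pageUrl;
  }
}
```

With cheerio, the same value could be read with `cheerio.load(html)` followed by `$('base').attr('href')`, which returns `undefined` when the tag or attribute is missing. The result of this function would then be passed as the base URL to the extraction function.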