DanMcInerney/xsscrapy

scope creep - crawling beyond the target site into other sites on the domain

mavensecurity opened this issue · 2 comments

It seems like this is a feature: give it -u http://a.example.com, and if there is a link to http://b.example.com then xsscrapy follows and tests it. But IMO that is a big mistake (as a default setting). I want to test QA, not production, and sometimes (often) a QA site has links to production. So I try to scan JUST http://qa.example.com and xsscrapy ends up going to http://www.example.com. Now I've just sent traffic to production. Not an ideal situation.

My fix:
Edit xsscrapy/spiders/xss_spiders.py to modify self.allowed_domains to be:
self.allowed_domains = [hostname]
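
For context, here is a minimal sketch of what that change could look like in a Scrapy CrawlSpider. This is not xsscrapy's actual code; the spider name and the url keyword argument are assumptions for illustration.

from urllib.parse import urlparse

from scrapy.spiders import CrawlSpider


class XSSSpiderSketch(CrawlSpider):
    name = 'xsscrapy_sketch'

    def __init__(self, url=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [url]
        # Restrict the crawl to the exact host that was passed in
        # (e.g. qa.example.com), so links to www.example.com get
        # filtered out as offsite requests.
        hostname = urlparse(url).hostname
        self.allowed_domains = [hostname]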

You know, I've been a bit confused by that as well. The docs seem to indicate that simply setting the hostname as the allowed domain, like example.com, should restrict the crawl to ONLY example.com, but I have found that is not the case. If I do what you did with self.allowed_domains = [hostname], it still crawls subdomains. If I want to exclude subdomains, it seems like the only way is to set a rule:

rules = (Rule(LinkExtractor(deny=(r'\.domain\.com',)), callback='parse_resp', follow=True),)

Which works, but the rules are defined at class level, before __init__ runs, so I haven't figured out how to build the domain into the rule from a script argument yet.
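
One possible workaround, sketched here as an assumption rather than taken from xsscrapy: CrawlSpider compiles self.rules inside its own __init__, so a rule built from a runtime argument can be assigned before the parent __init__ is called. The url keyword argument and spider name are again hypothetical.

import re
from urllib.parse import urlparse

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class XSSSpiderSketch(CrawlSpider):
    name = 'xsscrapy_sketch'

    def __init__(self, url=None, *args, **kwargs):
        hostname = urlparse(url).hostname  # e.g. 'example.com'
        # Deny links to any subdomain of the target host while still
        # crawling the host itself.
        self.rules = (
            Rule(LinkExtractor(deny=(r'\.' + re.escape(hostname),)),
                 callback='parse_resp', follow=True),
        )
        # CrawlSpider.__init__ compiles self.rules, so the rules above
        # must be set before this call.
        super().__init__(*args, **kwargs)
        self.start_urls = [url]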

But you're saying that if you just use [hostname], then at least when you specify a subdomain it will stay within that subdomain. I see now. Yes, that makes sense. I updated it.
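
A quick way to sanity-check that matching behavior is Scrapy's url_is_from_any_domain utility, which follows the same domain-matching convention as the offsite filter (a bare domain also matches its subdomains); the URLs below are just illustrative.

from scrapy.utils.url import url_is_from_any_domain

# A bare domain in allowed_domains also matches every subdomain...
print(url_is_from_any_domain('http://www.example.com/', ['example.com']))     # True
print(url_is_from_any_domain('http://qa.example.com/', ['example.com']))      # True
# ...but a specific subdomain only matches itself (and its own subdomains).
print(url_is_from_any_domain('http://www.example.com/', ['qa.example.com']))  # False
print(url_is_from_any_domain('http://qa.example.com/', ['qa.example.com']))   # True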