sjdirect/abot

Robots.txt is ignored when the URI scheme changes (http/https)

Closed this issue · 1 comment

Hello,

We found Abot a few days ago and have been trying its free version to see if it can meet our needs.

Everything worked fine until we noticed that it crawls URLs that are disallowed in robots.txt.

After some debugging, we found that it binds robots.txt to the URI scheme of the initial site passed to the crawler, e.g. for the site https://mysite.com the disallow rules are applied only to https URLs. If there is a link to http://mysite.com/somepage, Abot ignores robots.txt and crawls it.

* Assuming we have the following robots.txt:
User-agent: *
Disallow: /somepage

* From what I can see, RobotsDotText.IsUrlAllowed() uses Uri.IsBaseOf(), which returns false when the scheme is different (a minimal repro follows).
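Here is a small repro with plain System.Uri, using the example site and page from above (the output comments reflect the behavior we observe):

```csharp
using System;

class IsBaseOfRepro
{
    static void Main()
    {
        // Robots.txt gets bound to the scheme of the site passed to the crawler (https here).
        var robotsBase = new Uri("https://mysite.com/");

        // Same scheme: IsBaseOf returns True, so the Disallow rule is applied.
        Console.WriteLine(robotsBase.IsBaseOf(new Uri("https://mysite.com/somepage")));

        // Different scheme: IsBaseOf returns False, so the rule is skipped
        // and the page is crawled despite the Disallow entry.
        Console.WriteLine(robotsBase.IsBaseOf(new Uri("http://mysite.com/somepage")));
    }
}
```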

Could you help us figure out how to deal with this issue?
Thank you
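For reference, a possible caller-side workaround until this is fixed (a sketch only; it assumes the crawl starts from the https root of mysite.com and uses Abot's ShouldCrawlPage decision delegate as shown in the README) is to reject the plain-http variant of the site, so every crawled URL keeps the scheme that robots.txt was loaded for:

```csharp
using System;
using Abot.Crawler;
using Abot.Poco;

class SchemeWorkaround
{
    static void Main()
    {
        var crawler = new PoliteWebCrawler();

        crawler.ShouldCrawlPage((pageToCrawl, crawlContext) =>
        {
            // Refuse http links to our host; only the https variant (the one
            // the robots.txt rules are bound to) will be crawled.
            if (pageToCrawl.Uri.Scheme == Uri.UriSchemeHttp &&
                pageToCrawl.Uri.Host.Equals("mysite.com", StringComparison.OrdinalIgnoreCase))
            {
                return new CrawlDecision { Allow = false, Reason = "http variant of an https-only site" };
            }

            return new CrawlDecision { Allow = true };
        });

        crawler.Crawl(new Uri("https://mysite.com/"));
    }
}
```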

Hi, I unfortunately cannot look into this issue anytime in the near future. The nrobots project source code is here...

https://github.com/sjdirect/nrobots

If you submit a clean pull request with tests (and, of course, the change appears to be the correct approach), I can accept the PR and update Abot to use the new version.
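For anyone picking this up, one possible direction for such a PR (a sketch only, not actual nrobots code; the helper name is hypothetical) is to normalize the candidate URL to the scheme of the robots.txt base before calling Uri.IsBaseOf:

```csharp
using System;

static class RobotsUriMatching
{
    // Hypothetical helper: rebuild 'candidate' with the scheme of 'robotsBase'
    // so http and https variants of the same page compare against the same base.
    public static Uri NormalizeScheme(Uri robotsBase, Uri candidate)
    {
        var builder = new UriBuilder(candidate) { Scheme = robotsBase.Scheme };

        // Drop an implicit default port (80/443) so the rebuilt URI picks up
        // the default port of the new scheme instead of keeping the old one.
        if (candidate.IsDefaultPort)
            builder.Port = -1;

        return builder.Uri;
    }
}
```

With something like that in place, IsUrlAllowed() could test robotsBase.IsBaseOf(NormalizeScheme(robotsBase, candidate)) instead of robotsBase.IsBaseOf(candidate), which would make the http://mysite.com/somepage link from the report match the Disallow rule again.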