Path allowed despite Disallow for *
petermeissner opened this issue · 3 comments
Hey,
I have written an R-based robots.txt parser (https://github.com/ropenscilabs/robotstxt). @hrbrmstr wrapped this library in spiderbar (https://github.com/hrbrmstr/spiderbar) and suggested using it for a significant speedup.
Related issue: hrbrmstr/spiderbar#2
Now I have run my test cases against my implementation and against the one wrapping rep-cpp, and found a divergence which I think is a bug on your side. Consider the following robots.txt file:
User-agent: UniversalRobot/1.0
User-agent: mein-Robot
Disallow: /quellen/dtd/
User-agent: *
Disallow: /unsinn/
Disallow: /temp/
Disallow: /newsticker.shtml
In the example some directories are forbidden for all robots, e.g. /temp/, but when using rep-cpp for permission checking, the path is reported as allowed for the bot mein-Robot, which I am quite sure should not be the case. (rep-cpp is used for those function calls where check_method = "spiderbar".)
library(robotstxt)

rtxt <- "# robots.txt zu http://www.example.org/\n\nUser-agent: UniversalRobot/1.0\nUser-agent: mein-Robot\nDisallow: /quellen/dtd/\n\nUser-agent: *\nDisallow: /unsinn/\nDisallow: /temp/\nDisallow: /newsticker.shtml"

# both implementations agree for the wildcard agent
paths_allowed(
  paths          = "/temp/some_file.txt",
  robotstxt_list = list(rtxt),
  check_method   = "robotstxt",
  bot            = "*"
)
#> [1] FALSE

paths_allowed(
  paths          = "/temp/some_file.txt",
  robotstxt_list = list(rtxt),
  check_method   = "spiderbar",
  bot            = "*"
)
#> [1] FALSE

# but they diverge for mein-Robot
paths_allowed(
  paths          = "/temp/some_file.txt",
  robotstxt_list = list(rtxt),
  check_method   = "robotstxt",
  bot            = "mein-Robot"
)
#> [1] FALSE

paths_allowed(
  paths          = "/temp/some_file.txt",
  robotstxt_list = list(rtxt),
  check_method   = "spiderbar",
  bot            = "mein-Robot"
)
#> [1] TRUE
Thanks for writing in!
The section about User-Agent in the original RFC says:
These name tokens are used in User-agent lines in /robots.txt to identify to which specific robots the record applies. The robot must obey the first record in /robots.txt that contains a User-Agent line whose value contains the name token of the robot as a substring. The name comparisons are case-insensitive. If no such record exists, it should obey the first record with a User-agent line with a "*" value, if present. If no record satisfied either condition, or no records are present at all, access is unlimited.
That's far from decisive, but my interpretation of that is that the current behavior is correct: it follows the most specific stanza for the particular bot. It's pretty common for a robots.txt to group bots together with a set of rules, or to repeat that set of rules explicitly for each of a number of agents.
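To make that reading concrete, here is a minimal sketch in R of the record-selection rule as the quoted text describes it. The helper name select_record and the record structure (a list per record with $agents and $disallow character vectors) are assumptions for illustration only, not part of rep-cpp's or robotstxt's API:

# Sketch only: pick the record a given bot must obey, per the RFC wording.
select_record <- function(records, bot) {
  bot <- tolower(bot)  # name comparisons are case-insensitive
  # first record with a User-agent value containing the bot's name token
  for (rec in records) {
    if (any(grepl(bot, tolower(rec$agents), fixed = TRUE))) return(rec)
  }
  # otherwise the first record with a "*" User-agent line, if present
  for (rec in records) {
    if ("*" %in% rec$agents) return(rec)
  }
  NULL  # no record applies: access is unlimited
}

Under that rule, mein-Robot matches the first record of the file from this issue, so only Disallow: /quellen/dtd/ applies to it and the * record is never consulted.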
While Google is not exactly a standard, it has this to say on the matter:
The start-of-group element user-agent is used to specify for which crawler the group is valid. Only one group of records is valid for a particular crawler. We will cover order of precedence later in this document.
That same document talks about the precedence order that they use, but it seems that they would have the same interpretation as the current implementation.
All that said, one of the weaknesses of REP is that there isn't one clear answer; it is set forth by the original RFC and then by mass adoption and convention (largely driven by Google).
It's also entirely possible that I've made a mistake :-)
That's not how robots.txt files work. It's a common misunderstanding that * applies to all bots. It does not. It only applies to bots that are not matched by other sections. Yes, this means you must repeat rules if you declare specific sections for different bots. I didn't write the robots.txt specification, but this is the letter of the specification. The specification for robots.txt makes it clear that robots only have to look at one section of rules: specifically, the first section that they match.
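As a rough sketch of that one-section reading (reusing the hypothetical record structure from the select_record sketch above, and treating Disallow values as plain path prefixes; real parsers also have to handle Allow lines and wildcards):

# Sketch only: check one path against the single record the bot obeys.
is_path_allowed <- function(record, path) {
  if (is.null(record)) return(TRUE)  # no matching record: access is unlimited
  # empty Disallow values disallow nothing, so drop them
  rules <- record$disallow[nzchar(record$disallow)]
  # a path is disallowed if any remaining Disallow value is a prefix of it
  !any(startsWith(path, rules))
}

Applied to the file from this issue, mein-Robot selects the first record, whose only rule is Disallow: /quellen/dtd/, so /temp/some_file.txt comes back allowed, which matches the TRUE that spiderbar returns.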
Specifically, look at the example from the RFC under Section 4:
# /robots.txt for http://www.fict.org/
# comments to webmaster@fict.org

User-agent: unhipbot
Disallow: /

User-agent: webcrawler
User-agent: excite
Disallow:

User-agent: *
Disallow: /org/plans.html
Allow: /org/
Allow: /serv
Allow: /~mak
Disallow: /
The following matrix shows which robots are allowed to access URLs:
URL                                      unhipbot  webcrawler & excite  other
http://www.fict.org/                     No        Yes                  No
http://www.fict.org/index.html           No        Yes                  No
http://www.fict.org/robots.txt           Yes       Yes                  Yes
http://www.fict.org/server.html          No        Yes                  Yes
http://www.fict.org/services/fast.html   No        Yes                  Yes
http://www.fict.org/services/slow.html   No        Yes                  Yes
http://www.fict.org/orgo.gif             No        Yes                  No
http://www.fict.org/org/about.html       No        Yes                  Yes
http://www.fict.org/org/plans.html       No        Yes                  No
http://www.fict.org/%7Ejim/jim.html      No        Yes                  No
http://www.fict.org/%7Emak/mak.html      No        Yes                  Yes
Specifically notice that webcrawler & excite are allowed to crawl http://www.fict.org/org/about.html.
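For what it's worth, the toy helpers sketched above reproduce the unhipbot and webcrawler & excite cells of that row; the "other" column needs Allow-line handling, which the sketch deliberately omits:

# records transcribed from the RFC example above
rec_unhipbot <- list(agents = "unhipbot", disallow = "/")
rec_crawlers <- list(agents = c("webcrawler", "excite"), disallow = "")

is_path_allowed(rec_unhipbot, "/org/about.html")
#> [1] FALSE
is_path_allowed(rec_crawlers, "/org/about.html")
#> [1] TRUE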
Hey,
thank you all for taking the time to correct me, explain the behavior, and point me to the relevant sources - this is very much appreciated and helps me a lot.
👍