seomoz/rep-cpp

Path allowed despite Disallow for *

petermeissner opened this issue · 3 comments

Hey,

I have written an R-based robots.txt parser (https://github.com/ropenscilabs/robotstxt). @hrbrmstr wrapped this library and suggested using it for a big speedup (https://github.com/hrbrmstr/spiderbar).

Related issue: hrbrmstr/spiderbar#2

Now I have run my test cases against my implementation and against the one wrapping rep-cpp, and found a divergence which I think is a bug on your side. Consider the following robots.txt file:

User-agent: UniversalRobot/1.0
User-agent: mein-Robot
Disallow: /quellen/dtd/

User-agent: *
Disallow: /unsinn/
Disallow: /temp/
Disallow: /newsticker.shtml

In this example some directories are forbidden for all robots, e.g. /temp/, but when rep-cpp is used for the permission check the path is reported as allowed for the bot mein-Robot, which I am quite sure should not be the case. (rep-cpp is used for those function calls where check_method="spiderbar".)

library(robotstxt)

rtxt <- "# robots.txt zu http://www.example.org/\n\nUser-agent: UniversalRobot/1.0\nUser-agent: mein-Robot\nDisallow: /quellen/dtd/\n\nUser-agent: *\nDisallow: /unsinn/\nDisallow: /temp/\nDisallow: /newsticker.shtml"

paths_allowed(
  paths          = "/temp/some_file.txt", 
  robotstxt_list = list(rtxt), 
  check_method   = "robotstxt",
  bot            = "*"
)
#> [1] FALSE

paths_allowed(
  paths          = "/temp/some_file.txt", 
  robotstxt_list = list(rtxt), 
  check_method   = "spiderbar",
  bot            = "*"
)
#> [1] FALSE

paths_allowed(
  paths          = "/temp/some_file.txt", 
  robotstxt_list = list(rtxt), 
  check_method   = "robotstxt",
  bot            = "mein-Robot"
)
#> [1] FALSE

paths_allowed(
  paths          = "/temp/some_file.txt", 
  robotstxt_list = list(rtxt), 
  check_method   = "spiderbar",
  bot            = "mein-Robot"
)
#> [1] TRUE

Thanks for writing in!

The section about User-Agent in the original RFC says:

These name tokens are used in User-agent lines in /robots.txt to identify to which specific robots the record applies. The robot must obey the first record in /robots.txt that contains a User-Agent line whose value contains the name token of the robot as a substring. The name comparisons are case-insensitive. If no such record exists, it should obey the first record with a User-agent line with a "*" value, if present. If no record satisfied either condition, or no records are present at all, access is unlimited.

That's far from decisive, but my interpretation is that the current behavior is correct - it follows the most specific stanza for the particular bot. It's pretty common for a robots.txt to group bots together with a set of rules, or to repeat that set of rules explicitly for each of a number of agents.
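To make that reading concrete, here is a minimal R sketch of the selection rule as the quoted text describes it. This is not rep-cpp's actual code; the select_record name and the record structure are made up purely for illustration.

# Sketch only: pick the single record a bot must obey, per the quoted RFC text.
# `records` is assumed to be a list of lists, each with an `agents` character
# vector and a `disallow` character vector (hypothetical structure).
select_record <- function(records, bot) {
  bot <- tolower(bot)
  # 1. first record with a User-agent value containing the bot's name token
  #    (case-insensitive substring match)
  for (rec in records) {
    if (any(grepl(bot, tolower(rec$agents), fixed = TRUE))) return(rec)
  }
  # 2. otherwise the first record with a User-agent: * line
  for (rec in records) {
    if ("*" %in% rec$agents) return(rec)
  }
  # 3. otherwise no record applies and access is unlimited
  NULL
}

Built from the robots.txt in the report, this rule would hand mein-Robot the first record (the one that names it), so only /quellen/dtd/ would be disallowed for that bot - which is what rep-cpp reports.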

While Google's documentation is not exactly a standard, it has this to say on the matter:

The start-of-group element user-agent is used to specify for which crawler the group is valid. Only one group of records is valid for a particular crawler. We will cover order of precedence later in this document.

That same document goes on to describe the order of precedence they use, but it seems they would arrive at the same interpretation as the current implementation.

All that said, one of the weaknesses of the Robots Exclusion Protocol (REP) is that there isn't one clear answer; it is mostly set forth by the original RFC and then by mass adoption and convention (mostly driven by Google).

It's also entirely possible that I've made a mistake :-)

That's not how robots.txt files work. It's a common misunderstanding that * applies to all bots. It does not. It only applies to bots that are not matched by any other section. Yes, this means you must repeat rules if you declare specific sections for different bots. I didn't write the robots.txt specification, but this is the letter of it: a robot only has to look at one section of rules, specifically the first section that it matches.

Specifically, look at the example from the RFC under Section 4:

      # /robots.txt for http://www.fict.org/
      # comments to webmaster@fict.org

      User-agent: unhipbot
      Disallow: /

      User-agent: webcrawler
      User-agent: excite
      Disallow: 

      User-agent: *
      Disallow: /org/plans.html
      Allow: /org/
      Allow: /serv
      Allow: /~mak
      Disallow: /

The following matrix shows which robots are allowed to access URLs:

                                               unhipbot webcrawler other
                                                        & excite
     http://www.fict.org/                         No       Yes       No
     http://www.fict.org/index.html               No       Yes       No
     http://www.fict.org/robots.txt               Yes      Yes       Yes
     http://www.fict.org/server.html              No       Yes       Yes
     http://www.fict.org/services/fast.html       No       Yes       Yes
     http://www.fict.org/services/slow.html       No       Yes       Yes
     http://www.fict.org/orgo.gif                 No       Yes       No
     http://www.fict.org/org/about.html           No       Yes       Yes
     http://www.fict.org/org/plans.html           No       Yes       No
     http://www.fict.org/%7Ejim/jim.html          No       Yes       No
     http://www.fict.org/%7Emak/mak.html          No       Yes       Yes

Notice specifically that webcrawler and excite are allowed to crawl http://www.fict.org/org/about.html: they obey only the record that names them, whose Disallow is empty, so the rules in the * record never apply to them.
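To tie this back to the robots.txt from the report, here is a toy R walk-through of the same one-record rule. The data structure is hypothetical and this is not rep-cpp or spiderbar code, just an illustration of the logic.

# Toy illustration only: the two records from the robots.txt in the report.
records <- list(
  list(agents   = c("UniversalRobot/1.0", "mein-Robot"),
       disallow = "/quellen/dtd/"),
  list(agents   = "*",
       disallow = c("/unsinn/", "/temp/", "/newsticker.shtml"))
)

# mein-Robot is named in the first record, so that is the only record it obeys;
# the * record (and its Disallow: /temp/) is never consulted for this bot.
rec <- records[[1]]

# Check whether the path falls under any Disallow prefix of that record.
path <- "/temp/some_file.txt"
any(startsWith(path, rec$disallow))
#> [1] FALSE

No Disallow prefix of the first record matches, so the path is allowed for mein-Robot - consistent with the TRUE that spiderbar returns above.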

Hey,

thank you all for taking the time to correct me, explain the behavior, and point me to the relevant sources - this is very much appreciated and helps me a lot.

👍