jamesturk/scrapelib

robotparser is wrong?

Closed this issue · 2 comments

The two most recent times I've tried to use scrapelib with its default of honoring robots.txt, I've gotten blocked, even though my reading of the robots.txt files says I shouldn't be.

I found this on SO: http://stackoverflow.com/questions/15344253/robotparser-doesnt-seem-to-parse-correctly and checked out reppy, and I do get conflicting answers for my specific case:

>>> url = 'https://en.wikipedia.org/wiki/List_of_United_States_counties_and_county-equivalents'
>>> user_agent = 'scrapelib 0.9.0 python-requests/1.2.3 CPython/2.7.5 Darwin/13.0.0'
>>> from reppy.cache import RobotsCache
>>> robots = RobotsCache()
>>> robots.allowed(url, user_agent)
True

>>> import robotparser
>>> parser = robotparser.RobotFileParser()
>>> parser.set_url('http://en.wikipedia.org/robots.txt')
>>> parser.read()
>>> parser.can_fetch(user_agent, url)
False
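
For what it's worth, one thing I haven't ruled out (just a guess on my part, not something I've verified): robotparser fetches robots.txt itself via urllib with the default Python user agent, and if that fetch is refused it falls back to disallowing everything. Feeding it the file contents directly would take the fetch out of the equation:

import urllib2
import robotparser

url = 'https://en.wikipedia.org/wiki/List_of_United_States_counties_and_county-equivalents'
user_agent = 'scrapelib 0.9.0 python-requests/1.2.3 CPython/2.7.5 Darwin/13.0.0'

# Fetch robots.txt ourselves, with our own user agent, so a failed or
# blocked fetch inside robotparser can't be the cause of a blanket disallow.
req = urllib2.Request('https://en.wikipedia.org/robots.txt',
                      headers={'User-Agent': user_agent})
body = urllib2.urlopen(req).read()

parser = robotparser.RobotFileParser()
parser.parse(body.splitlines())          # parse the text we fetched ourselves
print parser.can_fetch(user_agent, url)  # compare against the False above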

Do you think it's worth switching to reppy instead? It sounds like the robots.txt spec is just not very well articulated, but Wikipedia's robots.txt says specifically "Friendly, low-speed bots are welcome viewing article pages, but not dynamically-generated pages please" ...

Hm, thanks for pointing this out. I always disable robots.txt checking for my own purposes, so I'd never come across this.
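
For reference, disabling it just means turning the option off when constructing the scraper. The follow_robots keyword below is from memory for the 0.9 series, so double-check it against your version:

import scrapelib

# skip scrapelib's built-in robots.txt check entirely
s = scrapelib.Scraper(follow_robots=False)
page = s.urlopen('https://en.wikipedia.org/wiki/List_of_United_States_counties_and_county-equivalents')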

Looking into it. If there are competing implementations, maybe it makes sense to remove the feature (since I imagine most people who really care can do their own checking with their preferred lib).
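
Rough sketch of what I mean, reusing the reppy calls from your report with the built-in check turned off as above:

import scrapelib
from reppy.cache import RobotsCache

url = 'https://en.wikipedia.org/wiki/List_of_United_States_counties_and_county-equivalents'
user_agent = 'scrapelib 0.9.0 python-requests/1.2.3 CPython/2.7.5 Darwin/13.0.0'

robots = RobotsCache()
s = scrapelib.Scraper(follow_robots=False)  # robots handled below instead

# gate each fetch on whichever robots.txt library you trust
if robots.allowed(url, user_agent):
    page = s.urlopen(url)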

Decided the robots.txt feature is probably best handled by other libraries; 0.10 (still in development) will drop it.