jamesturk/scrapelib

robotparser is wrong?

Closed this issue · 2 comments

The two most recent times I've tried to use scrapelib with its default of honoring robots.txt, I've gotten blocked, even though my reading of the robots.txt files says I shouldn't be.

I found this on SO: http://stackoverflow.com/questions/15344253/robotparser-doesnt-seem-to-parse-correctly and checked out reppy, and I do get conflicting answers for my specific case:

>>> url = 'https://en.wikipedia.org/wiki/List_of_United_States_counties_and_county-equivalents'
>>> user_agent = 'scrapelib 0.9.0 python-requests/1.2.3 CPython/2.7.5 Darwin/13.0.0'
>>> from reppy.cache import RobotsCache
>>> robots = RobotsCache()
>>> robots.allowed(url, user_agent)
True

>>> import robotparser
>>> parser = robotparser.RobotFileParser()
>>> parser.set_url('http://en.wikipedia.org/robots.txt')
>>> parser.read()
>>> parser.can_fetch(user_agent, url)
False
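
For what it's worth, one thing I haven't ruled out (just a guess on my part, not something I've verified): robotparser fetches robots.txt itself via urllib with the default Python user agent, and if that fetch is refused it falls back to disallowing everything. Feeding it the file contents directly would take the fetch out of the equation:

import urllib2
import robotparser

url = 'https://en.wikipedia.org/wiki/List_of_United_States_counties_and_county-equivalents'
user_agent = 'scrapelib 0.9.0 python-requests/1.2.3 CPython/2.7.5 Darwin/13.0.0'

# Fetch robots.txt ourselves, with our own user agent, so a failed or
# blocked fetch inside robotparser can't be the cause of a blanket disallow.
req = urllib2.Request('https://en.wikipedia.org/robots.txt',
                      headers={'User-Agent': user_agent})
body = urllib2.urlopen(req).read()

parser = robotparser.RobotFileParser()
parser.parse(body.splitlines())          # parse the text we fetched ourselves
print parser.can_fetch(user_agent, url)  # compare against the False above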

Do you think it's worth switching to reppy instead? It sounds like the robots.txt spec is just not very well articulated, but Wikipedia's robots.txt says specifically "Friendly, low-speed bots are welcome viewing article pages, but not dynamically-generated pages please" ...

Hm, thanks for pointing this out. I always disable robots.txt checking for my own purposes, so I'd never come across this.
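
For reference, disabling it just means turning the option off when constructing the scraper. The follow_robots keyword below is from memory for the 0.9 series, so double-check it against your version:

import scrapelib

# skip scrapelib's built-in robots.txt check entirely
s = scrapelib.Scraper(follow_robots=False)
page = s.urlopen('https://en.wikipedia.org/wiki/List_of_United_States_counties_and_county-equivalents')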

Looking into it. If there are competing implementations, maybe it makes sense to remove the feature (since I imagine most people who really care can do their own checking with their preferred lib).
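
Rough sketch of what I mean, reusing the reppy calls from your report with the built-in check turned off as above:

import scrapelib
from reppy.cache import RobotsCache

url = 'https://en.wikipedia.org/wiki/List_of_United_States_counties_and_county-equivalents'
user_agent = 'scrapelib 0.9.0 python-requests/1.2.3 CPython/2.7.5 Darwin/13.0.0'

robots = RobotsCache()
s = scrapelib.Scraper(follow_robots=False)  # robots handled below instead

# gate each fetch on whichever robots.txt library you trust
if robots.allowed(url, user_agent):
    page = s.urlopen(url)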

Decided the robots.txt feature is probably best handled by other libraries; 0.10 (still in development) will drop it.