robots-txt-scanner
##################
This scanner is primarily built for validating robots.txt as part of Minion's basic scan.

It's easy: just ``pip install robots-scanner`` (see robots-scanner on PyPI).

Usage:
text = """
# COMMENT GOES HERE
User-agent: Google # inline comments
User-agent: Yahoo
Disallow: /admin
Disallow: /tmp/
Allow: /index.html
User-agent: Yahoo
# Some comments here
Disallow: /yahoo/
User-agent: Bing
Disallow: /bing/
"""
import pprint
pprint.pprint(
tokens_to_ast(
text_to_tokens(text)), indent=2)
r = Robotstxt()
r.parse(text)
pprint.pprint(r._tree)
And the output looks like this:
{ 'errors': [{ 14: 'Disallow rule accepts only relative path'}],
  'records': [ { 'agents': ['Google', 'Yahoo'],
                 'rules': { 'allow': ['/index.html'],
                            'disallow': ['/admin', '/tmp/']}},
               { 'agents': ['Yahoo'], 'rules': { 'disallow': ['/yahoo/']}},
               { 'agents': ['Bing'],
                 'rules': { 'disallow': ['/bing/', 'http://domain.org/bad']}}]}
{'Bing': {'disallow': ['/bing/', 'http://domain.org/bad']},
 'Google': {'allow': ['/index.html'], 'disallow': ['/admin', '/tmp/']},
 'Yahoo': {'allow': ['/index.html'],
           'disallow': ['/admin', '/tmp/', '/yahoo/']}}
The first dictionary is the AST of the robots.txt file, and the second is a per-agent tree that the user can query. See ``Robotstxt`` in ``parser.py`` for the public API.
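For example, a small helper like the one below (not part of the package; it only assumes the per-agent tree layout shown above) could answer whether a path is blocked for a given agent:

# Hypothetical helper, not part of robots-scanner: it relies only on the
# {agent: {'allow': [...], 'disallow': [...]}} layout of r._tree shown above.
def is_disallowed(tree, agent, path):
    rules = tree.get(agent, {})
    # In this sketch an Allow prefix match wins over a Disallow match.
    if any(path.startswith(prefix) for prefix in rules.get('allow', [])):
        return False
    return any(path.startswith(prefix) for prefix in rules.get('disallow', []))

print(is_disallowed(r._tree, 'Yahoo', '/yahoo/page.html'))   # True
print(is_disallowed(r._tree, 'Google', '/index.html'))       # False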
- Rewrite the entire scanner with a custom lexer and token.
- Support custom extensions added to the lexer and parser.
- Added support for Visit-time, Crawl-delay and Request-rate directives (see the sketch below).
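As a rough illustration (the directive values below are made up, and this README does not show how these extensions appear in the AST or query tree), a robots.txt using them could be fed through the same entry points as the example above:

ext_text = """
User-agent: *
Crawl-delay: 10          # wait 10 seconds between requests
Request-rate: 1/5        # one document every 5 seconds
Visit-time: 0600-0845    # crawl only between 06:00 and 08:45
Disallow: /private/
"""

pprint.pprint(tokens_to_ast(text_to_tokens(ext_text)), indent=2)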