rbt_prs

A robots.txt parser written in PHP, useful for checking whether or not your
bot has access to a given URL. It should probably be cleaned up in the
future™, since it's a little on the messy side.


/// copyleft 2012 meklu (public domain)

rbt_prs is a robots.txt parser written in PHP and intended for checking
whether or not your bot can access a given URL.

Currently it provides just one function:
	isUrlBotSafe($url, $your_useragent, $robots_txt, $redirects, $debug)
Setting any of the last four parameters to NULL will make them use their
default values.
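
For instance, a minimal call relying on the defaults could look like this
(a sketch; the boolean return value is an assumption based on the
function's name):
	/* assumed: returns TRUE when the bot may access the URL */
	if (isUrlBotSafe("http://example.com/secret/")) {
		echo "OK to fetch\n";
	}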

The first argument, $url, is the only required one and specifies the URL you
would like to check.

The second argument, $your_useragent, is the user agent string your bot will
use. While supplying one isn't mandatory, it is highly advised. To mention
this software in your user agent string, you could invoke the function as
	isUrlBotSafe("http://example.com/", "Fancybot/0.1 (awesome) " .
		     RBT_PRS_UA);
Notice the trailing space after the description; the constant RBT_PRS_UA
doesn't include one.

The third argument, $robots_txt, is useful when you want to go through
several URLs with just a single download of robots.txt. It also lets you
apply rules of your own to a given URL. You could utilise the argument like
this:
	$myrules = file_get_contents("http://example.com/robots.txt");
	isUrlBotSafe("http://example.com/", "Fancybot/0.1", $myrules);
	isUrlBotSafe("http://example.com/contact", "Fancybot/0.1", $myrules);
	isUrlBotSafe("http://example.com/blog", "Fancybot/0.1", $myrules);
	isUrlBotSafe("http://example.com/about", "Fancybot/0.1", $myrules);

The fourth argument, $redirects, is only relevant when you let the function
download robots.txt itself. It lets the function follow HTTP redirects
instead of failing outright when grabbing the file. Values of 1 or less
disable redirects, FALSE leaves the default behaviour in place and TRUE sets
the limit to 20. Here's how to achieve a similar effect when fetching the
file yourself and passing it in as $robots_txt:
	$opts = array('http' => array('method' => 'GET',
				      'max_redirects' => 20)
		     );
	$context = stream_context_create($opts);
	$myrules = file_get_contents("http://example.com/robots.txt", FALSE,
				     $context);
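
If you let the function handle the download instead, passing the limit
directly should amount to the same thing (a sketch; NULL for $robots_txt
keeps the built-in download, as described above):
	isUrlBotSafe("http://example.com/", "Fancybot/0.1", NULL, 20);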

The fifth argument, $debug, will make the function verbose, having it print
all sorts of useful analysis of the rules.
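
To enable it, pass a truthy value as the last parameter (a sketch; TRUE is
assumed to turn the debug output on):
	isUrlBotSafe("http://example.com/", "Fancybot/0.1", NULL, NULL,
		     TRUE);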

This piece of software is already in real-world use at
http://cheesetalks.twolofbees.com/humble/ and was originally written for
that site's author, simply because it was the correct way of doing things.