robots.io

Robots.txt parsing library


Robots.io is a Java library designed to make parsing a website's 'robots.txt' file easy.

How to use

The RobotsParser class provides all the functionality needed to use robots.io.

The Javadoc for Robots.io can be found here.

Examples

Connecting

To parse the robots.txt for Google with the User-Agent string "test":

RobotsParser robotsParser = new RobotsParser("test");
robotsParser.connect("http://google.com");

Alternatively, to parse with no User-Agent, simply leave the constructor blank.
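For example, a parser with no User-Agent could be created like this (a sketch, assuming the no-argument constructor implied above):

RobotsParser robotsParser = new RobotsParser(); //No User-Agent string supplied
robotsParser.connect("http://google.com");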

You can also pass a domain with a path.

robotsParser.connect("http://google.com/example.htm"); //This would also be valid

Note: Domains can either be passed in string form or as a URL object to all methods.
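As a sketch of the URL form described in the note above, a java.net.URL object should be accepted in place of a string:

import java.net.URL;

URL url = new URL("http://google.com");
robotsParser.connect(url); //Behaves the same as passing the domain as a string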

Querying

To check if a URL is allowed:

robotsParser.isAllowed("http://google.com/test"); //Returns true if allowed

Or, to get all the rules parsed from the file:

robotsParser.getDisallowedPaths(); //This will return an ArrayList of Strings

The parsed results are cached in the RobotsParser object until the connect() method is called again, at which point the previously parsed data is overwritten.
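For instance, based on the caching behaviour described above, a second call to connect() would replace the rules parsed from the first domain (a sketch; the domains are illustrative):

robotsParser.connect("http://google.com");
robotsParser.getDisallowedPaths(); //Rules parsed from google.com
robotsParser.connect("http://example.com");
robotsParser.getDisallowedPaths(); //Now returns the rules for example.com instead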

Politeness

In the event that all access is denied, a RobotsDisallowedException will be thrown.
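A typical call site might handle this around connect(); this is a minimal sketch, assuming RobotsDisallowedException is thrown while parsing:

try {
    robotsParser.connect("http://google.com");
} catch (RobotsDisallowedException e) {
    //All access is denied for this User-Agent; back off politely
}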

URL Normalisation

Domains passed to RobotsParser are normalised to always end in a forward slash. Disallowed paths returned will never begin with a forward slash. This is so that URLs can easily be constructed. For example:

robotsParser.getDomain() + robotsParser.getDisallowedPaths().get(0); // http://google.com/example.htm

Licensing

Robots.io is distributed under the GNU General Public License v3.0 (GPL-3.0).