Robots.io is a Java library designed to make parsing a website's 'robots.txt' file easy.
The RobotsParser class provides all the functionality needed to use robots.io.
The Javadoc for Robots.io can be found here.
To parse the robots.txt file for Google with the User-Agent string "test":
RobotsParser robotsParser = new RobotsParser("test");
robotsParser.connect("http://google.com");
Alternatively, to parse with no User-Agent, simply leave the constructor blank.
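For example, a parser without a User-Agent can be created with the empty constructor:
RobotsParser robotsParser = new RobotsParser();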
You can also pass a domain with a path.
robotsParser.connect("http://google.com/example.htm"); //This would also be valid
Note: Domains can be passed to all methods either as a String or as a URL object.
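For instance, the same connection could be made with a java.net.URL object (a sketch based on the note above, with exception handling omitted as in the other examples):
URL url = new URL("http://google.com");
robotsParser.connect(url); //Equivalent to passing the domain as a String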
To check if a URL is allowed:
robotsParser.isAllowed("http://google.com/test"); //Returns true if allowed
Or, to get all the rules parsed from the file:
robotsParser.getDisallowedPaths(); //This will return an ArrayList of Strings
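The returned list can be used like any other ArrayList, for example:
for (String path : robotsParser.getDisallowedPaths()) {
    System.out.println(path); //Print each disallowed path parsed from robots.txt
}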
The parsed results are cached in the RobotsParser object until the connect()
method is called again, which overwrites the previously parsed data.
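For example, connecting to a second site discards the rules cached from the first:
robotsParser.connect("http://google.com");
// ... query the parsed rules ...
robotsParser.connect("http://example.com"); //The rules for google.com are no longer cached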
In the event that all access is denied, a RobotsDisallowedException
will be thrown.
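A minimal sketch of handling this, assuming the exception is thrown when connect() is called:
try {
    robotsParser.connect("http://google.com");
} catch (RobotsDisallowedException e) {
    //All access is denied by robots.txt, so treat every path as disallowed
}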
Domains passed to RobotsParser are normalised to always end in a forward slash. Disallowed paths returned will never begin with a forward slash. This is so that URLs can easily be constructed. For example:
robotsParser.getDomain() + robotsParser.getDisallowedPaths().get(0); // http://google.com/example.htm
Robots.io is distributed under the GPL.