/robots-txt-file

Models a robots.txt file

Primary LanguagePHPMIT LicenseMIT

robots-txt-file Build Status

Introduction

Overview

Handles robots.txt files:

  • parse a robots.txt file into a model
  • get directives for a user agent
  • check if a user agent is allowed to access a url path
  • extract sitemap URLs
  • programmatically create a model and cast to a string

Robots.txt file format refresher

Let's quickly go over the format of a robots.txt file so that you can understand what you can get out of a \webignition\RobotsTxt\File\File object.

A robots.txt file contains a collection of records. A record provides a set of directives to a specified user agent. A directive instructs a user agent to do something (or not do something). A blank line is used to separate records.

Here's an example with two records:

User-agent: Slurp
Disallow: /

User-Agent: *
Disallow: /private

This instructs the user agent 'Slurp' that it is not allowed access to '/' (i.e. the whole site), and this instructs all other user agents that they are not allowed access to '/private'.

A robots.txt file can optionally contain directives that apply to all user agents irrespective of the specified records. These are included as a set of a directives that are not part of a record. A common use is the sitemap directive.

Here's an example with directives that apply to everyone and everything:

User-agent: Slurp
Disallow: /

User-Agent: *
Disallow: /private

Sitemap: http://example.com/sitemap.xml

Usage

Parsing a robots.txt file from a string into a model

<?php
use webignition\RobotsTxt\File\Parser;

$parser = new Parser();
$parser->setSource(file_get_contents('http://example.com/robots.txt'));

$robotsTxtFile = $parser->getFile();
 
// Get an array of records
$robotsTxtFile->getRecords();

// Get the list of record-independent directives (such as sitemap directives):
$robotsTxtFile->getNonGroupDirectives()->get();

This might not be too useful on it's own. You'd normally be retrieving information from a robots.txt file because you are a crawler and need to know what you are allowed to access (or disallowed) or because you're a tool or service that needs to locate a site's sitemap.xml file.

Inspecting a model to get directives for a user agent

Let's say we're the 'Slurp' user agent and we want to know what's been specified for us:

<?php
use webignition\RobotsTxt\File\Parser;
use webignition\RobotsTxt\Inspector\Inspector;

$parser = new Parser();
$parser->setSource(file_get_contents('http://example.com/robots.txt'));

$inspector = new Inspector($parser->getFile());
$inspector->setUserAgent('slurp');

$slurpDirectiveList = $inspector->getDirectives();

Ok, now we have a DirectiveList containing a collection of directives. We can call $directiveList->get() to get the directives applicable to us.

This raw set of directives is available in the model because it is there in the source robots.txt file. Often this raw data isn't immediately useful as-is. Maybe we want to inspect it further?

Check if a user agent is allowed to access a url path

That's more like it, let's inspect some of that data in the model.

<?php
use webignition\RobotsTxt\File\Parser;
use webignition\RobotsTxt\Inspector\Inspector;

$parser = new Parser();
$parser->setSource(file_get_contents('http://example.com/robots.txt'));

$inspector = new Inspector($parser->getFile());
$inspector->setUserAgent('slurp');

if ($inspector->isAllowed('/foo')) {
    // Do whatever is needed access to /foo is allowed
}

Extract sitemap URLs

A robots.txt file can list the URLs of all relevant sitemaps. These directives are not specific to a user agent.

Let's say we're an automated web frontend testing service and we need to find a site's sitemap.xml to find a list of URLs that need testing. We know the site's domain and we know where to look for the robots.txt file and we know that this might specify the location of the sitemap.xml file.

<?php
use webignition\RobotsTxt\File\Parser;

$parser = new Parser();
$parser->setSource(file_get_contents('http://example.com/robots.txt'));

$robotsTxtFile = $parser->getFile();

$sitemapDirectives = $robotsTxtFile->getNonGroupDirectives()->getByField('sitemap');
$sitemapUrl = (string)$sitemapDirectives->first()->getValue();

Cool, we've found the URL for the first sitemap listed in the robots.txt file. There may be many, although just the one is most common.

Filtering directives for a user agent to a specific field type

Let's get all the disallow directives for Slurp:

<?php
use webignition\RobotsTxt\File\Parser;
use webignition\RobotsTxt\Inspector\Inspector;

$parser = new Parser();
$parser->setSource(file_get_contents('http://example.com/robots.txt'));

$robotsTxtFile = $parser->getFile();

$inspector = new Inspector($robotsTxtFile);
$inspector->setUserAgent('slurp');

$slurpDisallowDirectiveList = $inspector->getDirectives()->getByField('disallow');

Building

Using as a library in a project

If used as a dependency by another project, update that project's composer.json and update your dependencies.

"require": {
    "webignition/robots-txt-file": "*"      
}

This will get you the latest version. Check the list of releases for specific versions.

Developing

This project has external dependencies managed with composer. Get and install this first.

# Make a suitable project directory
mkdir ~/robots-txt-file && cd ~/robots-txt-file

# Clone repository
git clone git@github.com:webignition/robots-txt-file.git.

# Retrieve/update dependencies
composer update

# Run code sniffer and unit tests
composer cs
composer test

Testing

Have look at the project on travis for the latest build status, or give the tests a go yourself.

cd ~/robots-txt-file
composer test