Tarantula

Another PHP crawler based on Guzzle.

Tarantula is a web crawler written in PHP. It utilizes the amazing work of the people behind Guzzle and Symfony's DomCrawler.

Installation

Global tool

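Make sure Composer's global bin directory is in your $PATH (typically ~/.composer/vendor/bin, or ~/.composer/bin if your bin-dir is configured that way), then execute: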

composer global require mihaeu/tarantula:1.*
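If the tarantula command is not found after the install, the global bin directory is probably missing from $PATH. The following is a minimal sketch for checking that, not part of Tarantula itself. It assumes a POSIX shell; the helper name in_path and the fallback directory are illustrative choices.

```shell
# Return success if the given directory is listed in $PATH.
in_path() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Ask Composer for its global bin directory; fall back to a common default
# (an assumption) when Composer is not installed or the query fails.
bin_dir="$(composer global config bin-dir --absolute 2>/dev/null || true)"
bin_dir="${bin_dir:-$HOME/.composer/vendor/bin}"

if in_path "$bin_dir"; then
    echo "ok: $bin_dir is in PATH"
else
    echo "missing: add 'export PATH=\"\$PATH:$bin_dir\"' to your shell profile"
fi
```

Wrapping $PATH in colons before matching avoids false positives on directories that merely share a prefix with the one being checked.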

Library

Assuming you are using Composer, add the following to your composer.json file:

{
    "require": {
        "mihaeu/tarantula": "1.*"
    }
}

Alternatively, run composer require mihaeu/tarantula:1.* from the command line.

Usage

Global tool

Right now the only command available is crawl. Some usage examples would be:

# most basic use case
tarantula crawl "http://google.com"

# go deeper
tarantula crawl "http://products.com/categories" --depth=4

# mirror
tarantula crawl "http://myblog.com" --mirror=/tmp/blog-backup

# filters
tarantula crawl "http://myblog.com" --contains=yolo
tarantula crawl "http://myblog.com" --regex="(post)\|(\d+)"

# dump crawled pages into hashed files
tarantula crawl "http://myblog.com" --save-hashed=/tmp/blog-backup --minify-html

# HTTP basic auth
tarantula crawl "http://secure.com" --user=admin --password=admin

# search for "Avatar" on imdb
bin/tarantula crawl "http://www.imdb.com/find?q=avatar&s=all" --depth=0 --quiet --css=".findSection td.result_text"

# today's weather in seattle
bin/tarantula crawl --depth=0 "http://www.weather.com/weather/today/Seattle+WA+USWA0395:1:US" --css=".wx-first" | head -n 2

For all arguments and options use the help command:

tarantula help                    # displays all available commands
tarantula help crawl              # all arguments and options for the crawler
tarantula crawl "..." --verbose   # switch on debugging output

Library

Have a look at the tests to see what's possible or just try the following in your code:

use Mihaeu\Tarantula\Crawler;
use Mihaeu\Tarantula\HttpClient;

$crawler = new Crawler(new HttpClient('http://google.com'));
$links = $crawler->go(1);

All HTTP requests go through Guzzle, and any configuration accepted by Guzzle's request object can also be passed to Tarantula's HttpClient.

Tests

Test coverage is not at 100%. This started as an afternoon project, and testing a crawler takes a lot of time because of the test setup it requires.

If you want to get a quick overview of the project, I recommend running the test suite with the --testdox flag:

vendor/bin/phpunit --testdox

To Do

  • filters (url, filetype, etc.)
  • allow for Guzzle to be configured via command line
  • more actions (save plain result, crawl via DOM/XPath, ...)

Troubleshooting

Composer global install fails

This is most likely due to a conflict with the requirements of other global installs. Unfortunately, Composer's architecture doesn't offer a solution for this yet. I tried to keep Tarantula's requirements loose to avoid this problem.

If you want to have Tarantula available throughout your system, just install to another directory (e.g. using composer create-project) and symlink bin/tarantula into a folder in your $PATH.
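The workaround above could look roughly like the sketch below. The directories ~/opt/tarantula and ~/bin and the helper name link_tarantula are hypothetical; pick whatever fits your system. The actual install lines are commented out so the snippet can be read and adapted before anything is downloaded.

```shell
# Hypothetical locations; adjust to taste.
install_dir="$HOME/opt/tarantula"
link_dir="$HOME/bin"   # must be listed in $PATH

# Symlink an installed project's bin/tarantula into a directory on $PATH.
# $1 = project directory, $2 = directory that will hold the symlink.
link_tarantula() {
    src="$1/bin/tarantula"
    [ -f "$src" ] || { echo "not found: $src" >&2; return 1; }
    mkdir -p "$2" && ln -sf "$src" "$2/tarantula"
}

# Uncomment to perform the actual install:
# composer create-project mihaeu/tarantula "$install_dir" "1.*"
# link_tarantula "$install_dir" "$link_dir"
```

Because the project lives in its own directory with its own vendor folder, its dependencies cannot clash with those of other globally installed Composer packages.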

Thanks to

License

MIT, see LICENSE file.