Tarantula is a web crawler written in PHP. It utilizes the amazing work of the people behind Guzzle and Symfony's DomCrawler.
Make sure ~/.composer/bin is in your $PATH and then simply execute:
composer global require mihaeu/tarantula:1.*
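Whether the directory is already on the $PATH can be checked before installing. The following is a minimal sketch, assuming a POSIX shell; `ensure_on_path` is a hypothetical helper name, not part of Tarantula or Composer. The directory is the one named above, and since its location can differ between setups, `composer global config bin-dir --absolute` should print the one Composer actually uses on a given machine.

```shell
# Hypothetical helper: append a directory to PATH unless it is already listed.
ensure_on_path() {
  case ":$PATH:" in
    *":$1:"*) ;;                         # already present, nothing to do
    *) PATH="$PATH:$1"; export PATH ;;   # append and export for child processes
  esac
}

# Directory as named in this README; adjust it if Composer reports another one.
ensure_on_path "$HOME/.composer/bin"
```

To make the change permanent, the same lines could go into the shell's startup file (e.g. ~/.profile).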
Assuming you are using Composer, add the following to your composer.json file:
{
    "require": {
        "mihaeu/tarantula": "1.*"
    }
}
or use Composer's CLI tool: composer require mihaeu/tarantula:1.*
Right now the only command available is crawl. Some usage examples:
# most basic use case
tarantula crawl "http://google.com"
# go deeper
tarantula crawl "http://products.com/categories" --depth=4
# mirror
tarantula crawl "http://myblog.com" --mirror=/tmp/blog-backup
# filters
tarantula crawl "http://myblog.com" --contains=yolo
tarantula crawl "http://myblog.com" --regex="(post)\|(\d+)"
# dump crawled files into hashed files
tarantula crawl "http://myblog.com" --save-hashed=/tmp/blog-backup --minify-html
# HTTP basic auth
tarantula crawl "http://secure.com" --user=admin --password=admin
# search for "Avatar" on imdb
bin/tarantula crawl "http://www.imdb.com/find?q=avatar&s=all" --depth=0 --quiet --css=".findSection td.result_text"
# today's weather in Seattle
bin/tarantula crawl --depth=0 "http://www.weather.com/weather/today/Seattle+WA+USWA0395:1:US" --css=".wx-first" | head -n 2
For all arguments and options use the help command:
tarantula help # displays all available commands
tarantula help crawl # all arguments and options for the crawler
tarantula crawl "..." --verbose # switch on debugging output
Have a look at the tests to see what's possible or just try the following in your code:
use Mihaeu\Tarantula\Crawler;
use Mihaeu\Tarantula\HttpClient;
$crawler = new Crawler(new HttpClient('http://google.com'));
$links = $crawler->go(1);
All HTTP requests go through Guzzle, and any configuration accepted by Guzzle's request object can also be passed to Tarantula's HttpClient.
Test coverage is not at 100% because this was an afternoon project and testing a crawler takes a lot of time due to the testing setup.
If you want to get a quick overview of the project, I recommend running the test suite with the --testdox flag:
vendor/bin/phpunit --testdox
- filters (url, filetype, etc.)
- allow for Guzzle to be configured via command line
- more actions (save plain result, crawl via DOM/XPath, ...)
If the global installation fails, this is most likely due to a conflict with the requirements of other global installs. Unfortunately Composer's architecture doesn't offer a solution for this yet. I tried to keep Tarantula's requirements loose to avoid this problem.
If you want to have Tarantula available throughout your system, just install it to another directory (e.g. using composer create-project) and symlink bin/tarantula into a folder in your $PATH.
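The workaround above could look roughly like the sketch below. `link_into_path` is a hypothetical helper, and both `$HOME/tools/tarantula` and `$HOME/bin` are assumed example directories; any install location and any folder on your $PATH should work the same way.

```shell
# Hypothetical helper: symlink an executable into a directory on the PATH.
# $1: path to the executable, $2: target directory (created if missing)
link_into_path() {
  mkdir -p "$2" && ln -sf "$1" "$2/$(basename "$1")"
}

# Example usage, commented out because it downloads the project:
# composer create-project mihaeu/tarantula "$HOME/tools/tarantula"
# link_into_path "$HOME/tools/tarantula/bin/tarantula" "$HOME/bin"
```

Pass an absolute path as the first argument, otherwise the symlink would be resolved relative to the target directory and could end up dangling.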
- Symfony/SensioLabs and especially Fabien Potencier for what he does for PHP (for this particular project the DomCrawler)
- the Guzzle team for their awesome HTTP client
- Aha Soft for the logo
- the Composer team for revolutionizing the way I and many others write PHP
- GitHub for redefining collaboration
- Travis CI for improving the quality and compatibility of thousands of open source projects
- Sebastian Bergmann for PHPUnit and many other awesome QA tools
MIT, see LICENSE file.