A flexible spider in PHP.
A spider contains many processors called pipes. You can pass as many tasks as you like to the spider; each task goes through these pipes and gets processed.
composer require ddliu/spider
- PHP 5.3+
- curl (for RequestPipe)
See composer.json.
use ddliu\spider\Spider;
use ddliu\spider\Pipe\NormalizeUrlPipe;
use ddliu\spider\Pipe\RequestPipe;
use ddliu\spider\Pipe\DomCrawlerPipe;
(new Spider())
    ->pipe(new NormalizeUrlPipe())
    ->pipe(new RequestPipe())
    ->pipe(new DomCrawlerPipe())
    ->pipe(function($spider, $task) {
        // extract every link on the page and fork it as a sub task
        $task['$dom']->filter('a')->each(function($a) use ($task) {
            $href = $a->attr('href');
            $task->fork($href);
        });
    })
    // the entry task
    ->addTask('http://example.com')
    ->run()
    ->report();
Find more examples in the examples folder.
The Spider class.

Options:

- limit: maximum number of tasks to run (see the sketch after the method list)
Methods:

- pipe($pipe): add a pipe
- addTask($task): add a task
- run(): run the spider
- report(): write a report to the log
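A minimal sketch of putting these together. How the limit option is passed is an assumption here (as an array to the constructor), not something this README confirms:

use ddliu\spider\Spider;

// Assumption: options such as 'limit' are passed to the constructor
// as an array; 'limit' caps the total number of tasks to run.
$spider = new Spider(array('limit' => 100));

$spider
    ->pipe(function($spider, $task) {
        // process the task...
    })
    ->addTask('http://example.com')
    ->run()
    ->report();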
A task contains the data array and some helper functions. The Task class implements the ArrayAccess interface, so you can access its data like an array.
- fork($task): add a sub task to the spider
- ignore(): ignore the task
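For example, a function pipe can ignore tasks it does not care about and fork follow-up tasks (a sketch; the URL filter and the 'next' data key are illustrative, not part of the library):

use ddliu\spider\Spider;

(new Spider())
    ->pipe(function($spider, $task) {
        // drop tasks whose URL is not HTTP(S)
        if (strpos($task['url'], 'http') !== 0) {
            $task->ignore();
            return;
        }
        // queue a sub task when a follow-up URL is known
        // (hypothetical 'next' key, read via ArrayAccess)
        if (isset($task['next'])) {
            $task->fork($task['next']);
        }
    })
    ->addTask('http://example.com')
    ->run();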
Pipes define how each task is processed.
A pipe can be a function:
function($spider, $task) {}
Or it can extend BasePipe:

use ddliu\spider\Pipe\BasePipe;

class MyPipe extends BasePipe {
    public function run($spider, $task) {
        // process the task...
    }
}
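A custom pipe is then registered like any built-in one:

use ddliu\spider\Spider;

(new Spider())
    ->pipe(new MyPipe())
    ->addTask('http://example.com')
    ->run();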
Normalize $task['url'].
new NormalizeUrlPipe()
Make an HTTP request to $task['url'] and save the result in $task['content'].
new RequestPipe(array(
    'useragent' => 'myspider',
    'timeout' => 10
));
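A pipe placed after RequestPipe can then read the fetched body; a small sketch:

use ddliu\spider\Spider;
use ddliu\spider\Pipe\RequestPipe;

(new Spider())
    ->pipe(new RequestPipe(array(
        'useragent' => 'myspider',
        'timeout' => 10
    )))
    ->pipe(function($spider, $task) {
        // $task['content'] now holds the HTTP response body
        if ($task['content'] === '') {
            $task->ignore(); // drop empty responses
        }
    })
    ->addTask('http://example.com')
    ->run();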
Cache a pipe (e.g. RequestPipe).
$requestPipe = new RequestPipe();
$cacheForReqPipe = new FileCachePipe($requestPipe, [
    'input' => 'url',
    'output' => 'content',
    'root' => '/path/to/cache/root',
]);
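Presumably the cached wrapper then takes the place of the raw RequestPipe in the pipeline:

use ddliu\spider\Spider;
use ddliu\spider\Pipe\NormalizeUrlPipe;
use ddliu\spider\Pipe\DomCrawlerPipe;

(new Spider())
    ->pipe(new NormalizeUrlPipe())
    ->pipe($cacheForReqPipe) // reads $task['url'], writes $task['content']
    ->pipe(new DomCrawlerPipe())
    ->addTask('http://example.com')
    ->run();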
Retry on failure.
$requestPipe = new RequestPipe();
$retryForReqPipe = new RetryPipe($requestPipe, [
    'count' => 10,
]);
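Since the wrappers are themselves pipes, nesting them should work too, for example caching a retried request. This composition is an assumption, not confirmed by this README:

// Assumption: wrapper pipes can be nested.
$pipe = new FileCachePipe(
    new RetryPipe(new RequestPipe(), [
        'count' => 10,
    ]),
    [
        'input' => 'url',
        'output' => 'content',
        'root' => '/path/to/cache/root',
    ]
);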
Create a DomCrawler from $task['content']. Access it with $task['$dom'] in subsequent pipes.
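For example, a later pipe can use it to pull out the page title (the crawler follows Symfony's DomCrawler API, as in the usage example above):

use ddliu\spider\Spider;
use ddliu\spider\Pipe\RequestPipe;
use ddliu\spider\Pipe\DomCrawlerPipe;

(new Spider())
    ->pipe(new RequestPipe())
    ->pipe(new DomCrawlerPipe())
    ->pipe(function($spider, $task) {
        // grab the page title via the crawler
        $title = $task['$dom']->filter('title')->text();
        $spider->logger->info('Title: ' . $title);
    })
    ->addTask('http://example.com')
    ->run();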
Report every 10 minutes.
new ReportPipe(array(
    'seconds' => 600
))
$spider->logger is an instance of Monolog\Logger. You can add logging handlers to it before starting the spider:
use Monolog\Logger;
use Monolog\Handler\StreamHandler;

$spider->logger->pushHandler(new StreamHandler('path/to/your.log', Logger::WARNING));
- Real-world examples.
- Running tasks concurrently (with pthreads?).
Use the golang version for better performance!