Crawl links on a website
This package provides a class to crawl links on a website. Under the hood, Guzzle promises are used to crawl multiple urls concurrently.
Because the crawler can execute JavaScript, it can crawl JavaScript-rendered sites. Under the hood, headless Chrome is used to power this feature.
Spatie is a webdesign agency in Antwerp, Belgium. You'll find an overview of all our open source projects on our website.
Postcardware
You're free to use this package (it's MIT-licensed), but if it makes it to your production environment we highly appreciate you sending us a postcard from your hometown, mentioning which of our package(s) you are using.
Our address is: Spatie, Samberstraat 69D, 2060 Antwerp, Belgium.
All postcards are published on our website.
Installation
This package can be installed via Composer:
composer require spatie/crawler
Usage
The crawler can be instantiated like this:
Crawler::create()
->setCrawlObserver(<implementation of \Spatie\Crawler\CrawlObserver>)
->startCrawling($url);
The argument passed to setCrawlObserver must be an object that implements the \Spatie\Crawler\CrawlObserver interface:
/**
* Called when the crawler will crawl the given url.
*
* @param \Spatie\Crawler\Url $url
*/
public function willCrawl(Url $url);
/**
* Called when the crawler has crawled the given url.
*
* @param \Spatie\Crawler\Url $url
* @param \Psr\Http\Message\ResponseInterface $response
* @param \Spatie\Crawler\Url $foundOn
*/
public function hasBeenCrawled(Url $url, ResponseInterface $response, Url $foundOn);
/**
* Called when the crawl has ended.
*/
public function finishedCrawling();
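To make this concrete, here is a minimal sketch of an observer that logs every crawled url. The class name LoggingCrawlObserver is just illustrative, and the string casts assume \Spatie\Crawler\Url can be cast to a string:
use Psr\Http\Message\ResponseInterface;
use Spatie\Crawler\CrawlObserver;
use Spatie\Crawler\Url;

class LoggingCrawlObserver implements CrawlObserver
{
    public function willCrawl(Url $url)
    {
        // Called before the crawler requests the url.
        echo 'About to crawl: ' . (string) $url . PHP_EOL;
    }

    public function hasBeenCrawled(Url $url, ResponseInterface $response, Url $foundOn)
    {
        // Log the crawled url together with the response status code.
        echo 'Crawled: ' . (string) $url . ' (' . $response->getStatusCode() . ')' . PHP_EOL;
    }

    public function finishedCrawling()
    {
        echo 'Crawl finished' . PHP_EOL;
    }
}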
Executing JavaScript
By default the crawler will not execute JavaScript. This is how you can enable the execution of JavaScript:
Crawler::create()
->executeJavaScript()
...
Under the hood, headless Chrome is used to execute JavaScript. Here are some pointers on how to install it on your system.
The package will make an educated guess as to where Chrome is installed on your system. You can also manually pass the location of the Chrome binary to executeJavaScript():
Crawler::create()
->executeJavaScript($pathToChrome)
...
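Putting this together, a crawl with JavaScript execution enabled might look like the sketch below. The Chrome path is only an example (adjust it for your system), and LoggingCrawlObserver is the illustrative observer from above:
Crawler::create()
    ->executeJavaScript('/usr/bin/google-chrome') // example path; adjust for your system
    ->setCrawlObserver(new LoggingCrawlObserver())
    ->startCrawling('https://example.com');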
Filtering certain urls
You can tell the crawler not to visit certain urls by using the setCrawlProfile function. That function expects an object that implements the Spatie\Crawler\CrawlProfile interface:
/*
* Determine if the given url should be crawled.
*/
public function shouldCrawl(Url $url): bool;
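As an illustration, a hypothetical profile that skips a site's admin pages could be implemented like this (the class name and the "/admin" rule are just examples):
use Spatie\Crawler\CrawlProfile;
use Spatie\Crawler\Url;

class IgnoreAdminUrls implements CrawlProfile
{
    /*
     * Determine if the given url should be crawled.
     */
    public function shouldCrawl(Url $url): bool
    {
        // Crawl everything except urls that contain "/admin".
        return strpos((string) $url, '/admin') === false;
    }
}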
This package comes with two CrawlProfiles out of the box:
- CrawlAllUrls: this profile will crawl all urls on all pages, including urls to external sites.
- CrawlInternalUrls: this profile will only crawl the internal urls on the pages of a host.
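For example, to limit a crawl to a single host you could combine the built-in profile with an observer. This sketch assumes CrawlInternalUrls takes the host's base url in its constructor, and reuses the illustrative LoggingCrawlObserver from above:
Crawler::create()
    ->setCrawlObserver(new LoggingCrawlObserver())
    ->setCrawlProfile(new CrawlInternalUrls('https://example.com'))
    ->startCrawling('https://example.com');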
Setting the number of concurrent requests
To improve the speed of the crawl, the package concurrently crawls 10 urls by default. If you want to change that number you can use the setConcurrency method.
Crawler::create()
->setCrawlObserver(<implementation of \Spatie\Crawler\CrawlObserver>)
->setConcurrency(1) // now all urls will be crawled one by one
->startCrawling($url);
Changelog
Please see CHANGELOG for more information on what has changed recently.
Contributing
Please see CONTRIBUTING for details.
Testing
To run the tests, you'll first have to start the included Node-based server in a separate terminal window.
cd tests/server
./start_server.sh
With the server running, you can start testing.
vendor/bin/phpunit
Security
If you discover any security-related issues, please email freek@spatie.be instead of using the issue tracker.
Credits
About Spatie
Spatie is a webdesign agency in Antwerp, Belgium. You'll find an overview of all our open source projects on our website.
License
The MIT License (MIT). Please see License File for more information.