Crawl links on a website

This package provides a class to crawl links on a website.

Spatie is a webdesign agency in Antwerp, Belgium. You'll find an overview of all our open source projects on our website.

Postcardware

You're free to use this package (it's MIT-licensed), but if it makes it to your production environment you are required to send us a postcard from your hometown, mentioning which of our package(s) you are using.

Our address is: Spatie, Samberstraat 69D, 2060 Antwerp, Belgium.

The best postcards will get published on the open source page on our website.

Installation

This package can be installed via Composer:

composer require spatie/crawler

Usage

The crawler can be instantiated like this:

Crawler::create()
    ->setCrawlObserver(<implementation of \Spatie\Crawler\CrawlObserver>)
    ->startCrawling($url);

The argument passed to setCrawlObserver must be an object that implements the \Spatie\Crawler\CrawlObserver interface:

/**
 * Called when the crawler will crawl the given url.
 *
 * @param \Spatie\Crawler\Url $url
 */
public function willCrawl(Url $url);

/**
 * Called when the crawler has crawled the given url.
 *
 * @param \Spatie\Crawler\Url                 $url
 * @param \Psr\Http\Message\ResponseInterface $response
 */
public function hasBeenCrawled(Url $url, ResponseInterface $response);

/**
 * Called when the crawl has ended.
 */
public function finishedCrawling();
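
As a rough illustration of what an observer might do with these callbacks, here is a minimal sketch of a result collector. `CrawlReport` and its methods are hypothetical and not part of this package; the sketch assumes your observer forwards the crawled URL as a string and the status code it reads from the PSR-7 response via `getStatusCode()`.

```php
<?php

// Hypothetical helper, not part of spatie/crawler: it collects crawl results
// so that an observer only has to forward what it receives.
final class CrawlReport
{
    /** @var array<string, int> HTTP status code keyed by URL */
    private $statusCodes = [];

    // Meant to be called from hasBeenCrawled(), for example with the value
    // of $response->getStatusCode().
    public function record(string $url, int $statusCode): void
    {
        $this->statusCodes[$url] = $statusCode;
    }

    // Returns the URLs that answered with a 4xx or 5xx status code.
    public function brokenUrls(): array
    {
        $broken = array_filter($this->statusCodes, function (int $code): bool {
            return $code >= 400;
        });

        return array_keys($broken);
    }

    // Meant to be called from finishedCrawling().
    public function summary(): string
    {
        return sprintf(
            '%d URLs crawled, %d broken',
            count($this->statusCodes),
            count($this->brokenUrls())
        );
    }
}

$report = new CrawlReport();
$report->record('https://example.com/', 200);
$report->record('https://example.com/missing', 404);

echo $report->summary(), PHP_EOL; // 2 URLs crawled, 1 broken
```

Keeping the bookkeeping in a plain class like this means it can be tested without running a crawl; the observer itself then only implements the three methods above and delegates to it.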

Filtering certain URLs

You can instruct the crawler not to visit certain URLs by using the setCrawlProfile method. It expects an object that implements the Spatie\Crawler\CrawlProfile interface. The method itself is defined as follows:

/**
 * Set the crawl profile.
 *
 * @param \Spatie\Crawler\CrawlProfile $crawlProfile
 *
 * @return $this
 */
public function setCrawlProfile(CrawlProfile $crawlProfile)
{
    $this->crawlProfile = $crawlProfile;
    return $this;
}
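
The methods of the CrawlProfile interface are not listed in this README, so the following is only a sketch of the kind of decision a profile could make: skipping URLs that point to another host. `isInternalUrl` and its parameters are hypothetical names, not part of this package, and the function works on plain URL strings.

```php
<?php

// Hypothetical helper, not part of spatie/crawler: decides whether a URL
// belongs to the host the crawl started on.
function isInternalUrl(string $url, string $baseHost): bool
{
    $host = parse_url($url, PHP_URL_HOST);

    // parse_url() returns null when the URL has no host component and false
    // for seriously malformed URLs; neither counts as internal here.
    if (!is_string($host)) {
        return false;
    }

    // Host names are case-insensitive, so compare them that way.
    return strcasecmp($host, $baseHost) === 0;
}

var_dump(isInternalUrl('https://example.com/about', 'example.com')); // bool(true)
var_dump(isInternalUrl('https://other.test/page', 'example.com'));   // bool(false)
var_dump(isInternalUrl('/relative/path', 'example.com'));            // bool(false)
```

A crawl profile could call a helper like this with the string form of the URL it is asked about and the host of the start URL. Note that relative URLs are rejected in this sketch, so they would need to be resolved against the base URL first.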

Changelog

Please see CHANGELOG for more information on what has changed recently.

Contributing

Please see CONTRIBUTING for details.

Security

If you discover any security-related issues, please email freek@spatie.be instead of using the issue tracker.

Credits

License

The MIT License (MIT). Please see License File for more information.