Goutte, a simple PHP Web Scraper
Goutte is a screen scraping and web crawling library for PHP.
Goutte provides a nice API to crawl websites and extract data from the HTML/XML responses.
Requirements
Goutte depends on PHP 5.5+ and Guzzle 6+.
Tip
If you need support for PHP 5.4 or Guzzle 4-5, use Goutte 2.x (latest phar).
If you need support for PHP 5.3 or Guzzle 3, use Goutte 1.x (latest phar).
Installation
Add fabpot/goutte as a require dependency in your composer.json file:
composer require fabpot/goutte
Usage
Create a Goutte Client instance (which extends Symfony\Component\BrowserKit\Client):
use Goutte\Client;
$client = new Client();
Make requests with the request() method:
// Go to the symfony.com website
$crawler = $client->request('GET', 'https://www.symfony.com/blog/');
The method returns a Crawler object (Symfony\Component\DomCrawler\Crawler).
To use your own Guzzle settings, create a Guzzle 6 client and pass it to Goutte. For example, to add a 60-second request timeout:
use Goutte\Client;
use GuzzleHttp\Client as GuzzleClient;
$goutteClient = new Client();
$guzzleClient = new GuzzleClient(array(
    'timeout' => 60,
));
$goutteClient->setClient($guzzleClient);
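If several scripts need the same settings, the setup above can be wrapped in a small helper. This is a minimal sketch, not part of Goutte: the function name createScraperClient, the connect_timeout value, and the vendor/autoload.php path (a standard Composer install) are assumptions made for illustration.

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Goutte\Client;
use GuzzleHttp\Client as GuzzleClient;

/**
 * Hypothetical helper: build a Goutte client around a pre-configured
 * Guzzle client. Both timeouts are given in seconds.
 */
function createScraperClient($timeout, $connectTimeout)
{
    $guzzleClient = new GuzzleClient(array(
        'timeout' => $timeout,
        'connect_timeout' => $connectTimeout,
    ));

    $goutteClient = new Client();
    $goutteClient->setClient($guzzleClient);

    return $goutteClient;
}

$scraper = createScraperClient(60, 5);

// getClient() returns the Guzzle instance; getConfig() reads back an option
print $scraper->getClient()->getConfig('timeout')."\n";
```

Reading the option back through getConfig() is a quick way to check that the custom Guzzle instance is the one Goutte will use.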
Click on links:
// Click on the "Security Advisories" link
$link = $crawler->selectLink('Security Advisories')->link();
$crawler = $client->click($link);
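To see which URL a click would request, you can inspect the Link object before passing it to click(). The sketch below works offline by building a Crawler from an inline HTML string; the markup, the example.com base URI, and the vendor/autoload.php path are made up for illustration.

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Made-up markup and base URI standing in for a fetched page
$html = '<p><a href="/blog/category/security-advisories">Security Advisories</a></p>';
$crawler = new Crawler($html, 'https://example.com/blog/');

// link() wraps the first matching <a> element; getUri() resolves its href
// against the URI the crawler was created with
$link = $crawler->selectLink('Security Advisories')->link();

print $link->getUri()."\n";
```

When the crawler comes from $client->request(), the base URI is the URL of the fetched page, so relative links are resolved against it in the same way.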
Extract data:
// Get the latest posts in this category and display their titles
$crawler->filter('h2 > a')->each(function ($node) {
    print $node->text()."\n";
});
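each() also returns an array of whatever the callback returns, which is handy when you want to collect data rather than print it. The following is a minimal offline sketch: the HTML snippet and the vendor/autoload.php path are assumptions, and the Crawler is built from a string instead of a live request.

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Made-up HTML standing in for a fetched page
$html = <<<'HTML'
<html><body>
  <h2><a href="/blog/first">First post</a></h2>
  <h2><a href="/blog/second">Second post</a></h2>
</body></html>
HTML;

$crawler = new Crawler($html);

// Collect the text and the href attribute of every matching link
$posts = $crawler->filter('h2 > a')->each(function ($node) {
    return array(
        'title' => $node->text(),
        'href' => $node->attr('href'),
    );
});

foreach ($posts as $post) {
    print $post['title'].' -> '.$post['href']."\n";
}
```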
Submit forms:
$crawler = $client->request('GET', 'https://github.com/');
$crawler = $client->click($crawler->selectLink('Sign in')->link());
$form = $crawler->selectButton('Sign in')->form();
$crawler = $client->submit($form, array('login' => 'fabpot', 'password' => 'xxxxxx'));
$crawler->filter('.flash-error')->each(function ($node) {
    print $node->text()."\n";
});
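Before submitting, it can help to check what a Form object will send. The sketch below fills in a form offline and prints its method, target URI, and values; the login form markup, the example.com URI, and the vendor/autoload.php path are made up for illustration and do not reflect any real site.

```php
<?php
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Made-up login form and page URI, for illustration only
$html = <<<'HTML'
<form action="/session" method="post">
  <input type="text" name="login" value="">
  <input type="password" name="password" value="">
  <input type="submit" value="Sign in">
</form>
HTML;

$crawler = new Crawler($html, 'https://example.com/login');
$form = $crawler->selectButton('Sign in')->form();

// Fill in the fields by name, using the same array shape that submit() accepts
$form->setValues(array('login' => 'fabpot', 'password' => 'xxxxxx'));

print $form->getMethod().' '.$form->getUri()."\n";
print_r($form->getValues());
```

The keys of the values array must match the name attributes of the form fields; setting a field that does not exist in the form raises an exception.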
More Information
Read the documentation of the BrowserKit and DomCrawler Symfony Components for more information about what you can do with Goutte.
Pronunciation
Goutte is pronounced goot, i.e. it rhymes with boot and not out.
Technical Information
Goutte is a thin wrapper around the following fine PHP libraries:
- Symfony Components: BrowserKit, CssSelector and DomCrawler;
- Guzzle HTTP Component.
License
Goutte is licensed under the MIT license.