hQuery.php

An extremely fast and efficient web scraper that parses megabytes of HTML in the blink of an eye.

Features

Very fast parsing and lookup
Parses broken HTML
jQuery-like style of DOM traversal
Low memory usage
Can handle big HTML documents (tested up to 20 MB; the practical limit is the amount of available RAM)
Doesn't require cURL to be installed
Automatically handles redirects (301, 302, 303)
Caches the response for reuse across multiple processing tasks
PHP 5+

Usage

  include '/path/to/libs/hquery.php';

  // Set the cache path - must be a writable folder
  hQuery::$cache_path = "/path/to/cache";

  // Open a remote HTML document
  $doc = hQuery::fromUrl('http://example.com/someDoc.html');

  // Open a local HTML document
  $doc = hQuery::fromFile('/path/to/filesystem/doc.html');

  // Load HTML from a string
  $doc = hQuery::fromHTML('<html><head><title>Sample HTML Doc</title><body>Contents...</body></html>');

  // Set base_url in case the document is loaded from a local source.
  // Note: base_url is used to resolve relative URLs into absolute ones
  $doc->base_url = 'http://desired-host.net/path';

  // Find all banners (images inside anchors)
  $banners = $doc->find('a > img:parent');

  // Extract links, titles and images
  $links  = array();
  $images = array();
  $titles = array();
  foreach($banners as $pos => $a) {
      $links[$pos] = $a->attr('href');
      $titles[$pos] = trim($a->text()); // strip all HTML tags and leave just text
      $images[$pos] = $a->find('img')->attr('src');
  }

  // Read charset of the original document (internally it is converted to UTF-8)
  $charset = $doc->charset;

  // Get the size of the document ( strlen($html) )
  $size = $doc->size;
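
The base_url note above says that relative URLs are turned into absolute ones. The sketch below illustrates that idea in plain PHP. It is not hQuery's own code: `resolve_url()` is a hypothetical helper written only for this example, and it assumes the base URL is absolute (it has a scheme and a host). It covers the common cases (absolute, protocol-relative, root-relative and path-relative references) and collapses repeated slashes, so treat it as an approximation of RFC 3986 resolution rather than a full implementation.

```php
<?php
// Hypothetical helper (not part of hQuery): resolve $rel against $base.
// Assumes $base is an absolute URL with a scheme and a host.
function resolve_url($base, $rel)
{
    // Empty, fragment-only and query-only references reuse the base URL
    if ($rel === '' || $rel[0] === '#' || $rel[0] === '?') {
        $stop = ($rel !== '' && $rel[0] === '?') ? '?#' : '#';
        return substr($base, 0, strcspn($base, $stop)) . $rel;
    }
    // Protocol-relative reference: borrow the scheme from the base
    if (substr($rel, 0, 2) === '//') {
        return parse_url($base, PHP_URL_SCHEME) . ':' . $rel;
    }
    // A reference with its own scheme is already absolute
    if (parse_url($rel, PHP_URL_SCHEME)) {
        return $rel;
    }

    $b    = parse_url($base);
    $root = $b['scheme'] . '://' . $b['host']
          . (isset($b['port']) ? ':' . $b['port'] : '');

    // Split off the query/fragment so it is not treated as path segments
    $cut  = strcspn($rel, '?#');
    $tail = (string) substr($rel, $cut);
    $path = substr($rel, 0, $cut);

    // Path-relative reference: prepend the directory part of the base path
    if ($path[0] !== '/') {
        $basePath = isset($b['path']) ? $b['path'] : '/';
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $path;
    }

    // Remove "." segments and apply ".." segments
    $segs = explode('/', $path);
    $out  = array();
    foreach ($segs as $seg) {
        if ($seg === '..') {
            array_pop($out);
        } elseif ($seg !== '.' && $seg !== '') {
            $out[] = $seg;
        }
    }
    $norm = '/' . implode('/', $out);

    // Keep a trailing slash when the path ended in "/", "." or ".."
    if (count($out) > 0 && in_array(end($segs), array('', '.', '..'), true)) {
        $norm .= '/';
    }
    return $root . $norm . $tail;
}
```

With the base_url from the example, `resolve_url('http://desired-host.net/path', 'img/banner.png')` gives `http://desired-host.net/img/banner.png`: the last segment of the base path is dropped before the relative path is appended, as in standard URL resolution.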

Demo

On DUzun.Me

TODO

Add Unit tests
Document everything
Add more selectors
Improve selectors to support selection by attribute
