embedbug: A PHP repository from kennethrapp

# EmbedBug

EmbedBug is a web scraper and site profiler written in PHP. It returns the results of XPath, tag, and text-content queries in a common array format, which makes parsing the results and building JSON APIs easier.

include_once("EmbedBug.php");

or add it to your Composer project's composer.json:

"require": {
    "kennethrapp/embedbug": "dev-master"
}

and call it:

require_once("vendor/autoload.php");
$EmbedBug = new Embedbug\Embedbug();

Now pass it one or more URLs as an array:

$EmbedBug->SetURLs(array("http://arstechnica.com"));

Each query method pairs a URL with an array of XPaths, an array of tag names, or a text value. Each method returns an array containing the URL, an md5 signature (of the URL and each of the paths or tags, representing the query as a whole), the returned HTTP code and content type, and a 'data' array containing the actual result of the query.

array(
    [url] => http://arstechnica.com
    [hash] => 23c07747b6f003e06597865f8e3628a2
    [http-code] => 200
    [content-type] => text/html; charset=UTF-8
    [data] => Array ( ... )
)

The data array contains an entry for each returned tag name, indexed by that name. Each matched tag is itself an array with all of its attributes (including the text content) as keys.
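Because every query method returns the same shape, a result can be checked and serialized in one place. The following is a minimal sketch that uses a hand-written array in the format shown above instead of a live request. The helper name `resultToJson` is hypothetical and is not part of EmbedBug.

```php
<?php
// Hypothetical helper, NOT part of EmbedBug: serializes a result array to JSON,
// or returns null when the fetch did not succeed.
function resultToJson(array $result): ?string
{
    // Treat anything other than a 2xx http code as a failed fetch.
    $code = (int) ($result['http-code'] ?? 0);
    if ($code < 200 || $code >= 300) {
        return null;
    }

    $json = json_encode($result, JSON_UNESCAPED_SLASHES);
    return $json === false ? null : $json;
}

// Sample result in the documented format (values copied from the example above).
$sample = array(
    'url'          => 'http://arstechnica.com',
    'hash'         => '23c07747b6f003e06597865f8e3628a2',
    'http-code'    => 200,
    'content-type' => 'text/html; charset=UTF-8',
    'data'         => array(),
);

echo resultToJson($sample), "\n";
```

The cast to int is there because this README does not say whether the http code is returned as an integer or a string.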

retrieve tags by name

$EmbedBug->GetTags( string $url, array $tags);

(note $tags is a simple array of tag names)

$result = $EmbedBug->GetTags("http://arstechnica.com", array("meta"));
var_dump($result);

Array
(
[url] => http://arstechnica.com
[hash] => 203291d26e9b01e08e40956308150364
[http-code] => 200
[content-type] => text/html; charset=UTF-8
[data] => Array
    (
        [meta] => Array
            (
                [0] => Array
                    (
                        [tag] => meta
                        [name] => application-name
                        [content] => Ars Technica
                    )

                [1] => Array
                    (
                        [tag] => meta
                        [name] => msapplication-starturl
                        [content] => http://arstechnica.com/
                    )

                [2] => Array
                    (
                        [tag] => meta
                        [name] => msapplication-tooltip
                        [content] => Ars Technica: Serving the technologist for 1.2 decades
                    )
                 (... etc etc etc)
          )
    )
)
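Once the tags are in this format, ordinary array handling is enough to reshape them. As a small sketch, the snippet below flattens the meta entries into a name => content map. The `$result` array is a hand-written sample copied from the output above, not a live response, and `$meta` is just an illustrative variable name.

```php
<?php
// Sample result in the format shown above (first two meta entries only).
$result = array(
    'url'  => 'http://arstechnica.com',
    'data' => array(
        'meta' => array(
            array('tag' => 'meta', 'name' => 'application-name', 'content' => 'Ars Technica'),
            array('tag' => 'meta', 'name' => 'msapplication-starturl', 'content' => 'http://arstechnica.com/'),
        ),
    ),
);

// Build a name => content map, skipping meta tags that lack either attribute.
$meta = array();
foreach ($result['data']['meta'] ?? array() as $tag) {
    if (isset($tag['name'], $tag['content'])) {
        $meta[$tag['name']] = $tag['content'];
    }
}

print_r($meta);
```

Not every meta tag has a name attribute (Open Graph tags use property, for instance), so the `isset` check avoids undefined-index notices.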

retrieve tags by xpath

$EmbedBug->GetXPaths( string $url, array $paths);

(note $paths is an array with key=query name, value=associated xpath)

$result = $EmbedBug->GetXPaths("http://arstechnica.com", array("open-graph"=>"//meta[contains(@property, 'og:')]"));
var_dump($result);

Array
(
[url] => http://arstechnica.com
[hash] => cb1d01f1025ed5272482e44a1fe6d8d2
[http-code] => 200
[content-type] => text/html; charset=UTF-8
[data] => Array
    (
        [open-graph] => Array
            (
                [0] => Array
                    (
                        [tag] => meta
                        [property] => og:site_name
                        [content] => Ars Technica
                    )

                [1] => Array
                    (
                        [tag] => meta
                        [property] => og:title
                        [content] => Ars Technica
                    )

                [2] => Array
                    (
                        [tag] => meta
                        [property] => og:type
                        [content] => website
                    )
                (etc..)
        )))
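To see what the XPath above selects without making a request, the same expression can be run against an inline HTML string with PHP's built-in DOM extension. This is a standalone sketch and not EmbedBug's own implementation. The sample HTML and the `$openGraph` variable are made up for illustration.

```php
<?php
// Standalone sketch using PHP's DOM extension; NOT EmbedBug's own code.
$html = '<html><head>'
      . '<meta property="og:site_name" content="Ars Technica">'
      . '<meta property="og:type" content="website">'
      . '<meta name="application-name" content="Ars Technica">'
      . '</head><body></body></html>';

$doc = new DOMDocument();
// Collect parser warnings internally; real-world HTML often triggers them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$nodes = $xpath->query("//meta[contains(@property, 'og:')]");

// Collect each matching element as its tag name plus its attributes.
$openGraph = array();
foreach ($nodes as $node) {
    $entry = array('tag' => $node->nodeName);
    foreach ($node->attributes as $attr) {
        $entry[$attr->name] = $attr->value;
    }
    $openGraph[] = $entry;
}

print_r($openGraph);
```

Only the two meta tags with an og: property match. The third has no property attribute, so `contains()` tests an empty string and fails.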

retrieve text content b

$EmbedBug->GetText( string $url, array $tags, string $text);

(note $tags is a simple array of tag names; $text will apply to all of the tags)