/COWS

Corion's Own Web Scraper

Primary LanguagePerlArtistic License 2.0Artistic-2.0

Windows MacOS Linux

NAME

COWS - Corion's Own Web Scraper

SYNOPSIS

use 5.020;
use COWS 'scrape';

my $html = '...';
my $rules = [
    {
        name => 'link',
        query => 'a@href',
        munge => ['absolute'],
    },
];
my $res = scrape( $html, $rules, { url => 'https://example.com/' } );

say "url: $_" for $res->{link}->@*;

scrape

scrape( $html, $rules, $options );

QUERY SYNTAX

For each query, a hashref with the following keys is accepted

  • name

    The name of this query. This will be used as a key in the resulting hashref.

  • query

    The CSS selector or XPath query to search for.

  • fields

    Queries that should be matched below this node. The results will get merged into a hashref. Duplicate names are not allowed (duh).

      fields => [
          { name => 'foo', ... },
          { name => 'bar', ... },
          { name => 'baz', ... },
      ]
    

    results in

      { foo => ..., bar => ..., baz => ... }
    
  • debug

    Output progress while stepping through this query. This is convenient for finding why a specific query doesn't result in what you think it should.

  • single

    Expect only a single item, result will be the value of the query. If the query has a fields field, it will be a hashref of the fields.

    If this key is missing, the result will always be an arrayref.

  • index

    Use the n-th node as result. The result will always be a hashref or scalar value. This could be done in XPath but sometimes it's easier to do it here.

  • discard

    Do not use this intermediate value but replace it by the arrayref of its fields value.

  • html

    Include the value of this node as an HTML string.

  • munge

    Apply the functions in the listed order to the value.

  • tag

      tag => foo=bar
      tag => baz:bat
    

    Add the key/value to the resulting hash. If the separator is = , the result will be a plain scalar:

      foo => 'bar',
    

    If the separator is :, there can be multiple tags and they are collected in an arrayref:

      baz => [ 'bat' ],