xextract

Extract structured data from HTML and XML documents like a boss.

xextract is simple enough for writing a one-line parser, yet powerful enough to be used in a big project.

Features

  • Parsing of HTML and XML documents
  • Supports xpath and css selectors
  • Simple declarative style of parsers
  • Built-in self-validation to let you know when the structure of the website has changed
  • Speed - under the hood the library uses lxml with compiled xpath selectors

A little taste of it

Let's parse The Shawshank Redemption's IMDB page:

# fetch the website
>>> import requests
>>> response = requests.get('http://www.imdb.com/title/tt0111161/')

# parse like a boss
>>> from xextract import String, Group

# extract title with css selector
>>> String(css='h1[itemprop="name"]', count=1).parse(response.text)
'The Shawshank Redemption'

# extract release year with xpath selector
>>> String(xpath='//*[@id="titleYear"]/a', count=1, callback=int).parse(response.text)
1994

# extract structured data
>>> Group(css='.cast_list tr:not(:first-child)', children=[
...   String(name='name', css='[itemprop="actor"]', attr='_all_text', count=1),
...   String(name='character', css='.character', attr='_all_text', count=1)
... ]).parse(response.text)
[
 {'name': 'Tim Robbins', 'character': 'Andy Dufresne'},
 {'name': 'Morgan Freeman', 'character': "Ellis Boyd 'Red' Redding"},
 ...
]

Installation

To install xextract, simply run:

$ pip install xextract

Requirements: lxml, cssselect

Supported Python versions are 3.5 - 3.11.

Windows users can download lxml binary here.

Parsers

String

Parameters: name (optional), css / xpath (optional, default "self::*"), count (optional, default "*"), attr (optional, default "_text"), callback (optional), namespaces (optional)

Extract string data from the matched element(s). Extracted value is always unicode.

By default, String extracts only the text content of the matched element itself, not of its descendants. To extract and concatenate the text of every descendant element as well, use the attr parameter with the special value "_all_text".

Use attr parameter to extract the data from an HTML/XML attribute.

Use callback parameter to post-process extracted values.

Example:

>>> from xextract import String
>>> String(css='span', count=1).parse('<span>Hello <b>world</b>!</span>')
'Hello !'

>>> String(css='span', count=1, attr='class').parse('<span class="text-success"></span>')
'text-success'

# use special `attr` value `_all_text` to extract and concatenate text out of all descendants
>>> String(css='span', count=1, attr='_all_text').parse('<span>Hello <b>world</b>!</span>')
'Hello world!'

# use special `attr` value `_name` to extract tag name of the matched element
>>> String(css='span', count=1, attr='_name').parse('<span>hello</span>')
'span'

>>> String(css='span', callback=int).parse('<span>1</span><span>2</span>')
[1, 2]
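
The difference between "_text" and "_all_text" can be illustrated without xextract. The following is a minimal sketch using only the standard library's xml.etree.ElementTree. The helper names own_text and all_text are hypothetical, and this is not a description of how xextract is implemented; it only mirrors the outputs shown above.

```python
import xml.etree.ElementTree as ET


def own_text(el):
    """Text that belongs directly to `el`: its leading text plus the
    tail text that follows each child element (hypothetical helper)."""
    parts = [el.text or '']
    for child in el:
        parts.append(child.tail or '')
    return ''.join(parts)


def all_text(el):
    """Text of `el` and of all its descendants, in document order
    (hypothetical helper)."""
    return ''.join(el.itertext())


span = ET.fromstring('<span>Hello <b>world</b>!</span>')
print(own_text(span))  # Hello !
print(all_text(span))  # Hello world!
```

The text inside the nested b element is skipped in the first case and included in the second, which matches the 'Hello !' and 'Hello world!' results of the String examples.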

Url

Parameters: name (optional), css / xpath (optional, default "self::*"), count (optional, default "*"), attr (optional, default "href"), callback (optional), namespaces (optional)

Behaves like String parser, but with two exceptions:

  • default value for attr parameter is "href"
  • if you pass url parameter to parse() method, the absolute url will be constructed and returned

If callback is specified, it is called after the absolute urls are constructed.

Example:

>>> from xextract import Url, Prefix
>>> content = '<div id="main"> <a href="/test">Link</a> </div>'

>>> Url(css='a', count=1).parse(content)
'/test'

>>> Url(css='a', count=1).parse(content, url='http://github.com/Mimino666')
'http://github.com/test'  # absolute url address. Told ya!

>>> Prefix(css='#main', children=[
...   Url(name='link', css='a', count=1)
... ]).parse(content, url='http://github.com/Mimino666')  # you can also pass url to an ancestor's parse(). It will propagate down.
{'link': 'http://github.com/test'}
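
The absolute url in the example is what standard URL resolution gives for a root-relative link. Here is a small sketch with urllib.parse.urljoin from the standard library, assuming (this document does not confirm it) that xextract resolves links the same way:

```python
from urllib.parse import urljoin

base = 'http://github.com/Mimino666'

# A root-relative href replaces the whole path of the base url.
absolute = urljoin(base, '/test')
print(absolute)  # http://github.com/test

# A plain relative href replaces only the last path segment of the base...
sibling = urljoin(base, 'test')
print(sibling)  # http://github.com/test

# ...unless the base ends with a slash, in which case it is appended.
nested = urljoin(base + '/', 'test')
print(nested)  # http://github.com/Mimino666/test
```

If links resolve to unexpected addresses, the trailing slash of the url passed to parse() is worth checking first.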

DateTime

Parameters: name (optional), css / xpath (optional, default "self::*"), format (required), count (optional, default "*"), attr (optional, default "_text"), callback (optional), namespaces (optional)

Returns the datetime.datetime object constructed out of the extracted data: datetime.strptime(extracted_data, format).

format syntax is described in the Python documentation.

If callback is specified, it is called after the datetime objects are constructed.

Example:

>>> from xextract import DateTime
>>> DateTime(css='span', count=1, format='%d.%m.%Y %H:%M').parse('<span>24.12.2015 5:30</span>')
datetime.datetime(2015, 12, 24, 5, 30)
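
Since the conversion is described as datetime.strptime(extracted_data, format), a format string can be tried out with the standard library alone before it is put into a parser. A minimal sketch follows; how xextract reports a value that does not match the format is not covered here and is left as an open question.

```python
from datetime import datetime

extracted = '24.12.2015 5:30'
parsed = datetime.strptime(extracted, '%d.%m.%Y %H:%M')
print(repr(parsed))  # datetime.datetime(2015, 12, 24, 5, 30)

# A value that does not match the format makes strptime raise ValueError.
mismatch = None
try:
    datetime.strptime('2015-12-24', '%d.%m.%Y %H:%M')
except ValueError as exc:
    mismatch = str(exc)
print(mismatch is not None)  # True
```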

Date

Parameters: name (optional), css / xpath (optional, default "self::*"), format (required), count (optional, default "*"), attr (optional, default "_text"), callback (optional), namespaces (optional)

Returns the datetime.date object constructed out of the extracted data: datetime.strptime(extracted_data, format).date().

format syntax is described in the Python documentation.

If callback is specified, it is called after the date objects are constructed.

Example:

>>> from xextract import Date
>>> Date(css='span', count=1, format='%d.%m.%Y').parse('<span>24.12.2015</span>')
datetime.date(2015, 12, 24)

Element

Parameters: name (optional), css / xpath (optional, default "self::*"), count (optional, default "*"), callback (optional), namespaces (optional)

Returns lxml instance (lxml.etree._Element) of the matched element(s). If you use xpath expression and match the text content of the element (e.g. text() or @attr), unicode is returned.

If callback is specified, it is called with lxml.etree._Element instance.

Example:

>>> from xextract import Element
>>> Element(css='span', count=1).parse('<span>Hello</span>')
<Element span at 0x2ac2990>

>>> Element(css='span', count=1, callback=lambda el: el.text).parse('<span>Hello</span>')
'Hello'

# same as above
>>> Element(xpath='//span/text()', count=1).parse('<span>Hello</span>')
'Hello'

Group

Parameters: name (optional), css / xpath (optional, default "self::*"), children (required), count (optional, default "*"), callback (optional), namespaces (optional)

For each element matched by css/xpath selector returns the dictionary containing the data extracted by the parsers listed in children parameter. All parsers listed in children parameter must have name specified - this is then used as the key in dictionary.

Typical use case for this parser is when you want to parse structured data, e.g. list of user profiles, where each profile contains fields like name, address, etc. Use Group parser to group the fields of each user profile together.

If callback is specified, it is called with the dictionary of parsed children values.

Example:

>>> from xextract import Group, String
>>> content = '<ul><li id="id1">michal</li> <li id="id2">peter</li></ul>'

>>> Group(css='li', count=2, children=[
...     String(name='id', xpath='self::*', count=1, attr='id'),
...     String(name='name', xpath='self::*', count=1)
... ]).parse(content)
[{'name': 'michal', 'id': 'id1'},
 {'name': 'peter', 'id': 'id2'}]
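
To make the "one dictionary per matched element" idea concrete, here is a hand-written equivalent of the example above using only xml.etree.ElementTree from the standard library. It is a sketch of the result shape, not of xextract's internals, and it does none of the count validation that Group performs.

```python
import xml.etree.ElementTree as ET

content = '<ul><li id="id1">michal</li> <li id="id2">peter</li></ul>'
root = ET.fromstring(content)

# One dictionary per matched <li>; each key plays the role of a child
# parser's `name`.
rows = [{'id': li.get('id'), 'name': li.text} for li in root.iter('li')]
print(rows)
# [{'id': 'id1', 'name': 'michal'}, {'id': 'id2', 'name': 'peter'}]
```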

Prefix

Parameters: css / xpath (optional, default "self::*"), children (required), namespaces (optional)

This parser doesn't actually parse any data on its own. Instead, use it when many of your parsers share the same css/xpath selector prefix.

Prefix parser always returns a single dictionary containing the data extracted by the parsers listed in children parameter. All parsers listed in children parameter must have name specified - this is then used as the key in dictionary.

Example:

# instead of...
>>> String(css='#main .name').parse(...)
>>> String(css='#main .date').parse(...)

# ...you can use
>>> from xextract import Prefix
>>> Prefix(css='#main', children=[
...   String(name="name", css='.name'),
...   String(name="date", css='.date')
... ]).parse(...)

Parser parameters

name

Parsers: String, Url, DateTime, Date, Element, Group

Default value: None

If specified, then the extracted data will be returned in a dictionary, with the name as the key and the data as the value.

All parsers listed in children parameter of Group or Prefix parser must have name specified. If multiple children parsers have the same name, the behavior is undefined.

Example:

# when `name` is not specified, raw value is returned
>>> String(css='span', count=1).parse('<span>Hello!</span>')
'Hello!'

# when `name` is specified, dictionary is returned with `name` as the key
>>> String(name='message', css='span', count=1).parse('<span>Hello!</span>')
{'message': 'Hello!'}

css / xpath

Parsers: String, Url, DateTime, Date, Element, Group, Prefix

Default value (xpath): "self::*"

Use either css or xpath parameter (but not both) to select the elements from which to extract the data.

Under the hood css selectors are translated into equivalent xpath selectors.

For the children of Prefix or Group parsers, the elements are selected relative to the elements matched by the parent parser.

Example:

Prefix(xpath='//*[@id="profile"]', children=[
    # equivalent to: //*[@id="profile"]/descendant-or-self::*[@class="name"]
    String(name='name', css='.name', count=1),

    # equivalent to: //*[@id="profile"]/*[@class="title"]
    String(name='title', xpath='*[@class="title"]', count=1),

    # equivalent to: //*[@class="subtitle"]
    String(name='subtitle', xpath='//*[@class="subtitle"]', count=1)
])
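
The effect of relative versus absolute selectors can be tried out with the standard library. The sketch below uses xml.etree.ElementTree, which supports only a subset of xpath (it has no descendant-or-self axis and rejects absolute paths on an element), so it approximates the three cases above and is not what xextract runs. The sample markup is made up for illustration.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<html><body>'
    '<div id="profile">'
    '<p class="title">Dr.</p>'
    '<div><span class="name">John</span><span class="title">nested</span></div>'
    '</div>'
    '<p class="subtitle">outside</p>'
    '</body></html>'
)
profile = doc.find('.//*[@id="profile"]')

# Like css='.name' under the prefix: any descendant of the profile element.
names = [e.text for e in profile.findall('.//*[@class="name"]')]
print(names)  # ['John']

# Like xpath='*[@class="title"]': direct children of the profile element only.
titles = [e.text for e in profile.findall('*[@class="title"]')]
print(titles)  # ['Dr.']

# Like xpath='//*[@class="subtitle"]': searched from the document root,
# so the prefix element is ignored.
subtitles = [e.text for e in doc.findall('.//*[@class="subtitle"]')]
print(subtitles)  # ['outside']
```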

count

Parsers: String, Url, DateTime, Date, Element, Group

Default value: "*"

count specifies the expected number of elements to be matched with css/xpath selector. It serves two purposes:

  1. The number of matched elements is checked against the count parameter. If the number of elements doesn't match the expected count, xextract.parsers.ParsingError exception is raised. This way you will be notified when the website has changed its structure.
  2. It tells the parser whether to return a single extracted value or a list of values. See the table below.

Syntax for count mimics regular expressions. You can pass the value as a string, a single integer or a tuple of two integers.

Depending on the value of count, the parser returns either a single extracted value or a list of values.

Value of count | Meaning                                                        | Extracted data
"*" (default)  | Zero or more elements.                                         | List of values
"+"            | One or more elements.                                          | List of values
"?"            | Zero or one element.                                           | Single value or None
num            | Exactly num elements.                                          | num == 0: None
               | You can pass either a string or an integer.                    | num == 1: Single value
               |                                                                | num > 1: List of values
(num1, num2)   | Number of elements has to be between num1 and num2, inclusive. | List of values
               | You can pass either a string or a 2-tuple.                     |

Example:

>>> String(css='.full-name', count=1).parse(content)  # return single value
'John Rambo'

>>> String(css='.full-name', count='1').parse(content)  # same as above
'John Rambo'

>>> String(css='.full-name', count=(1,2)).parse(content)  # return list of values
['John Rambo']

>>> String(css='.full-name', count='1,2').parse(content)  # same as above
['John Rambo']

>>> String(css='.middle-name', count='?').parse(content)  # return single value or None
None

>>> String(css='.job-titles', count='+').parse(content)  # return list of values
['President', 'US Senator', 'State Senator', 'Senior Lecturer in Law']

>>> String(css='.friends', count='*').parse(content)  # return possibly empty list of values
[]

>>> String(css='.friends', count='+').parse(content)  # raise exception, when no elements are matched
xextract.parsers.ParsingError: Parser String matched 0 elements ("+" expected).
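
The rules for count can be summarised in a few lines of plain Python. This is a minimal sketch written from the table and examples in this section, not xextract's actual code: the function names parse_count and apply_count are hypothetical, and ValueError stands in for xextract.parsers.ParsingError.

```python
def parse_count(count):
    """Normalise a count spec into (minimum, maximum, returns_list).

    maximum is None when there is no upper bound. Hypothetical helper.
    """
    if isinstance(count, tuple):
        low, high = count
        return low, high, True
    if count == '*':
        return 0, None, True
    if count == '+':
        return 1, None, True
    if count == '?':
        return 0, 1, False
    if isinstance(count, str) and ',' in count:
        low, high = (int(part) for part in count.split(','))
        return low, high, True
    num = int(count)  # accepts both 1 and '1'
    return num, num, num > 1


def apply_count(count, values):
    """Validate the number of matches and pick the return shape."""
    low, high, as_list = parse_count(count)
    matched = len(values)
    if matched < low or (high is not None and matched > high):
        raise ValueError(
            'matched %d elements (%r expected)' % (matched, count))
    if as_list:
        return list(values)
    return values[0] if values else None


print(apply_count(1, ['John Rambo']))       # John Rambo
print(apply_count('1,2', ['John Rambo']))   # ['John Rambo']
print(apply_count('?', []))                 # None
print(apply_count('*', []))                 # []
```

Calling apply_count('+', []) raises the stand-in ValueError, which corresponds to the ParsingError case in the last example above.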

attr

Parsers: String, Url, DateTime, Date

Default value: "href" for Url parser. "_text" otherwise.

Use attr parameter to specify what data to extract from the matched element.

Value of attr | Meaning
"_text"       | Extract the text content of the matched element.
"_all_text"   | Extract and concatenate the text content of the matched element and all its descendants.
"_name"       | Extract tag name of the matched element.
att_name      | Extract the value of the att_name attribute of the matched element.
              | If such attribute doesn't exist, empty string is returned.

Example:

>>> from xextract import String, Url
>>> content = '<span class="name">Barack <strong>Obama</strong> III.</span> <a href="/test">Link</a>'

>>> String(css='.name', count=1).parse(content)  # default attr is "_text"
'Barack  III.'

>>> String(css='.name', count=1, attr='_text').parse(content)  # same as above
'Barack  III.'

>>> String(css='.name', count=1, attr='_all_text').parse(content)  # all text
'Barack Obama III.'

>>> String(css='.name', count=1, attr='_name').parse(content)  # tag name
'span'

>>> Url(css='a', count='1').parse(content)  # Url extracts href by default
'/test'

>>> String(css='a', count='1', attr='id').parse(content)  # non-existent attributes return empty string
''

callback

Parsers: String, Url, DateTime, Date, Element, Group

Provides an easy way to post-process extracted values. It should be a function that takes a single argument, the extracted value, and returns the postprocessed value.

Example:

>>> String(css='span', callback=int).parse('<span>1</span><span>2</span>')
[1, 2]

>>> Element(css='span', count=1, callback=lambda el: el.text).parse('<span>Hello</span>')
'Hello'
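
A callback does not have to be a built-in such as int. The sketch below defines a small cleaning function and applies it to each extracted value, the way the first example above applies int to each matched span. The function parse_price and the sample strings are made up for illustration.

```python
def parse_price(text):
    """Hypothetical callback: turn a string like '$1,299.00' into a float."""
    return float(text.strip().lstrip('$').replace(',', ''))


# Stand-in for the raw strings a String parser might extract.
extracted = [' $1,299.00 ', '$15.50']
prices = [parse_price(value) for value in extracted]
print(prices)  # [1299.0, 15.5]
```

Such a function could then be passed as callback=parse_price to a parser. Because it takes one value and returns one value, it can be tested on its own, without any HTML.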

children

Parsers: Group, Prefix

Specifies the children parsers for the Group and Prefix parsers. All parsers listed in the children parameter must have name specified.

Css/xpath selectors in the children parsers are relative to the selectors specified in the parent parser.

Example:

Prefix(xpath='//*[@id="profile"]', children=[
    # equivalent to: //*[@id="profile"]/descendant-or-self::*[@class="name"]
    String(name='name', css='.name', count=1),

    # equivalent to: //*[@id="profile"]/*[@class="title"]
    String(name='title', xpath='*[@class="title"]', count=1),

    # equivalent to: //*[@class="subtitle"]
    String(name='subtitle', xpath='//*[@class="subtitle"]', count=1)
])

namespaces

Parsers: String, Url, DateTime, Date, Element, Group, Prefix

When parsing XML documents containing namespace prefixes, pass a dictionary mapping namespace prefixes to namespace URIs. Then use the full element name in xpath selectors, in the form "prefix:element".

At the moment, you cannot use the default namespace for parsing (see lxml docs for more information). Just use an arbitrary prefix.

Example:

>>> content = '''<?xml version='1.0' encoding='UTF-8'?>
... <movie xmlns="http://imdb.com/ns/">
...   <title>The Shawshank Redemption</title>
...   <year>1994</year>
... </movie>'''
>>> nsmap = {'imdb': 'http://imdb.com/ns/'}  # use arbitrary prefix for default namespace

>>> Prefix(xpath='//imdb:movie', namespaces=nsmap, children=[  # pass namespaces to the outermost parser
...   String(name='title', xpath='imdb:title', count=1),
...   String(name='year', xpath='imdb:year', count=1)
... ]).parse(content)
{'title': 'The Shawshank Redemption', 'year': '1994'}
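
Why an arbitrary prefix is needed for a default namespace can be seen with the standard library's xml.etree.ElementTree, which follows the same convention of passing a prefix-to-URI dictionary. This sketch is independent of xextract, and the XML declaration is left out to keep the input a plain string.

```python
import xml.etree.ElementTree as ET

content = (
    '<movie xmlns="http://imdb.com/ns/">'
    '<title>The Shawshank Redemption</title>'
    '<year>1994</year>'
    '</movie>'
)
nsmap = {'imdb': 'http://imdb.com/ns/'}  # arbitrary prefix for the default namespace

root = ET.fromstring(content)

# Every element lives in the namespace, even though no prefix is written.
print(root.tag)  # {http://imdb.com/ns/}movie

# An unprefixed name therefore matches nothing...
unprefixed = root.find('title')
print(unprefixed)  # None

# ...while the prefixed name, resolved through the mapping, does.
title = root.find('imdb:title', nsmap).text
print(title)  # The Shawshank Redemption
```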

HTML vs. XML parsing

To extract data from HTML or XML document, simply call parse() method of the parser:

>>> from xextract import *
>>> parser = Prefix(..., children=[...])
>>> extracted_data = parser.parse(content)

content can be either string or unicode, containing the content of the document.

Under the hood xextract uses either lxml.etree.XMLParser or lxml.etree.HTMLParser to parse the document. To select the parser, xextract looks for "<?xml" string in the first 128 bytes of the document. If it is found, then XMLParser is used.

To force either of the parsers, you can call parse_html() or parse_xml() method:

>>> parser.parse_html(content)  # force lxml.etree.HTMLParser
>>> parser.parse_xml(content)   # force lxml.etree.XMLParser
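
The detection rule described above is simple enough to sketch in plain Python. The function name looks_like_xml is hypothetical, and the handling of str input (first 128 characters instead of bytes) is an assumption of this sketch, not a statement about xextract's code.

```python
def looks_like_xml(content):
    """Return True if '<?xml' appears near the start of the document.

    Hypothetical helper mirroring the rule described above.
    """
    head = content[:128]
    marker = b'<?xml' if isinstance(head, bytes) else '<?xml'
    return marker in head


print(looks_like_xml("<?xml version='1.0'?><movie/>"))  # True
print(looks_like_xml('<html><body></body></html>'))     # False
# A declaration that starts after the first 128 characters is not seen.
print(looks_like_xml(' ' * 200 + '<?xml version="1.0"?>'))  # False
```

Under this rule an XML document without a declaration would be treated as HTML, which is the case parse_xml() is useful for.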