Convert a HTML webpage to JSON data using a template defined in JSON.
This package is available on PyPi. Just use pip install -U html2json
to install it. Then you can import it using from html2json import collect
.
The method is collect(html, template)
. html
is the HTML of page loaded as string, and template
is the JSON of template loaded as Python objects.
Note that the HTML must contain the root node, like <html>...</html>
or <div>...</div>
.
- The basic syntax is
keyName: [selector, attr, [listOfRegexes]]
.selector
is a CSS selector (supported by lxml).- When the selector is
null
, the root node itself is matched. - When the selector cannot be matched,
null
is returned.
- When the selector is
attr
matches the attribute value. It can benull
to match either the inner text or the outer text when the inner text is empty.- The list of regexes
[listOfRegexes]
supports two forms of regex operations. The operations with in the list are executed sequentially.- Replacement:
s/regex/replacement/g
.g
is optional for multiple replacements. - Extraction:
/regex/
.
- Replacement:
For example:
{
"Color": ["head link:nth-of-type(1)", "href", ["/\\w+(?=\\.css)/"]],
}
- As JSON, nested structure can be easily constructed.
{
"Cover": {
"URL": [".cover img", "src", []],
"Number of Favorites": [".cover .favorites", "value", []]
},
}
- An alternative simplified syntax
keyName: [subRoot, subTemplate]
can be used.subRoot
a CSS selector of the new root for each sub entry.subTemplate
is a sub-template for each entry, recursively.
For example, the previous example can be simplified as follow.
{
"Cover": [".cover", {
"URL": ["img", "src", []],
"Number of Favorites": [".favorites", "value", []]
}],
}
- To extract a list of sub-entries following the same sub-template, the list syntax is
keyName: [[subRoot, subTemplate]]
. Please note the difference (surrounding[
and]
) from the previous syntax above.subRoot
is the CSS selector of the new root for each sub entry.subTemplate
is the sub-template for each entry, recursively.
For example:
{
"Comments": [[".comments", {
"From": [".from", null, []],
"Content": [".content", null, []],
"Photos": [["img", {
"URL": ["", "src", []]
}]]
}]]
}