Jsonify HTML is a general-purpose, template-based parser that transforms HTML into reusable JSON data. It can be thought of as the reverse of a template-based renderer such as Jinja.
🎉🎉 Coming Soon: Jsonify HTML v2 🎉🎉 Include statements, pipes, sandboxing, references, simple scripting and many other features will make Jsonify HTML a full-fledged parser!
A preview of new syntax (in YAML format):
# version: "2.0" # built-in settings
$i: 0 # define variables
$post:
  Object: # anonymous object
    init(): # init() function executes before parsing
      - clean()
    String title: # type annotation
      parse(): # parse() function converts HTML to a JSON entry
        - select_one(.title) # commands are functions ...
        - inner_text(strip=True) # ... and can be called like those in Python
    Datetime date:
      - select_one(.date) # parse() can be omitted for simple entries
      - inner_text(strip=True)
    String author:
      - select_one(.author)
      - inner_text(strip=True)
    Uri url:
      - select_one(.//a/@href) # xpath is fully supported
    Integer id:
      - eval($i) # reference to a variable
    increment_id(): # define a method called increment_id
      - exec($i += 1)
    final(): # final() function executes after parsing
      - increment_id() # methods are commands, too
List[Object]: # nested type annotation
  parse():
    - select_one(article)
    - inner_html()
    - clean()
    - select(.post)
    - foreach(e -> apply(e, $post)) # lambda without side effects

All we need is an HTML document,
<div class="text">Hello World</div>

and a template to parse it,
{
"$type": "str",
"$cmd": [ ["select_one", ".text"], ["inner_text"] ]
}

$type specifies the data type of the final object, and $cmd contains a list of instructions that tell the parser to select_one the DOM element matching .text and extract its inner_text. So we get,
"Hello World"

Requires Python 3.7+. To install:
pip install --upgrade jsonify_html

Our goal is to extract useful information from an HTML document. This is especially useful when we DON'T have the original data, for example when the webpage was scraped from a website. Here is a page from James Potter's blog:
<html>
<head>
<title> an awesome blog </title>
</head>
<body>
<!-- other DOMs -->
<div class="post-list">
<div class="post">
<p class="title">How to convert HTML to JSON ?</p>
<p class="meta">
<span class="author">James Potter</span>
<span class="date">2020-10-02</span>
<span class="comments">20</span>
</p>
<p class="preview">Using <em>JSONify HTML</em> ! ...</p>
<div class="image-container">
<img src="post20201001_head.png">
</div>
</div>
<div class="post">
<p class="title">Hello World</p>
<p class="meta">
<span class="author">James Potter</span>
<span class="date">2020-10-01</span>
<span class="comments">5</span>
</p>
<p class="preview">I opened a blog! ...</p>
<div class="image-container">
<img src="post20201001_head.png">
</div>
</div>
</div>
<!-- other DOMs -->
</body>
</html>

We want well-structured data, presented as an object in Python or as JSON elsewhere. Let's try to write a template:
// main.json
{
"$type": "list",
"$cmd": [
["select_one", ".post-list"],
["recursive"]
]
}

Here $type tells the parser that our final object is a list, and $cmd is a list of command lines. Each command line begins with the name of a command, followed by its arguments. There are two commands:
- select_one extracts div.post-list from our webpage; you may use either CSS or XPath selectors.
- recursive tells the parser to automatically identify and process every DOM element inside div.post-list (which was extracted by the preceding command).
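
To make the "command name followed by its arguments" convention concrete, here is a minimal sketch of how such a command list could be dispatched. It is not how Jsonify HTML is implemented: `COMMANDS`, `run_commands` and the toy `select_one` (which only understands `.class` selectors) are hypothetical names written for this illustration, and the standard library's `xml.etree.ElementTree` stands in for a real HTML parser.

```python
import xml.etree.ElementTree as ET


def select_one(node, selector):
    """Toy selector (assumption): supports only '.classname', first match wins."""
    if not selector.startswith("."):
        raise ValueError("this sketch only understands '.class' selectors")
    wanted = selector[1:]
    for element in node.iter():  # iter() also visits the node itself
        if wanted in element.get("class", "").split():
            return element
    raise LookupError(f"no element matches {selector!r}")


def inner_text(node):
    """Concatenate all text found inside the node."""
    return "".join(node.itertext())


# Hypothetical registry mapping command names to plain functions.
COMMANDS = {"select_one": select_one, "inner_text": inner_text}


def run_commands(node, cmd_list):
    """Feed the result of each command line into the next one."""
    value = node
    for name, *args in cmd_list:
        value = COMMANDS[name](value, *args)
    return value


fragment = (
    '<div class="post-list">'
    '<div class="post"><p class="title">Hello World</p></div>'
    "</div>"
)
cmd = [["select_one", ".post-list"], ["select_one", ".title"], ["inner_text"]]
result = run_commands(ET.fromstring(fragment), cmd)
print(result)
```

Each command receives the output of the previous one, which is why the order of the lines in $cmd matters.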
It works like magic, right? All we need to do next is write some sub-templates to instruct the parser.
// post.json
{
"$type": "object",
"$match": ".post",
"name": "post",
"title": {
"$type": "str",
"$cmd": [ ["select_one", ".title"], ["inner_text"] ]
},
"author": {
"$type": "str",
"$cmd": [ ["select_one", ".author"], ["inner_text"] ]
},
"date": {
"$type": "datetime",
"$cmd": [ ["select_one", ".date"], ["inner_text"] ]
},
"comments": {
"$type": "int",
"$cmd": [ ["select_one", ".comments"], ["inner_text"] ]
},
"preview": {
"$type": "html",
"$cmd": [ ["select_one", ".preview"], ["inner_html"] ]
},
"image": {
"$type": "str",
"$cmd": [ ["select_one", ".//img/@src"] ]
}
}

Notice: each key that starts with $ is a keyword that gives the parser essential instructions; any other key is a data key that appears in the final object. A template can also include sub-templates in place. For instance, title is a data key, and its value is a sub-template. Let's run a demo to see what we get:
[folder structure]
+ root
|--- webpage.html
|--- main.json
|--- post.json
Here's the code:
from jsonify_html import from_package
import json
with open('webpage.html') as f:  # close the file once it has been read
    html = f.read()
data = from_package('main.json', html)
print(json.dumps(data, indent=2)) # prettify JSON

It does work like magic! We finally get,
// final results
[
{
"name": "post",
"title": "How to convert HTML to JSON ?",
"author": "James Potter",
"date": "2020-10-02T00:00:00",
"comments": 20,
"preview": "<p>Using <em>JSONify HTML</em> ! ...</p>",
"image": "post20201001_head.png"
},
{
"name": "post",
"title": "Hello World",
"author": "James Potter",
"date": "2020-10-01T00:00:00",
"comments": 5,
"preview": "<p>I opened a blog! ...</p>",
"image": "post20201001_head.png"
}
]

What if we want to combine those templates into a single one? It's straightforward with the foreach command:
// template.json
{
"$type": "list",
"$cmd": [
["select", ".post"],
["foreach",
{
"template": {
"$type": "object",
"$match": ".post",
"name": "post",
"title": {
"$type": "str",
"$cmd": [ ["select_one", ".title"], ["inner_text"] ]
},
"author": {
"$type": "str",
"$cmd": [ ["select_one", ".author"], ["inner_text"] ]
},
"date": {
"$type": "datetime",
"$cmd": [ ["select_one", ".date"], ["inner_text"] ]
},
"comments": {
"$type": "int",
"$cmd": [ ["select_one", ".comments"], ["inner_text"] ]
},
"preview": {
"$type": "html",
"$cmd": [ ["select_one", ".preview"], ["inner_html"] ]
},
"image": {
"$type": "str",
"$cmd": [ ["select_one", ".//img/@src"] ]
}
}
}
]
]
}

Note that here we select all .post elements, instead of just the one .post-list.
from jsonify_html import from_template
import json
with open('webpage.html') as f:  # close the file once it has been read
    html = f.read()
with open('template.json') as f:
    template = json.load(f)
data = from_template(template, html)
print(json.dumps(data, indent=2)) # prettify JSON

It works again!
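
For readers curious about what "select every .post, then apply a template to each" amounts to, the following is a rough standard-library sketch of the same idea, including the kind of type conversion that $type implies. It is an illustration under assumptions, not the library's code: `CONVERTERS`, `select`, `apply_template` and `FIELDS` are hypothetical names, the page is a trimmed copy of the blog example, and `xml.etree.ElementTree` stands in for a real HTML parser.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Trimmed copy of the blog page used earlier (well-formed, so ElementTree can read it).
PAGE = """
<div class="post-list">
  <div class="post">
    <p class="title">How to convert HTML to JSON ?</p>
    <span class="date">2020-10-02</span>
    <span class="comments">20</span>
  </div>
  <div class="post">
    <p class="title">Hello World</p>
    <span class="date">2020-10-01</span>
    <span class="comments">5</span>
  </div>
</div>
"""

# Hypothetical converters standing in for "$type"; not the library's own.
CONVERTERS = {
    "str": str.strip,
    "int": int,
    "datetime": lambda text: datetime.fromisoformat(text.strip()).isoformat(),
}


def select(node, class_name):
    """Toy 'select': every element (node included) carrying the given class."""
    return [e for e in node.iter() if class_name in e.get("class", "").split()]


def apply_template(node, fields):
    """Toy 'foreach' body: build one dict from {key: (class, type)} pairs."""
    record = {}
    for key, (class_name, type_name) in fields.items():
        text = "".join(select(node, class_name)[0].itertext())
        record[key] = CONVERTERS[type_name](text)
    return record


# Hypothetical, simplified stand-in for the sub-template inside foreach.
FIELDS = {
    "title": ("title", "str"),
    "date": ("date", "datetime"),
    "comments": ("comments", "int"),
}

root = ET.fromstring(PAGE.strip())
posts = [apply_template(post, FIELDS) for post in select(root, "post")]
print(posts)
```

The sketch selects both .post elements directly, just as the single template above does, and converts each field according to its declared type, so comments comes out as an integer and date as an ISO 8601 string.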