WHParallelParser (the WikiCode & HTML Parallel Parser) is a Python package that provides a parser for Wikipedia HTML and WikiCode.
The parser uses the parallel markup approach: the raw text and the formatting information are kept separately, and each tag records its position and length in the document. Because the text stays free of embedded markup, this format has advantages over traditional embedded markup and can be used in machine learning.
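To illustrate why separated markup is convenient for machine learning, here is a minimal sketch that turns tags into one label per character. It uses a simplified, hypothetical format with flat character offsets and an inclusive end; it is not the package's exact output format, and `char_labels` is an illustrative helper, not part of WHParallelParser.

```python
# Simplified, hypothetical parallel markup: the text carries no markup,
# and each tag points into it with flat, inclusive character offsets.
text = "link test"
tags = [
    {"tag": "a", "start": 0, "end": 3},
    {"tag": "b", "start": 5, "end": 8},
]


def char_labels(text, tags, default="O"):
    """Assign one label per character; later tags overwrite earlier ones."""
    labels = [default] * len(text)
    for tag in tags:
        for i in range(tag["start"], tag["end"] + 1):
            labels[i] = tag["tag"]
    return labels


labels = char_labels(text, tags)
for ch, label in zip(text, labels):
    print(repr(ch), label)
```

Because the text is never modified, the same string can be fed to a tokenizer while the labels are aligned to it by position.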
File input types for the parser:
- HyperText Markup Language (HTML)
- WikiCode
WHParallelParser is available through pip. You can install the latest release from the command line:
pip install WHParallelParser
To initialize the parser, specify a cache_folder:
from WHParallelParser import WHParallelParser
from WHParallelParser import Iterator
from WHParallelParser import Rule
cache_folder = "./cached"
parser = WHParallelParser(cache_folder=cache_folder)
After initializing the parser, we can call the parse_html function:
output = parser.parse_html("<div>test<h1>test</div>")
print(output.text)
print(output.data)
Example output.text output
['test', 'test']
Example output.data output
[
    {
        "tag": "h1",
        "attrs": {},
        "start": [
            1,
            0
        ],
        "end": [
            1,
            3
        ]
    },
    {
        "tag": "div",
        "attrs": {},
        "start": [
            0,
            0
        ],
        "end": [
            2,
            -1
        ]
    }
]
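Reading the example above, each start and end appears to be a [segment index, character offset] pair into output.text, with an inclusive end offset, and an end of [n, -1] seems to mean "through the end of segment n - 1". This reading is inferred from the sample output only, so treat it as an assumption. Under that assumption, a hypothetical helper `span_text` (not part of the package) can recover the text a tag covers:

```python
def span_text(segments, tag):
    """Return the pieces of text covered by a tag, one per segment.

    Assumes [segment, offset] positions with an inclusive end offset.
    """
    (s_seg, s_off), (e_seg, e_off) = tag["start"], tag["end"]
    pieces = []
    # Clamp to the last existing segment: an end such as [2, -1] points
    # one segment past the end of a two-segment text.
    for i in range(s_seg, min(e_seg, len(segments) - 1) + 1):
        lo = s_off if i == s_seg else 0
        hi = e_off + 1 if i == e_seg else len(segments[i])
        piece = segments[i][lo:hi]
        if piece:
            pieces.append(piece)
    return pieces


# Data copied from the parse_html example above.
text = ["test", "test"]
data = [
    {"tag": "h1", "attrs": {}, "start": [1, 0], "end": [1, 3]},
    {"tag": "div", "attrs": {}, "start": [0, 0], "end": [2, -1]},
]

for tag in data:
    print(tag["tag"], span_text(text, tag))
```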
After initializing the parser, we can also call the parse_wikicode function:
output = parser.parse_wikicode("[[link]]")
print(output.text)
print(output.data)
Example output.text output
['link']
Example output.data output
[
    {
        "tag": "a",
        "attrs": {
            "href": "/wiki/Link",
            "class": [
                "mw-disambig"
            ],
            "title": "Link"
        },
        "start": [
            0,
            0
        ],
        "end": [
            0,
            3
        ]
    },
    {
        "tag": "p",
        "attrs": {},
        "start": [
            0,
            0
        ],
        "end": [
            0,
            3
        ]
    },
    {
        "tag": "div",
        "attrs": {
            "class": [
                "mw-parser-output"
            ]
        },
        "start": [
            0,
            0
        ],
        "end": [
            1,
            -1
        ]
    }
]
We can iterate over a simplewiki NDJSON file with Iterator:
path_to_simplewiki = "example_simplewiki.json/simplewiki.ndjson"
for html in Iterator(path_to_simplewiki):
    print(html.text)
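For readers unfamiliar with the .ndjson format, the sketch below shows what iterating over newline-delimited JSON looks like using only the standard library. It is not how Iterator is implemented; `iter_ndjson` and the field names `name` and `html` are hypothetical, and the real dump has its own schema.

```python
import io
import json


def iter_ndjson(lines):
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical records standing in for a dump file opened with open().
sample = io.StringIO(
    '{"name": "A", "html": "<p>first</p>"}\n'
    "\n"
    '{"name": "B", "html": "<p>second</p>"}\n'
)

names = [record["name"] for record in iter_ndjson(sample)]
print(names)
```

Reading line by line keeps memory use flat, which matters for dumps that are too large to load at once.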
We can add rules. In the example below, the link matches the added rule and is left out of the output:
parser.rules.add_rule(Rule({"name": "a", "class": "mw-disambig"}))
output = parser.parse_wikicode("[[link]] '''test'''")
print(output.text)
print(output.data)
Example output.text output
['test']
Example output.data output
[
    {
        "tag": "b",
        "attrs": {},
        "start": [
            0,
            1
        ],
        "end": [
            0,
            4
        ]
    },
    {
        "tag": "p",
        "attrs": {},
        "start": [
            0,
            0
        ],
        "end": [
            0,
            4
        ]
    },
    {
        "tag": "div",
        "attrs": {
            "class": [
                "mw-parser-output"
            ]
        },
        "start": [
            0,
            0
        ],
        "end": [
            1,
            -1
        ]
    }
]
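Judging by the example, a rule such as {"name": "a", "class": "mw-disambig"} selects elements by tag name and attribute value. The sketch below shows one plausible way such matching could work on tag dictionaries. It is an assumption for illustration, not the package's Rule implementation, and `matches` is a hypothetical helper; it only tests for a match and does not remove any text.

```python
def matches(rule, tag):
    """Return True if a tag dict satisfies every key of a rule dict."""
    for key, expected in rule.items():
        if key == "name":
            # "name" is compared against the tag name.
            if tag["tag"] != expected:
                return False
        else:
            actual = tag["attrs"].get(key)
            # Attributes such as "class" hold a list of values.
            values = actual if isinstance(actual, list) else [actual]
            if expected not in values:
                return False
    return True


rule = {"name": "a", "class": "mw-disambig"}
link = {
    "tag": "a",
    "attrs": {"href": "/wiki/Link", "class": ["mw-disambig"], "title": "Link"},
    "start": [0, 0],
    "end": [0, 3],
}
bold = {"tag": "b", "attrs": {}, "start": [0, 5], "end": [0, 8]}

print(matches(rule, link))
print(matches(rule, bold))
```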
We can also remove rules. Once the rule is removed, the link appears in the output again:
parser.rules.remove_rule(Rule({"name": "a", "class": "mw-disambig"}))
output = parser.parse_wikicode("[[link]] '''test'''")
print(output.text)
print(output.data)
Example output.text output
['link test']
Example output.data output
[
    {
        "tag": "a",
        "attrs": {
            "href": "/wiki/Link",
            "class": [
                "mw-disambig"
            ],
            "title": "Link"
        },
        "start": [
            0,
            0
        ],
        "end": [
            0,
            3
        ]
    },
    {
        "tag": "b",
        "attrs": {},
        "start": [
            0,
            5
        ],
        "end": [
            0,
            8
        ]
    },
    {
        "tag": "p",
        "attrs": {},
        "start": [
            0,
            0
        ],
        "end": [
            0,
            8
        ]
    },
    {
        "tag": "div",
        "attrs": {
            "class": [
                "mw-parser-output"
            ]
        },
        "start": [
            0,
            0
        ],
        "end": [
            1,
            -1
        ]
    }
]
Each wiki link is defined by the tag with the following structure:
{
    'tag': 'link',
    'attrs': {
        'href': string,
        'class': [string],
        ...
    },
    'start': [int, int],
    'end': [int, int]
}
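Since output.data is a plain list of dictionaries, links can be collected with ordinary list filtering. The sketch below uses a hypothetical helper `find_tags` (not part of the package) and takes the tag name as a parameter; the sample data is abbreviated from the parse_wikicode output shown earlier, where links carry the tag "a".

```python
def find_tags(data, tag_name):
    """Return every tag dict in data whose tag name equals tag_name."""
    return [tag for tag in data if tag["tag"] == tag_name]


# Abbreviated from the parse_wikicode("[[link]] '''test'''") example output.
data = [
    {
        "tag": "a",
        "attrs": {"href": "/wiki/Link", "class": ["mw-disambig"], "title": "Link"},
        "start": [0, 0],
        "end": [0, 3],
    },
    {"tag": "b", "attrs": {}, "start": [0, 5], "end": [0, 8]},
    {"tag": "p", "attrs": {}, "start": [0, 0], "end": [0, 8]},
]

for link in find_tags(data, "a"):
    print(link["attrs"]["href"], link["start"], link["end"])
```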
The project contains unit tests that check whether the parser works as expected. To execute all the tests, run the following command in the project root directory:
pytest ./tests
To execute a specific test suite:
pytest ./tests/test_wikilink.py
The Three Musketeers | MMK_team
- Maciek
- Miłosz
- Kajetan
We are students from Gdańsk University of Technology