html5lib/html5lib-python

Add position information for text nodes

corynezin opened this issue · 0 comments

Would it be possible to add position information (line and column) to text nodes, or at least make it available to the tree builder? I implemented a minimal proof of concept that attaches the position to each token and passes it along to the DOM tree builder. It produces the following result:

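import html5lib

html = '<div>&amp;<p>b<span>c</span></p> cab</div>'

parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
doc = parser.parse(html)


def print_positions(node):
    # Recursively print the source position of every descendant node.
    # `sourcepos` is set by the proof-of-concept patch; stock html5lib
    # does not provide it, hence the hasattr() guard.
    for child in node.childNodes:
        if hasattr(child, 'sourcepos'):
            print(child.sourcepos, child)
        print_positions(child)


print_positions(doc)

Output: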
None <DOM Element: head at 0x10bbed0d0>
None <DOM Element: body at 0x10bbed1f0>
(1, 5) <DOM Element: div at 0x10bbfb790>
(1, 10) <DOM Text node "'&'">
(1, 13) <DOM Element: p at 0x10bbfb820>
(1, 14) <DOM Text node "'b'">
(1, 20) <DOM Element: span at 0x10bbfb8b0>
(1, 21) <DOM Text node "'c'">
(1, 33) <DOM Text node "' '">
(1, 36) <DOM Text node "'cab'">
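
The columns above look consistent with "number of characters consumed on the current line once the token ends", with 1-based line numbers. Below is a rough, self-contained sketch of that convention only. It does not use html5lib at all: `offset_to_position` and the `TOKEN` regex are hypothetical helpers made up for illustration, and the regex is a crude stand-in for the real tokenizer.

```python
import re


def offset_to_position(text, offset):
    """Return (line, column) after consuming the first `offset` characters.

    Line is 1-based; column is the number of characters consumed on that line.
    """
    consumed = text[:offset]
    line = consumed.count("\n") + 1
    # rfind() returns -1 when there is no newline, so the column is then `offset`.
    column = offset - (consumed.rfind("\n") + 1)
    return (line, column)


html = '<div>&amp;<p>b<span>c</span></p> cab</div>'

# Hypothetical, heavily simplified token pattern: tags, character references,
# whitespace runs, and other text. Real HTML tokenization is far more involved.
TOKEN = re.compile(r"<[^>]*>|&\w+;|\s+|[^<&\s]+")

# Pair each token with the position reached at the end of that token.
positions = [
    (offset_to_position(html, match.end()), match.group())
    for match in TOKEN.finditer(html)
]

for pos, token in positions:
    print(pos, repr(token))
```

For the input above, this gives the same (line, column) pairs as the proof-of-concept output for the tokens that become nodes, plus entries for the end tags. In a real implementation the position would presumably come from the parser's input stream rather than being recomputed from offsets like this.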

I would be willing to implement it.