Use high5 as a new tokenizer
fb55 opened this issue · 10 comments
Lately, a lot of tokenization-related bugs have popped up, and even though the tree-building part of high5 isn't done, its tokenizer should be ready.
This will be the 4.0.0 release of this module and will break some code, especially since a new doctype callback will be introduced and XML declarations (e.g. `<?xml …>`) inside HTML documents will be handled as comments.
On the plus side, this means that we've got a spec-compliant tokenizer, so all tokenization bugs can be pointed to the spec.
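To illustrate the XML declaration change mentioned above: under the HTML spec's tokenization rules, input starting with `<?` is turned into a "bogus comment" whose data runs from the `?` up to the next `>`. The snippet below is only a rough sketch of that rule. `tokenizeXmlDeclaration` is a hypothetical helper written for this example; it is not part of htmlparser2 or high5 and it ignores details such as NUL handling.

```javascript
// Hypothetical helper for illustration only; not an htmlparser2 or high5 API.
// Mimics the HTML spec's "bogus comment" handling for input that starts with "<?".
function tokenizeXmlDeclaration(input) {
  if (!input.startsWith("<?")) {
    return null; // not a "<?" construct, out of scope for this sketch
  }
  const end = input.indexOf(">");
  // Comment data starts at "?" and stops before the first ">".
  // Without a ">", the comment runs to the end of the input.
  const data = end === -1 ? input.slice(1) : input.slice(1, end);
  const rest = end === -1 ? "" : input.slice(end + 1);
  return { type: "comment", data, rest };
}

const token = tokenizeXmlDeclaration('<?xml version="1.0"?><p>hi</p>');
console.log(token.type); // "comment"
console.log(token.data); // ?xml version="1.0"?
console.log(token.rest); // <p>hi</p>
```

Code that previously relied on a processing-instruction event for `<?xml …>` in HTML mode would, under this assumption, see a comment instead.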
This sounds awesome.
When will this be available?
The tokenizer currently lacks support for positions. As soon as that's added, a new version will become available. I have no idea when I'll have the time & be motivated to do it, so I can't give a timetable or anything.
Is there any update on this?
@HoldYourWaffle I'm not on the team, but parse5 is now the default parser.
@stevenvachon The default parser for what?
That would be cheerio. htmlparser2 is still shipped with the project and is used as the default parser when `xmlMode: true`.
We can probably close this?
Closing this as htmlparser2 should just keep its existing tokenizer.