fb55/htmlparser2

Use high5 as a new tokenizer

fb55 opened this issue · 10 comments

fb55 commented

Lately, a lot of tokenization-related bugs have popped up, and even though the tree-building part of high5 isn't done, its tokenizer should be ready.

This will be the 4.0.0 release of this module and will break some code, especially since a new doctype callback will be introduced and XML declarations (e.g. <?xml …>) inside HTML documents will be handled as comments.
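To make the two breaking changes concrete, here is a minimal sketch of how a spec-following HTML tokenizer treats these two kinds of markup. The `Token` type and the `classifyMarkup` function are hypothetical names made up for illustration; they are not part of high5 or htmlparser2. The sketch only looks at a single piece of markup and assumes the HTML spec's behavior, where `<?` is not a valid tag opener, so everything up to the first `>` ends up as comment data.

```typescript
// Hypothetical token shapes for illustration only; not the API of high5 or htmlparser2.
type Token =
  | { type: "doctype"; name: string }
  | { type: "comment"; data: string }
  | { type: "other"; raw: string };

// Classifies one piece of markup that starts with "<" and runs to the first ">".
function classifyMarkup(input: string): Token {
  const end = input.indexOf(">");
  // Text between the leading "<" and the first ">" (or the end of input).
  const inner = end === -1 ? input.slice(1) : input.slice(1, end);

  // "<?" cannot start a tag in HTML, so a spec-following tokenizer emits
  // everything up to ">" as a comment, including the leading "?".
  if (input.startsWith("<?")) {
    return { type: "comment", data: inner };
  }

  // "<!DOCTYPE name ...>" is matched case-insensitively; the name is lowercased.
  const doctype = /^!doctype\s+([^\s>]+)/i.exec(inner);
  if (doctype) {
    return { type: "doctype", name: doctype[1].toLowerCase() };
  }

  // Everything else (tags, real comments, text) is out of scope for this sketch.
  return { type: "other", raw: input };
}

console.log(classifyMarkup('<?xml version="1.0"?>'));
console.log(classifyMarkup("<!DOCTYPE html>"));
```

Code that currently relies on XML declarations being reported separately in HTML mode would instead see a comment, and doctypes would arrive through their own callback.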

On the plus side, this means that we'll have a spec-compliant tokenizer, so any tokenization bug can be checked against the spec.

This sounds awesome.

When will this be available?

fb55 commented

The tokenizer currently lacks support for positions. As soon as that's added, a new version will become available. I have no idea when I'll have the time & be motivated to do it, so I can't give a timetable or anything.

Is there any update on this?

@HoldYourWaffle I'm not on the team, but parse5 is now the default parser.

@stevenvachon The default parser for what?

fb55 commented

That would be cheerio. htmlparser2 is still shipped with the project and is used as the default parser when xmlMode: true is set.

We can probably close this?

fb55 commented

Closing this as htmlparser2 should just keep its existing tokenizer.