Use high5 as a new tokenizer

Question

fb55 opened this issue 10 years ago · 10 comments

Lately, a lot of tokenization-related bugs have popped up, and even though the tree-building part of high5 isn't done, its tokenizer should be ready.

This will be the 4.0.0 release of this module and will break some code – especially since a new doctype callback will be introduced and XML declarations (e.g. <?xml …>) inside HTML documents will be handled as comments.
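
To make the two breaking changes above concrete, here is a minimal TypeScript sketch. It is not high5's or htmlparser2's actual code: the Callbacks interface, the tokenizeMarkup and collect functions, and the token shapes are all hypothetical names invented for illustration. It assumes the HTML spec's behaviour that "<?" opens a bogus comment running up to the next ">", and it recognizes only that construct and the doctype; everything else is passed through as text.

```typescript
// Hypothetical token and callback shapes, invented for this sketch.
type Token =
  | { type: "comment"; data: string }
  | { type: "doctype"; name: string }
  | { type: "text"; data: string };

interface Callbacks {
  oncomment(data: string): void;
  ondoctype(name: string): void;
  ontext(data: string): void;
}

// Deliberately tiny: handles "<?...>" and "<!DOCTYPE ...>" only.
function tokenizeMarkup(input: string, cb: Callbacks): void {
  let i = 0;
  while (i < input.length) {
    if (input.slice(i, i + 2) === "<?") {
      // Spec-style bogus comment: data is everything after "<" up to ">".
      const end = input.indexOf(">", i);
      const stop = end === -1 ? input.length : end;
      cb.oncomment(input.slice(i + 1, stop));
      i = end === -1 ? input.length : end + 1;
    } else if (input.slice(i, i + 9).toLowerCase() === "<!doctype") {
      const end = input.indexOf(">", i);
      const stop = end === -1 ? input.length : end;
      // Report only the lowercased doctype name (e.g. "html").
      const name = input.slice(i + 9, stop).trim().split(/\s+/)[0] || "";
      cb.ondoctype(name.toLowerCase());
      i = end === -1 ? input.length : end + 1;
    } else {
      // Anything else, including ordinary tags, is passed through as text.
      const next = input.indexOf("<", i + 1);
      const stop = next === -1 ? input.length : next;
      cb.ontext(input.slice(i, stop));
      i = stop;
    }
  }
}

// Convenience wrapper that records callback events as a token list.
function collect(input: string): Token[] {
  const tokens: Token[] = [];
  tokenizeMarkup(input, {
    oncomment: (data) => tokens.push({ type: "comment", data }),
    ondoctype: (name) => tokens.push({ type: "doctype", name }),
    ontext: (data) => tokens.push({ type: "text", data }),
  });
  return tokens;
}

console.log(collect('<?xml version="1.0"?><!DOCTYPE html>hi'));
```

In this sketch the XML declaration arrives through the comment callback with the data ?xml version="1.0"?, followed by a doctype event named html and a text event. Code that previously relied on a processing-instruction style event for <?xml …> would need to look at comments instead.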

On the plus side, this means that we've got a spec-compliant tokenizer, so all tokenization bugs can be pointed to the spec.

stevenvachon commented 9 years ago

+5

Answer 1 · 2014-11-16T13:40:15.000Z

This sounds awesome.

Answer 2 · 2015-02-23T15:04:09.000Z

When will this be available?

Answer 3 · 2015-02-23T15:54:59.000Z

The tokenizer currently lacks support for positions. As soon as that's added, a new version will become available. I have no idea when I'll have the time and motivation to do it, so I can't give a timetable or anything.

Answer 4 · 2019-04-13T15:02:07.000Z

Is there any update on this?

Answer 5 · 2019-04-14T22:08:31.000Z

@HoldYourWaffle I'm not on the team, but parse5 is now the default parser.

Answer 6 · 2019-08-03T20:29:54.000Z

@stevenvachon The default parser for what?

Answer 7 · 2019-08-03T20:38:20.000Z

That would be cheerio. htmlparser2 is still shipped with the project and is used as the default parser when xmlMode: true is set.

Answer 8 · 2019-08-03T21:28:52.000Z

We can probably close this?

Answer 9 · 2020-09-01T14:45:54.000Z

Closing this as htmlparser2 should just keep its existing tokenizer.