Parsing fails in some websites
Appress opened this issue · 3 comments
Hi there,
We are using linkedom in our project, which relies on htmlparser2.
Parsing fails for some pages with recent versions of htmlparser2 (e.g. this article).
To reproduce, parse the page as shown below and open the generated HTML in a browser:
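import { DOMParser } from 'linkedom';

// htmlFromTheArticle holds the HTML source of the affected page
const document = (new DOMParser).parseFromString(htmlFromTheArticle, 'text/html');
const html = document.body.innerHTML;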
Instead of the original page, the output now includes raw HTML code:
Διαβάστε το πλήρες κείμενο του σημειώματος του CEO της UBS στο <a href="https://www.newmoney.gr/roh/bloomberg/to-esoteriko-simioma-tou-ceo-tis-ubs-pros-tous-ergazomenous-meta-tin-exagora-tis-credit-suisse/" target="_blank" rel="noopener noreferrer">newmoney.gr</a><br> <br> <strong><a href="https://www.protothema.gr/oles-oi-eidiseis/" target="_blank" rel="noopener noreferrer">Ειδήσεις σήμερα:</a><br> <br> <a href="https://www.protothema.gr/greece/article/1351532/xanthi-ston-eisaggelea-simera-o-36hronos-pou-skotose-ton-45hrono-epeidi-ton-theorise-roufiano/" target="_blank" rel="noopener noreferrer">...
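To check many documents without opening each one in a browser, a rough heuristic can flag output that contains escaped tags. The sketch below assumes the raw markup ends up in `innerHTML` as escaped text (for example `&lt;a href=...`), which is not confirmed in this report. `countEscapedTags` is a hypothetical helper, not part of linkedom or htmlparser2.

```javascript
// Hypothetical helper (not a linkedom or htmlparser2 API): counts escaped
// opening tags such as "&lt;a " or "&lt;strong&gt;" in serialized HTML.
// Assumption: the bug surfaces as escaped markup in the serialized output.
function countEscapedTags(serializedHtml) {
  // "&lt;" followed by a tag name, then whitespace, "&gt;" or "/".
  const matches = serializedHtml.match(/&lt;[a-zA-Z][a-zA-Z0-9]*(?=\s|&gt;|\/)/g);
  return matches === null ? 0 : matches.length;
}

// Illustrative inputs only, not taken from the affected page.
const healthy = '<p>Read more at <a href="https://example.com/">example.com</a></p>';
const broken = '<p>Read more at &lt;a href="https://example.com/"&gt;example.com&lt;/a&gt;</p>';

console.log(countEscapedTags(healthy)); // 0
console.log(countEscapedTags(broken));  // 1
```

A page that legitimately displays escaped code samples would also be counted, so the number is only meaningful when compared against the same count for the page source or for the output of the older parser version.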
This has already happened with many HTML documents. If I downgrade to htmlparser2 v6.1.0, parsing works properly.
I tried to debug it, and the problem originates in Tokenizer.ts. When I simply replace these lines
if (this.isSpecial) {
    this.state = State.InSpecialTag;
    this.sequenceIndex = 0;
} else {
    this.state = State.Text;
}
with
this.state = State.Text;
it works properly. I'm not sure what the proper fix is, ideally one that does not affect the performance of htmlparser2, so I opened this issue instead.
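For readers unfamiliar with what a "special tag" state is for, here is a toy model, assuming `isSpecial` marks elements such as `<script>`, `<style>` or `<title>` whose content is treated as text until the matching closing tag. This is not htmlparser2's Tokenizer; `toyTokenize`, `RAW_TEXT_TAGS` and `honorRawText` are made-up names, and the sketch does not claim to identify the actual cause of this issue.

```javascript
// Toy model only, NOT htmlparser2's implementation. All names are hypothetical.
const RAW_TEXT_TAGS = new Set(['script', 'style', 'title']);

function toyTokenize(html, { honorRawText = true } = {}) {
  const tokens = [];
  let i = 0;
  while (i < html.length) {
    const lt = html.indexOf('<', i);
    if (lt === -1) {
      tokens.push({ type: 'text', value: html.slice(i) });
      break;
    }
    if (lt > i) tokens.push({ type: 'text', value: html.slice(i, lt) });
    const gt = html.indexOf('>', lt);
    if (gt === -1) {
      // Unterminated tag: treat the remainder as text.
      tokens.push({ type: 'text', value: html.slice(lt) });
      break;
    }
    const tag = html.slice(lt, gt + 1);
    tokens.push({ type: 'tag', value: tag });
    i = gt + 1;

    // Only opening tags match here; "</a>" does not start with a letter.
    const open = /^<([a-zA-Z][a-zA-Z0-9]*)/.exec(tag);
    const name = open ? open[1].toLowerCase() : null;
    if (honorRawText && name !== null && RAW_TEXT_TAGS.has(name)) {
      // "Special tag" mode: everything up to the matching closing tag
      // (or the end of input, if there is none) is emitted as text.
      const closeAt = html.slice(i).search(new RegExp('</' + name, 'i'));
      const end = closeAt === -1 ? html.length : i + closeAt;
      if (end > i) tokens.push({ type: 'text', value: html.slice(i, end) });
      i = end;
    }
  }
  return tokens;
}

const countTags = (tokens) => tokens.filter((t) => t.type === 'tag').length;
const sample = '<title>News<p>Hello <a href="#">link</a></p>'; // <title> never closed

console.log(countTags(toyTokenize(sample)));                          // 1
console.log(countTags(toyTokenize(sample, { honorRawText: false }))); // 5
```

In this model, an unclosed special element swallows all following markup as text, while always switching to the plain text state (the effect of the replacement above) tokenizes it as tags again. That matches the symptom in spirit, but it also shows why dropping the special state entirely would change how real `<script>` and `<style>` content is handled.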
Hi @Appress, thanks for opening this issue! Could you provide a bit of markup that is parsed differently between parser versions? There isn't anything here that stands out to me, and getting the relevant snippet would be a great help.
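One possible way to arrive at such a snippet is to shrink the failing page automatically. The following is a minimal sketch of a greedy reducer, assuming a caller-supplied predicate `stillDiffers(html)` that returns true while the two parser versions still disagree on the input. Both `shrinkMarkup` and `stillDiffers` are hypothetical names, and the predicate in the example is a stand-in, not a real parser comparison.

```javascript
// Hypothetical helper: repeatedly removes chunks of `html` as long as
// `stillDiffers(candidate)` stays true, halving the chunk size each round.
function shrinkMarkup(html, stillDiffers) {
  let current = html;
  let chunk = Math.max(1, Math.floor(current.length / 2));
  while (true) {
    let start = 0;
    while (start < current.length) {
      const candidate = current.slice(0, start) + current.slice(start + chunk);
      if (stillDiffers(candidate)) {
        current = candidate; // keep the smaller input, retry at the same offset
      } else {
        start += chunk;      // this chunk is needed, move on
      }
    }
    if (chunk === 1) break;
    chunk = Math.floor(chunk / 2);
  }
  return current;
}

// Stand-in predicate for illustration: pretend the discrepancy is triggered
// whenever the input contains "<title>". A real predicate would parse the
// candidate with both parser versions and compare the results.
const fakeStillDiffers = (candidate) => candidate.includes('<title>');

console.log(shrinkMarkup('ab<title>cd', fakeStillDiffers)); // "<title>"
```

The result is not guaranteed to be the smallest possible input, and character-level removal can produce malformed markup along the way, but it usually narrows a full page down to something short enough to paste into an issue.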