Parsing fails in some websites

Question

Appress opened this issue 2 years ago · 3 comments

Hi there,
We are using linkedom in our project, which relies on htmlparser2.

Parsing fails for some pages with recent versions of htmlparser2 (e.g., this article).

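To reproduce, parse the article's HTML as shown below and open the generated HTML in a browser:

    import { DOMParser } from 'linkedom';

    // htmlFromTheArticle holds the raw HTML source of the affected page
    const document = new DOMParser().parseFromString(htmlFromTheArticle, 'text/html');
    const html = document.body.innerHTML;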

Instead of rendering like the original page, the generated HTML now displays raw HTML code as text, for example:

Διαβάστε το πλήρες κείμενο του σημειώματος του CEO της UBS στο <a href="https://www.newmoney.gr/roh/bloomberg/to-esoteriko-simioma-tou-ceo-tis-ubs-pros-tous-ergazomenous-meta-tin-exagora-tis-credit-suisse/" target="_blank" rel="noopener noreferrer">newmoney.gr</a> <a href="https://www.protothema.gr/oles-oi-eidiseis/" target="_blank" rel="noopener noreferrer">Ειδήσεις σήμερα:</a> <a href="https://www.protothema.gr/greece/article/1351532/xanthi-ston-eisaggelea-simera-o-36hronos-pou-skotose-ton-45hrono-epeidi-ton-theorise-roufiano/" target="_blank" rel="noopener noreferrer">...

This has already happened with many HTML documents. If I downgrade to htmlparser2 v6.1.0, parsing works properly.
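To spot affected documents without opening each one in a browser, a rough heuristic is to look for tag-like sequences in the text content of the parsed document. The sketch below is only an assumption about how one might check this: `looksLikeRawMarkup` is a hypothetical helper, not part of linkedom or htmlparser2, and its input is assumed to be something like `document.body.textContent`. It can produce false positives for pages that legitimately display markup as text.

```typescript
// Hypothetical helper (not a linkedom or htmlparser2 API).
// Returns true when the given text contains something shaped like an HTML tag,
// e.g. `<a href="...">` or `</p>`, which normally should not survive as text.
function looksLikeRawMarkup(text: string): boolean {
  // "<", optional "/", a letter, then anything except "<" or ">" up to ">".
  const tagLike = /<\/?[a-zA-Z][^<>]*>/;
  return tagLike.test(text);
}
```

Passing the text content of each parsed page to this function would flag pages where tags ended up as text, at the cost of flagging pages that show code samples on purpose.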

I tried to debug it, and the problem seems to originate in Tokenizer.ts. When I simply replace these lines

            if (this.isSpecial) {
                this.state = State.InSpecialTag;
                this.sequenceIndex = 0;
            } else {
                this.state = State.Text;
            }

with

            this.state = State.Text;

it works properly. I'm not sure what the proper fix is, or how to fix it without affecting the performance of htmlparser2, so I opened this issue instead.
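To illustrate what that branch appears to do, here is a toy model of a tokenizer with a "special tag" mode. It is not htmlparser2's code: `toyTokenize`, `Token`, and `SPECIAL_TAGS` are hypothetical names, and treating `script`, `style`, and `title` as special is an assumption. The sketch only shows how such a mode could turn following markup into text when the special tag is never closed. Whether that is what happens with the article above is unconfirmed.

```typescript
// Toy model only; htmlparser2's real Tokenizer is a character-level state machine.
type Token = { kind: "open" | "close" | "text"; value: string };

// Assumption: tags whose content is treated as text until the matching end tag.
const SPECIAL_TAGS = new Set(["script", "style", "title"]);

function toyTokenize(html: string, honorSpecialTags: boolean): Token[] {
  const tokens: Token[] = [];
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)[^<>]*>/g;
  let pos = 0;
  let match: RegExpExecArray | null;

  while ((match = tagPattern.exec(html)) !== null) {
    // Emit any text between the previous tag and this one.
    if (match.index > pos) {
      tokens.push({ kind: "text", value: html.slice(pos, match.index) });
    }
    const [, slash = "", rawName = ""] = match;
    const name = rawName.toLowerCase();
    const isClose = slash === "/";
    tokens.push({ kind: isClose ? "close" : "open", value: name });
    pos = tagPattern.lastIndex;

    if (!isClose && honorSpecialTags && SPECIAL_TAGS.has(name)) {
      // "Special" mode: everything up to the matching end tag is plain text.
      // If the end tag never appears, the rest of the input becomes text.
      const end = html.slice(pos).search(new RegExp("</" + name, "i"));
      const stop = end === -1 ? html.length : pos + end;
      if (stop > pos) {
        tokens.push({ kind: "text", value: html.slice(pos, stop) });
      }
      pos = stop;
      tagPattern.lastIndex = stop;
    }
  }

  // Emit trailing text after the last tag.
  if (pos < html.length) {
    tokens.push({ kind: "text", value: html.slice(pos) });
  }
  return tokens;
}
```

With an unclosed `<title>` and `honorSpecialTags` set to `true`, everything after the opening tag comes back as a single text token, tags included. With `false`, which mirrors always returning to the text state, the same input is split into ordinary open, close, and text tokens.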

Answer 1 · 2023-04-04T12:19:56.000Z

Hi @Appress, thanks for opening this issue! Could you provide a bit of markup that is parsed differently between parser versions? Nothing here stands out to me, and getting the relevant snippet would be a great help.

Answer 2 · 2023-04-05T07:44:41.000Z

Hi @fb55, for some reason I can't reproduce it anymore. I guess I should have saved the original HTML of the article. I will let you know once I can reproduce it again.

Answer 3 · 2023-04-05T09:22:59.000Z

@Appress No worries! Closing this issue for now; happy to reopen once we can reproduce this.