cheeriojs/cheerio

Automatically adding <html><head><body> ?

Closed this issue ยท 33 comments

OSX 10.12.4
Node 8.0
"cheerio": "1.0.0-rc.1"

Not sure if this is expected, but:

let $ = cheerio.load('<div>Hello</div>');
console.log($.html());

// <html><head></head><body><div>Hello</div></body></html>

While in my case the auto-wrapping as an <html> document is weirdly useful, I was still surprised to see it. Is this a bug, or a feature? I don't see a .fragment() method or anything like that. Maybe just missed something.

Thanks!

Yup, this is a breaking change for version 1.0. We're working on documentation here: #1027

Oh, thanks.

Re: performance here: 6aae084

This switch addresses many long-standing bugs in Cheerio, but some users may
experience slower behavior in performance-critical applications.

Just so I understand, is this saying that setting useHtmlParser2 to true will often result in better performance than not doing so?

And more specifically, if my operation is always modifying a small fragment (something like $('<div>hi</div>').append(...)) which is the best parser to choose in terms of performance?

Thanks for the great library.

My apologies; the useHtmlParser2 option is not considered "public" API and
should not be relied upon. The latest version of the documentation (currently
under review at gh-1033) includes an updated recommendation for using
htmlparser2 with Cheerio.

When it comes to performance in any specific scenario, the best way to verify
is to run some trials locally.

My apologies; the useHtmlParser2 option is not considered "public" API and should not be relied upon.

Would be great to add more tests extracted from valid use cases to make sure users don't end up locked with < 1.0. The number of changed tests suggest that a lot of apps will break with 1.0.

Just posted one example test here. #1029 (comment)

I also deal with fragments, but I could just as easily work with full documents. If it was consistently fragments then I could just grab the elements in <body>. If it was consistently full documents then there's potentially no change, though I may need to remove a <head> or <body> element in order to get back to what was supplied.

@jugglinmike suggests using htmlparser2, but #1033 says that using Parse5 eliminates many long-standing bugs. I also want fewer bugs. I think I can do any of the following:

  • Use htmlparser2 and enjoy any bugs that are discovered by using that library.
  • Use Parse5 and make sure I am only dealing with full documents, thus making a breaking change to my plugins.
  • Use Parse5 and perform whatever necessary magic to get the document or fragment, depending on what I am passed. This sounds very error-prone.

Are there options beyond what I've listed above? Do you have advice for me to update my software? Right now I'm pinning to 0.x just so people who install my module don't get bombarded with problems.

@fidian As per the conversation in #985 (comment) the new parser does not implement all the features in the previous one. For example, preserving entities or normalizing whitespace. For now, it's probably safer to pin 0.x as you suggested.

Please see my comments in #1039 - from my point of view, this breaks Cheerio.

How can I reliably get the root element of the string being parsed, when that string can be either <template>test</template> or <div>test</div>?

@xymbol cheerio is largely a scraper, certainly not an optimization tool. Try htmlnano. Entities are parsed the same as a browser.

@fidian working with fragments and full documents is no different than working with jQuery.

@stevenvachon Thanks, I know the problem domain that cheerio covers. My complain is about having a version out that makes the welcome example obsolete. I understand breaking changes on a 1.0 but, why not communicating those better? Documentation is UX for developers.

And no, htmlnano does not apply to what I do with cheerio: mostly acceptance tests and some document fragment manipulation. I don't want/need to remove nor decode entities, just make sure the parser leaves them as is.

The readme needs to be updated. 1.0rc1 is not 1.0

The parser does what it needs to do in order to be spec compliant. What cheerio offered before was inaccurate and wrong.

@stevenvachon You seem to have a solid grasp of the goal of this project, which is good. I'm confused by your message to me:

@fidian working with fragments and full documents is no different than working with jQuery.

Could you show me what you mean using a code example? Here's mine.

// jQuery in browser on jquery.com
jQuery("<div>Hi</div>").each((index, element) => { console.log(element.nodeName); })
// Result: DIV
jQuery("<html></html>").each((index, element) => { console.log(element.nodeName); })
// Result: HTML

The closest thing I can find in functionality is the following for Cheerio.

// Cheerio 1.0.0-rc1
cheerio.load("<div>Hi</div>")("*").each((index, element) => { console.log(element.name); });
// Result: html head body div
cheerio.load("<div>Hi</div>")("*").each((index, element) => { console.log(element.name); });
// Result: html head body

Those extra elements are added when dealing with fragments only in Cheerio. I don't understand the assertion that Cheerio is no different than jQuery when dealing with both full documents and fragments.

load() does not exist in jQuery.

I realize that load does not exist in jQuery. The quickest path in jQuery to parse a string and get back a list of elements is calling jQuery("string goes here"). The matching command in Cheerio looks like cheerio.load("string goes here")(selectorGoesHere). Do you have an alternative that I should try besides the comment you linked? (More on that later.)

I also deal with fragments, but I could just as easily work with full documents. If it was consistently fragments then I could just grab the elements in . If it was consistently full documents then there's potentially no change, though I may need to remove a or element in order to get back to what was supplied.

I realize that my explanation of the situation does not explain that I do not know if I am receiving a full document, malformed document, nor a template. Your comment indicates that I should treat them differently. That's difficult to do without parsing the string first and figuring out what the first few elements are. Similarly, I may be parsing something that only has <html></html> and it shouldn't get the head nor body elements added. Would you consider that to be a full document or a template? I wouldn't consider it either one. The issue you cited doesn't seem to work in all examples.

cheerio = require("cheerio");
$ = cheerio.load("");
body = $("body");
body.html("<html></html>").html();
// ''
body.html("<head><title>Test</title></head>").html()
// '<title>Test</title>'
body.html("<html><head><title>Test</title></head><body><p>Content!</p></html>").html()
// '<title>Test</title><p>Content!</p>'

I fail to see how this is a drop-in replacement for the 0.x series of Cheerio.

Your comparison is incorrect. jQuery(selectorOrHtml) would be the same as cheerio.load(documentHtml)(selectorOrHtml). jQuery is available after the document element is available while cheerio is available before.

Cheerio should now be consistent with a web browser, and does what you expect if you are familiar with how HTML really works. Major version changes are never intended to be "drop-in replacements". If they were, they would've been a minor version.

Regarding your example, if you are appending <html></html> to <body> and expect that to be returned with no changes, you are technically not parsing as HTML. You aren't even parsing as XML. If you're parsing template files that contain only <head><title>Test</title></head>, then you'll want to use the old htmlparser2:

const $ = cheerio.load("", { useHtmlParser2:true });

per #985

What long-standing bugs are in Cheerio that would be exhibited when using the older parser?

@fidian you aren't reading the links I send you. See #985

Yes, I did see that issue and how it linked to several issues. I've read it multiple times and I've scanned over the linked issues. It isn't that I was lazy and I'm trying to get out of figuring this out myself. Instead, my question is not answered by the materials I covered. I guess I was looking for more of a summary of changes that the new parser fixes instead of a bulleted list of issues that I have to follow individually and scan through the bug report. Maybe, I was hoping that a changelog would list something like the following. I'm including it as code so I don't link to each of the issues and, if someone here thinks it is a good idea, this could be added to the changelog so others understand the list of things that are affected.

* #863 - Set parse5 as the default parser (not a bug).
* #860, #522, #997 - Correct parsing of `<` within HTML when it should not be translated into a tag.
* #126 - Fix entering "bogus comment state" when it encounters CDATA tags; possibly already fixed by htmlparser2.
* #240 - Fix issues with deeply nested elements and recursion.
* #405 - Parse documents that do not have `<head>` elements.
* #422, #694, possibly #826, possibly #971 - Magically add tags where HTML spec requires them, such as inserting a `<tr>` element around `<td>` cells or adding closing tags when listing multile `<li>` elements in a row.
* #907 - Data URIs are now parsed correctly in inline style attributes.
* #937 - Strangely duplicating some content within an element during a broken HTML encoded character.

I don't see how #746 relates, unless you're automatically turning on XML mode or something. The fix in that issue looks like set some configuration before parsing. Also, I don't know about #915 because I would imagine that the new parser would automatically close the <h2> tag. I figure that this particular issue was made worse or didn't see any improvement with the switch of the parser.

As for #971, I was thinking that lies more within the domain of Cheerio itself and not the HTML parser, but I could easily be mistaken.

What isn't mentioned by the pull request is what bugs are attributed solely to htmlparser2 and what bugs are addressed by fixes in Cheerio itself. The most useful thing I could receive as an answer is a message indicating exactly what issues would remain if I retain htmlparser2 as my parsing library, which I don't see covered explicitly in #985. Did I overlook this information somewhere? Is there a summary in existence that's better than what I have included in this post?

@fidian pretty well everything listed here is performed correctly with parse5, unlike htmlparser2.

I'm positive that I'm not explaining myself well because pointing me at the HTML spec doesn't come close to answering my questions. Please allow me to try again.

I'm not seeking the merits of parse5. I simply would like an explanation of things that would remain broken and the scenarios that the Cheerio team encountered with htmlparser2. As I understand it, htmlparser2 is faster and is a viable option as long as people are willing to live with the bugs. What I don't really have a handle on is the types of bugs one encounters with that library and how prevalent they are.

When other people upgrade, I see several routes available for them.

  1. They always parse full documents.
    1. The user prefers speed: use htmlparser2 and suffer with some bugs.
    2. Speed is not critical: use parse5.
  2. They always parse fragments that are able to be embedded in the <body>.
    1. Speed concerns: htmlparser2 and you get to possibly experience bugs.
    2. Validity or no real concerns either way: use parse5 but employ the workaround mentioned earlier in order to not get the <html> and <body> elements.
  3. The user doesn't get a lot of say in what they are parsing or their code can be used with either documents or fragments or even very loose HTML/XML/whatever.
    1. Use htmlparser2 and you live with the bugs.
    2. Consider switching away from Cheerio. (Probably not an option for the two other major scenarios.)

When an upgrade guide is written for Cheerio, the user will have to use some specific information about what's fed into Cheerio in order to make an informed decision. All three of these scenarios have htmlparser2 as a valid option, but there's a slight caveat in that the dev team is warning you that there's bugs in htmlparser2. These bugs are experienced by users of Cheerio and it would be nice to provide some warnings and specific examples of where htmlparser fails. Maybe contrast the two libraries, saying one is slower, more robust with invalid HTML, and it adds/shuffles nodes whenever it feels like it for fragments. The other is faster, but it takes shortcuts and invalid HTML makes an invalid DOM node tree, but elements in fragments are generally left alone and not "tweaked".

I'm not trying to say Cheerio is full of bugs. I'm ok with the switch to parse5 (well, as much as I can now that I have to look for another library or employ some trickery to do what I want). The goal of my feedback is for others. Those poor users who upgrade to 1.0 and are shocked by the changes. They shoul be able to find information about the potential outcome of their choice of parsing libraries when they upgrade to version 1.0.0.

Also, please don't just link to https://github.com/fb55/htmlparser2/issues. My question is more around documentation for Cheerio. If the user can be instructed to use htmlparser2 instead of parse5, what would they experience that works poorly? One could just as easily link to https://github.com/inikulin/parse5/issues and say that parse5 has bugs too, so just use the library that works best for you.

Because both libraries are supported officially by Cheerio, a succinct comparison page is much more aligned with what I am trying to seek. Right now I'm seeing stuff like "you can use parse5 or htmlparser2, but don't use htmlparser2 because it has a lot of bugs" and that's not informative enough. Seems silly to support htmlparser2 when you don't explain why it's still an official option.

Also, please don't just link to https://github.com/fb55/htmlparser2/issues. My question is more around documentation for Cheerio.

Reminds me a lot of this longtime post. http://tom.preston-werner.com/2010/08/23/readme-driven-development.html

Users have been getting 1.0.0-rc.1 on every npm install cheerio for weeks and, the readme begins with an example that does not work. This has been reported, accepted and closed.

Would you accept a pull-request adding a line to the readme to save users installing or upgrading, time and headaches?

@xymbol

Would you accept a pull-request adding a line to the readme to save users installing or upgrading, time and headaches?

I bet maintainers would love to accept comments in #1033 that already have an ongoing work on README.

Also, please don't just link to https://github.com/fb55/htmlparser2/issues. My question is more around documentation for Cheerio. If the user can be instructed to use htmlparser2 instead of parse5, what would they experience that works poorly? One could just as easily link to https://github.com/inikulin/parse5/issues and say that parse5 has bugs too, so just use the library that works best for you.

parse5 parses HTML per language specification, i.e. like browsers do. It's always parses HTML correctly. The drawback is speed (~2x slower). htmlparser2 is extremely simplified implementation of the parsing algorithm and can handle correctly only very limited subset of HTML (even wellformed).

Hi folks,

Thanks for taking the time to work through all this. I've updated the branch
for the next release candidate with an attempt to address your concerns:

#1033

If you have a moment, I'd appreciate your input there.

Yeah I've been getting stuff like <div class="content"><html><head></head><body>Something</body></html></div> on my website :D So the best solution seems to be just $('body').html() instead of $.html().

Use xmlMode: true to avoid auto-wrapping:

const parsed = cheerio.load(content, {
  xmlMode: true
});

Give a switch? wrapHTML: true/false?

Template don't need <html><head></head><body></body></html> wrap...

Use xmlMode: true to avoid auto-wrapping:

const parsed = cheerio.load(content, {
  xmlMode: true
});

The code in script tag may be modified in this way.

Use xmlMode: true to avoid auto-wrapping:

That breaks HTML entities, such as &nbsp;

fb55 commented

To give an answer here: The load method has a third argument, isDocument. If you set this to false, your document will be parsed in fragment mode and no <html> root tag (or other boilerplate) will be inserted. To use the example from the original issue:

let $ = cheerio.load('<div>Hello</div>', null, false);
console.log($.html());

// "<div>Hello</div>"
dpw1 commented

@fb55

I can confirm that this solution works for me, with HTML, HTML/Liquid and Handlebars.

Thank you!