andris9/NodePie

decoding of invalid HTML entities replaces arbitrary chunks of content with a null character

Closed this issue · 3 comments

By default, nodepie decodes HTML entities in every(?) string that it returns through getContents(),getDescription() etc. I can't think of any reason why you would want to do that, because the XML parser will decode those fields when you parse the XML document. Nodepie then tries to decode them again, and if it encounters any ampersands, it assumes it is the beginning of an HTML entity. It will read until the next semicolon and then try to decode whatever it read. This is garbage, of course, so it doesn't decode properly, and is replaced with a null character instead. Then parsing resumes from where the semicolon was... so we have lost all the content between the ampersand and the semicolon, and we've replaced it with a null, which is not valid UTF8.

The workaround: turn on the option "keepHTMLEntities", which resolves the issue. Why isn't this on by default? Even if it was on by default, the behavior for decoding invalid HTML entities still needs to be fixed.

fb55 commented

This fault is probably based on one of my changes, I'm sorry for that. I would support the suggestion to disable char decoding by default, as it's the parsers job. Besides, an external library (as node-ent) should be used for char decoding, so that the code loses some weight (and the errors are someone else's job).

Hi, you are right that it should be off by default but currently I have already a lot of projects using it this way and I do not want to break anything, so I guess it is going to stay this way.

I've already forked NodePie to make a couple enhancements that eventually
I'd like you to pull (I'm going to add a way to tell if distinct
descriptions and contents are provided, among other minor changes; they're
not done yet). I'll be sure not to change the default for keepHTMLEntities
so you can pull without breaking your existing projects. Are you certain
your existing projects aren't broken already? Every feed I've yet tried
the library on has been correctly decoded with keepHTMLEntities == true,
and some small subset has been incorrectly decoded with keepHTMLEntities ==
false. This is because the bug only manifests on feed items that contain
an ampersand in their content/description/title/etc, which is somewhat rare.

Even with this small issue, NodePie is an excellent library and has saved
me a lot of time. Thank you very much for creating it, and I look forward
to being able to contribute in the near future.

On Thu, Dec 8, 2011 at 9:12 AM, Andris Reinman <
reply@reply.github.com

wrote:

Hi, you are right that it should be off by default but currently I have
already a lot of projects using it this way and I do not want to break
anything, so I guess it is going to stay this way.


Reply to this email directly or view it on GitHub:
#9 (comment)