mattzeunert/FromJS

Exception at findTagContent in getHeadAndBodyContent.js

Closed this issue · 2 comments

This function has severe limitations and breaks on some sites: it throws a TypeError at line 8 (Cannot read property '0' of null).

  1. It doesn't account for whitespace in tags, e.g. < body>, < / body > and similar variants won't match.
  2. It is case-sensitive, while HTML tag names are case-insensitive.
  3. It doesn't necessarily match the correct closing tag. That shouldn't be a problem for the head/body use case, but then the function's name should probably be changed to reflect that, e.g. findUniqueTagContent.
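
To make the three points concrete, here is a rough sketch of what a more tolerant version could look like. It is not the current FromJS code: the name findUniqueTagContent is just the rename suggested in point 3, and the regular expressions are my own assumption. It also assumes tagName is a plain alphanumeric name and that no attribute value contains a ">" character.

```javascript
// Hypothetical sketch, not the actual FromJS implementation.
// Returns the raw text between the first <tag ...> and the last </tag>,
// or null when either tag is missing (instead of throwing).
function findUniqueTagContent(html, tagName) {
  // Opening tag: optional whitespace, tag name, optional attributes.
  // The "i" flag makes it case-insensitive; \b stops "head" matching "<header>".
  const open = new RegExp("<\\s*" + tagName + "\\b[^>]*>", "i");
  // Closing tag: whitespace tolerated around the slash and before ">".
  const close = new RegExp("<\\s*/\\s*" + tagName + "\\s*>", "ig");

  const openMatch = open.exec(html);
  if (openMatch === null) return null;
  const start = openMatch.index + openMatch[0].length;

  // Take the LAST closing tag after the opening one. This is only
  // correct for tags that occur once per document, such as head and body.
  let end = -1;
  let match;
  close.lastIndex = start;
  while ((match = close.exec(html)) !== null) {
    end = match.index;
  }
  if (end === -1) return null;

  // Slice the original string so the content is returned untouched.
  return html.slice(start, end);
}
```

Because the result is sliced out of the original string, the content keeps its exact whitespace. Callers would need to handle the null return for documents without the requested tag.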

I can submit a PR later if you want.

Those are all good points! I think another issue with that function is that it will mistake a commented-out <body> tag for the real thing, since it just operates on a string.
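
A small illustration of the comment problem, plus one possible string-only workaround. This is only a sketch under my own assumptions, not something FromJS does today: maskComments is a hypothetical helper that blanks out comments with spaces of the same length, so positions found in the masked string can still be used to slice the original. It does not handle comment-like text inside script or style elements.

```javascript
// Hypothetical helper: replaces every HTML comment with spaces of the same
// length, so indexes in the masked string line up with the original string.
function maskComments(html) {
  return html.replace(/<!--[\s\S]*?-->/g, (comment) => " ".repeat(comment.length));
}

const html = "<html><!-- <body>old</body> --><body><p>real</p></body></html>";

// A plain string search finds the tag inside the comment first.
const naiveIndex = html.indexOf("<body>");

// Searching the masked copy skips the comment and finds the real tag.
const maskedIndex = maskComments(html).indexOf("<body>");

console.log(naiveIndex, maskedIndex);
```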

I'm using cheerio to parse HTML in other parts of the code, so I think that's the way to go here as well. However, if we just run

cheerio.load("<body><div ></div></body>")("body").html()

The result will be <div></div>, but findTagContent should return <div ></div> (extra space at end of div tag).

Thanks for offering to submit a PR! Let me know if what I wrote here makes sense :)

Let me know if you're still interested in looking into this. Otherwise I'll try and fix it later this week.

Thanks again for reporting!