Exception at findTagContent in getHeadAndBodyContent.js
Closed this issue · 2 comments
This function has severe limitations and causes breakage on some sites (null reference exception at line 8: Cannot read property '0' of null).
- It doesn't account for whitespace in tags, e.g. < body>, < / body > and others won't match.
- It is case-sensitive, while HTML tag names are case-insensitive.
- It doesn't necessarily match the correct closing tag (shouldn't be a problem for the head/body use case, but then the function's name should probably be changed to reflect that, e.g. findUniqueTagContent).
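To make the three points above concrete, here is a rough sketch of a more tolerant matcher. It is not the project's actual code: the body of findUniqueTagContent below is my own assumption of how such a function could look. It tolerates whitespace and attributes in the tags, ignores case, and returns null instead of throwing when nothing matches.

```javascript
// Hypothetical sketch, not the existing implementation.
// Assumes the tag occurs at most once (as head/body normally do).
function findUniqueTagContent(html, tagName) {
  // Escape the tag name so it is safe to embed in a RegExp.
  const name = tagName.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // Opening tag: optional whitespace before the name, optional attributes.
  // Closing tag: optional whitespace around the slash and the name.
  // The "i" flag makes the match case-insensitive.
  const pattern = new RegExp(
    "<\\s*" + name + "(?:\\s[^>]*)?>([\\s\\S]*)<\\s*/\\s*" + name + "\\s*>",
    "i"
  );
  const match = pattern.exec(html);
  // exec returns null when there is no match; guard instead of indexing it.
  return match === null ? null : match[1];
}
```

The content group is greedy, so it runs from the first opening tag to the last closing tag. An attribute value containing ">" would still confuse it, which is one reason a real parser may be the better route.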
I can submit a PR later if you want.
Those are all good points! I think another issue with that function is that it will mistake a commented-out <body> tag for the real thing, since it just operates on a string.
I'm using cheerio to parse HTML in other parts of the code, so I think that's the way to go here as well. However, if we just run
cheerio.load("<body><div ></div></body>")("body").html()
the result will be <div></div>, but findTagContent should return <div ></div> (extra space at the end of the div tag).
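For comparison, here is a parser-free sketch that handles both concerns: it skips commented-out tags and still returns the source text verbatim. This is only an illustration under my own assumptions. The names maskComments and findTagContentVerbatim are hypothetical, and it is not a substitute for the cheerio approach. The idea is to blank out comments with spaces of the same length, search the masked copy, then slice the original string at the same offsets.

```javascript
// Hypothetical helper: replace each HTML comment with spaces of equal length,
// so offsets in the masked string line up with offsets in the original.
function maskComments(html) {
  return html.replace(/<!--[\s\S]*?-->/g, (comment) => " ".repeat(comment.length));
}

// Hypothetical sketch. Assumes tagName is a plain name such as "head" or "body".
function findTagContentVerbatim(html, tagName) {
  const masked = maskComments(html);
  // First opening tag outside comments (case-insensitive, attributes allowed).
  const open = new RegExp("<" + tagName + "(?:\\s[^>]*)?>", "i").exec(masked);
  if (open === null) return null;
  const contentStart = open.index + open[0].length;
  // Last closing tag outside comments, searching from the end of the opening tag.
  const closePattern = new RegExp("</" + tagName + "\\s*>", "gi");
  closePattern.lastIndex = contentStart;
  let contentEnd = -1;
  let close;
  while ((close = closePattern.exec(masked)) !== null) {
    contentEnd = close.index;
  }
  if (contentEnd === -1) return null;
  // Slice the ORIGINAL string so whitespace and inner comments are preserved.
  return html.slice(contentStart, contentEnd);
}
```

With the input from the cheerio example, findTagContentVerbatim("<body><div ></div></body>", "body") returns <div ></div> with the extra space intact. It does not understand script contents or CDATA, so it remains a heuristic.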
Thanks for offering to submit a PR! Let me know if what I wrote here makes sense :)
Let me know if you're still interested in looking into this. Otherwise I'll try and fix it later this week.
Thanks again for reporting!