philipwalton/html-inspector

Don't analyse the content of scripts

jonmiles opened this issue · 8 comments

I've noticed two potential bugs, both related (I believe) to the inspector analysing the content of scripts tags. I would have thought scripts should be ignored?

Both examples are taken from a real website, a large retailer.

Bug 1.

Throws ReferenceError: Can't find variable: s_c_il. Apparently when analysing the src attribute of a script tag.

Here is the a test snippet taken straight from the site.

<!DOCTYPE html><html lang="en-GB" class=" hasJS desktop hasTouch notIE678">
<head>
<script type="text/javascript" async="async" src="http://dpm.demdex.net/id?d_rtbd=json&amp;d_ver=2&amp;d_orgid=1BB646735278569C0A490D45%40AdobeOrg&amp;d_mid=44962865160045396129049647691892029103&amp;d_cb=s_c_il%5B0%5D._setAudienceManagerFields"></script>
...
</head>
</html>

Bug 2.

The duplicate-ids rule fires when string id="val" also found in script content.

This is the warning I receive.

[duplicate-ids] The id 'cp' appears more than once in the document.

When I search the content, I find one valid html assignment

<div id="cp">...</div>

and one instance assigning a JavaScript object attribute.

cp.id="cp";

Btw, great work, this is a great little tool... I just need to work out how to work around these issues.

In case it helps, I've found two more examples of what I previously reported as Bug 1.

Example 1

<!DOCTYPE html><html lang="en-US" xmlns="http://www.w3.org/1999/xhtml"><head>
<script type="text/javascript" async="" src="file:///home/milesj/Documents/seotodo/data/3pxt6/49n6b.js"></script><script src="https://apis.google.com/_/scs/apps-static/_/js/k=oz.gapi.en_GB.nA4eUF06vL8.O/m=plusone/rt=j/sv=1/d=1/ed=1/am=AQ/rs=AGLTcCNcXA0KaQorIa7m2ep19XOiE7Q6rQ/t=zcms/cb=gapi.loaded_0" async=""></script>
</head><html>
ReferenceError: Can't find variable: gapi

Example 2

<!DOCTYPE html><html><body><script src="https://cdn.syndication.twimg.com/widgets/timelines/478801075776942080?callback=__twttr.callbacks.tl_478801075776942080_1&amp;dnt=false&amp;domain=blog.buzzvil.com&amp;lang=en&amp;suppress_response_codes=true&amp;t=1591762" async=""></script></body></html>
ReferenceError: Can't find variable: __twttr

In both these examples, it appears to be throwing the error while parsing a url query string. Is it supposed to parse query strings?

Have I set something up wrong? It's possible I know, but I can't see what it could be. It works on some sites, but not on others.

One final thing, in case it's related in some way. When I remove I remove the src value completely e.g. src="" it throws a different error SyntaxError: Parse error. It's not a warning that there's a syntax error, it's the htmlinspector process failing to handle the content. That doesn't make any sense to me either, as surely src="" is a valid mistake, as opposed to syntax the parser can not handle. This again makes me think I'm doing something wrong, but if I am I have no idea.

@jonmiles, HTML Inspector doesn't parse the contents of script tags, so there must be something else going on. Do you have these examples hosted anywhere so I could take a look?

I'm getting a similar response when running the CLI version of html-inspector 0.8.1 with phantomjs 1.9.8 using stock config against http://theinnerlogic.com/.

ReferenceError: Can't find variable: requestAnimationFrame

It works on some sites, but not on others.

I don't think PhantomJS supports requestAnimationFrame, so that's probably where the error is coming from. Either way, it's a JavaScript error, not an error produced by HTML Inspector.

I think what we're asking is why is the html-inspector throwing JavaScript errors?

If it doesn't inspect scripts why is the cli output referring to a JavaScript variable, or part of a URL in the script tag's SRC?

What was wrong with the two examples I provided? Both snippets reproduce the errors on their own. Is it not valid HTML?

I've actually rewrote your code to work without phantomjs, or a browser, so this is no longer a problem for me.

I think what we're asking is why is the html-inspector throwing JavaScript errors?

It's not. PhantomJS is throwing those errors. HTML Inspector needs a JavaScript runtime with a DOM in order to work, but that means if there are other JavaScript errors on the page, those errors will affect HTML Inspector's ability to do it's work.

What was wrong with the two examples I provided? Both snippets reproduce the errors on their own. Is it not valid HTML?

Again, these errors have nothing to do with HTML Inspector or the validity of the HTML. They're being thrown because there are errors with the JavaScript (either downloading it, parsing it, executing it, etc.).

In phantom-runner.js in your page.onError handler you're re-throwing JavaScript errors from the page and then exiting.

You don't need to do this, the JavaScript errors do not affect the HTML Inspectors ability to do it's job.

The only errors you should be concerned with are from the phantom.onError callback.

So you see the errors are coming from the HTML inspector, and essentially you are inspecting the content of scripts.

the JavaScript errors do not affect the HTML Inspectors ability to do it's job.

One of the core principles of HTML Inspector is that it operates on live DOM. This allows it to inspect DOM that's dynamically generated as is the case in many single page applications today.

If there's a JavaScript error on the page that prevents the DOM from generating, HTML Inspector may (incorrectly) report that it found no errors, when in fact it would have had the JavaScript run properly.

In phantom-runner.js in your page.onError handler you're re-throwing JavaScript errors from the page and then exiting.

I can see an argument for not exiting, but I can also see an argument for wanting to fail the tests if any JavaScript errors are found.

Honestly, the main problem here (as I see it) is not that that I'm exiting on JS errors, it's that PhantomJS doesn't support a lot of the JavaScript features used by devs today, so it's saying the page contains errors when it doesn't if run in a modern browser.

I'm very much open to replacing PhantomJS. In fact, I started trying to use jsdom a few months back, but at the time it didn't support all the features needed.

FWIW, In my personal tests, I don't use the CLI at all; I use selenium webdriver.