Links in onClick property not captured in WAT 'Links' metadata
Some examples below:
div onclick="location.href='webpage.html'"
input type=button onClick="parent.location='index.html'" value='click here'
input type=button onClick="parent.open('http://www.x.com/')" value='new window'
input type=button onClick=window.open("button-child.php","demo","width=550,height=300,left=150,top=200,toolbar=0,status=0,"); value="Open child Window"
input type="button" value="Open" onclick="window.location.href='http://www.y.com/'"
Thanks for reporting this problem. The challenge is to extract the URL from the value of the onclick attribute, especially because quoting in embedded JavaScript isn't trivial, e.g.: onclick="window.open('http://example.com/', &#39;width=500');"
We need to find a reliable solution, given that the onclick attribute is frequent and that other event-handler attributes (onsubmit etc.) should ideally be covered as well.
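To make the "partial solution" idea concrete, here is a minimal Python sketch of a heuristic extractor. It is not the fix that was applied to the WAT extractor; the function name `extract_onclick_links` and the regular expression are assumptions made for illustration. It only recognizes a quoted string that directly follows an assignment to `location` / `location.href` or a call to `open(`, and it decodes HTML entities first, on the assumption that the raw attribute value is passed in.

```python
import html
import re

# Hypothetical heuristic: a quoted string literal that directly follows
# an assignment to location / location.href or a call to open(...).
_JS_LINK = re.compile(
    r"""(?:\blocation(?:\.href)?\s*=\s*|\bopen\s*\(\s*)   # navigation target
        (?P<q>['"])(?P<url>[^'"]+)(?P=q)                  # quoted URL candidate
    """,
    re.VERBOSE,
)


def extract_onclick_links(attr_value):
    """Return URL candidates found in an event-handler attribute value."""
    # Quotes may arrive as HTML entities such as &#39; so decode them first.
    # Skip this step if the HTML parser has already decoded the attribute.
    decoded = html.unescape(attr_value)
    return [m.group("url") for m in _JS_LINK.finditer(decoded)]


if __name__ == "__main__":
    examples = [
        "location.href='webpage.html'",
        "parent.location='index.html'",
        "parent.open('http://www.x.com/')",
        "window.location.href='http://www.y.com/'",
    ]
    for value in examples:
        print(value, "->", extract_onclick_links(value))
```

A pattern like this misses URLs that are built by string concatenation or passed through variables, and it can pick up a non-URL string that happens to be the first argument of some unrelated `open(` call, so the results should be treated as candidates rather than verified links.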
Any further thoughts on this? Seems like a partial solution would still get you pretty far.
Thanks, @sebastian-nagel! That was my next question :)
Have you by any chance done an analysis of how this change increases URL counts? Quite curious to know the answer.
I've only verified it on a single WARC (CC-MAIN-20170629154125-20170629174125-00719.warc.gz): 3200 more links for 131,000 records (934,000 links before), i.e. an increase of roughly 0.3%. Here is an overview of the link "paths":
7777909 A@/href
1266284 IMG@/src
90022 STYLE/#text
82498 FORM@/action
30165 A@/data-href
29271 IFRAME@/src
12383 DIV@/data-href
9034 TD@/background
8339 AREA@/href
7932 SPAN@/data-href
7595 INPUT@/src
6296 IMG@/longdesc
2710 DIV@/onclick <<<<<
2524 EMBED@/src
1521 TABLE@/background
1481 BUTTON@/data-href
1125 BLOCKQUOTE@/cite
995 OBJECT@/codebase
860 OBJECT@/data
608 SOURCE@/src
500 INPUT@/onclick <<<<<
405 LI@/data-href
378 INPUT@/data-href
370 BODY@/background
351 LABEL@/data-href
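For readers who want to produce a similar overview, the following Python sketch counts link "paths" across WAT records. It is not the command that generated the numbers above. It assumes each record has already been parsed from JSON into a dict and that links are stored under Envelope > Payload-Metadata > HTTP-Response-Metadata > HTML-Metadata > Links with a "path" key per link; the function name `count_link_paths` is hypothetical.

```python
from collections import Counter


def count_link_paths(wat_records):
    """Count link 'path' values over an iterable of parsed WAT JSON records."""
    counts = Counter()
    for record in wat_records:
        # Assumed nesting of the link list inside a WAT response record.
        links = (
            record.get("Envelope", {})
            .get("Payload-Metadata", {})
            .get("HTTP-Response-Metadata", {})
            .get("HTML-Metadata", {})
            .get("Links", [])
        )
        for link in links:
            path = link.get("path")
            if path:
                counts[path] += 1
    return counts


if __name__ == "__main__":
    # Tiny made-up record, only to show the expected shape of the input.
    sample = {
        "Envelope": {"Payload-Metadata": {"HTTP-Response-Metadata": {
            "HTML-Metadata": {"Links": [
                {"path": "A@/href", "url": "index.html"},
                {"path": "DIV@/onclick", "url": "webpage.html"},
                {"path": "A@/href", "url": "about.html"},
            ]}}}}
    }
    # Print counts in descending order, one "count path" pair per line.
    for path, n in count_link_paths([sample]).most_common():
        print(f"{n:8d} {path}")
```

Records without HTML metadata (requests, metadata records, non-HTML responses) simply contribute nothing, because every lookup falls back to an empty default.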
Interesting, thanks Sebastian.