commoncrawl/ia-web-commons

StringIndexOutOfBoundsException during WAT/WET generation


The WEATGenerator chokes on some WARC files and fails with a StringIndexOutOfBoundsException thrown by ExtractingParseObserver.

...
16/07/04 08:18:53 INFO jobs.WEATGenerator: Add input path: s3a://commoncrawl/crawl-data/CC-MAIN-2016-26/segments/1466783392527.68/warc/CC-MAIN-20160624154952-00042-ip-10-164-35-72.ec2.internal.warc.gz
...
16/07/04 08:18:58 INFO mapreduce.Job: Running job: job_1466588320333_0319
...
16/07/04 08:30:42 INFO mapreduce.Job: Task Id : attempt_1466588320333_0319_m_000000_0, Status : FAILED
Error: java.io.IOException: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
at org.archive.hadoop.jobs.WEATGenerator$WEATGeneratorMapper.map(WEATGenerator.java:126)
at org.archive.hadoop.jobs.WEATGenerator$WEATGeneratorMapper.map(WEATGenerator.java:48)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
at java.lang.String.substring(String.java:1911)
at org.archive.resource.html.ExtractingParseObserver.patternCSSExtract(ExtractingParseObserver.java:447)
at org.archive.resource.html.ExtractingParseObserver.handleStyleNode(ExtractingParseObserver.java:201)
at org.archive.format.text.html.LexParser.doParse(LexParser.java:36)
at org.archive.format.text.html.LexParser.doParse(LexParser.java:18)
at org.archive.resource.html.HTMLResourceFactory.getResource(HTMLResourceFactory.java:31)
at org.archive.extract.ExtractingResourceProducer.getNext(ExtractingResourceProducer.java:54)
at org.archive.hadoop.jobs.WEATGenerator$WEATGeneratorMapper.map(WEATGenerator.java:108)
... 9 more

For this WARC file, the problem is caused by the following CSS snippet:

#services .avia-logo-element-container img {
    filter: url(\"");
    filter: none;
    -webkit-filter: none;
}

The length check in the method patternCSSExtract is insufficient. The substring call removes 4 characters (2 at each end), so the URL must be at least 4 characters long, but only a length of exactly 2 is rejected:

  } else if (url.charAt(0) == '\\') {
     if(url.length() == 2)
       continue;
     url = url.substring(2, origUrlLength - 2);
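A minimal sketch of a stricter check, isolated from the parser so it can be run on its own. The class `CssUrlTrim` and both method names are hypothetical and are not part of ia-web-commons. The sketch also assumes that the value extracted from the CSS above is the three-character string `\""`, which would explain the `-1` in the exception message (end index 1 minus begin index 2).

```java
// Hypothetical helper for illustration only; not the actual ia-web-commons code.
final class CssUrlTrim {
    private CssUrlTrim() {}

    // Mirrors the substring call quoted in the issue, without any length guard.
    static String stripUnchecked(String url) {
        int origUrlLength = url.length();
        // Throws StringIndexOutOfBoundsException when origUrlLength < 4.
        return url.substring(2, origUrlLength - 2);
    }

    // Guarded variant: returns null when the value is too short to hold
    // an escaped quote (two characters) at both ends.
    static String stripEscapedQuotes(String url) {
        if (url == null || url.isEmpty() || url.charAt(0) != '\\') {
            return url; // not an escaped-quote value, leave untouched
        }
        if (url.length() < 4) {
            return null; // caller should skip this match (the "continue" case)
        }
        return url.substring(2, url.length() - 2);
    }
}
```

In the original loop, a `null` return would correspond to the `continue` branch. Replacing `url.length() == 2` with `url.length() < 4` in place would be the equivalent inline change.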