CDX indexer endless loop and memleak for certain warc files

We have a few problematic WARC files, created by Heritrix 3, for which the CDX indexer seems to get stuck: CPU usage goes up to 100%, memory usage climbs to the -Xmx limit (16 GB), and the CDX indexer stops producing output (for testing we just let it write to standard output, and once stuck it stops writing). The last lines of the output are consistent: the indexer always gets stuck at the same place for the same file.

The catch is that the problematic WARC file is about 9 GB and should not be made public; I'm only authorized to send its URL by email.

Any chance you could take a look?

A thread dump might give a rough indication of where the problem is. Try pressing Ctrl+\ (Linux/Mac) or Ctrl+Break (Windows) in the terminal running the indexer. If it is stuck trying to allocate memory, though, the dump might not necessarily indicate the source of the leak.
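If the terminal shortcut is not available (for example, when the indexer runs without an attached console), a similar dump can be produced from inside the JVM. The following is only a minimal sketch: the class ThreadDumpSketch and its methods are hypothetical names and are not part of OpenWayback. It relies on the standard Thread.getAllStackTraces() call, and a watchdog like this may itself fail to run once the heap is completely exhausted.

```java
import java.util.Map;

/** Hypothetical helper, not part of OpenWayback: formats the stacks of all live threads. */
final class ThreadDumpSketch {

    /** Builds a plain-text dump of every live thread from Thread.getAllStackTraces(). */
    static String dumpAllThreads() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            out.append('"').append(thread.getName()).append('"')
               .append(" state=").append(thread.getState()).append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                out.append("\tat ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    /** Starts a daemon thread that prints a dump to stderr every intervalMillis milliseconds. */
    static Thread startWatchdog(long intervalMillis) {
        Thread watchdog = new Thread(() -> {
            try {
                while (true) {
                    Thread.sleep(intervalMillis);
                    System.err.println(dumpAllThreads());
                }
            } catch (InterruptedException stop) {
                // Interrupted: end the watchdog quietly.
            }
        }, "dump-watchdog");
        watchdog.setDaemon(true); // must not keep the JVM alive on its own
        watchdog.start();
        return watchdog;
    }

    public static void main(String[] args) {
        System.out.println(dumpAllThreads());
    }
}
```

Calling ThreadDumpSketch.startWatchdog(60_000) at the start of a test run would print the stacks once a minute, so two consecutive dumps can be compared in the same way as the two manual dumps in this thread.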

The first thread dump was taken right after the indexer got stuck; the second one is from a bit later. Both show the main thread only. If you need the data from the GC and other threads, I can send those too.

"main" prio=10 tid=0x00007fa7cc00c000 nid=0x5208 runnable [0x00007fa7d3faf000]
   java.lang.Thread.State: RUNNABLE
    at org.htmlparser.nodes.TagNode.getTagName(TagNode.java:398)
    at org.archive.wayback.util.htmllex.NodeUtils.isCloseTagNodeNamed(NodeUtils.java:72)
    at org.archive.wayback.util.htmllex.ContextAwareLexer.nextNode(ContextAwareLexer.java:87)
    at org.archive.wayback.resourcestore.indexer.HTTPRecordAnnotater.annotateHTMLContent(HTTPRecordAnnotater.java:156)
    at org.archive.wayback.resourcestore.indexer.HTTPRecordAnnotater.annotateHTTPContent(HTTPRecordAnnotater.java:141)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptWARCHTTPResponse(WARCRecordToSearchResultAdapter.java:303)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptInner(WARCRecordToSearchResultAdapter.java:114)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adapt(WARCRecordToSearchResultAdapter.java:79)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adapt(WARCRecordToSearchResultAdapter.java:53)
    at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:57)
    at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
    at org.archive.wayback.resourcestore.indexer.IndexWorker.main(IndexWorker.java:216)

Second one:

"main" prio=10 tid=0x00007fa7cc00c000 nid=0x5208 runnable [0x00007fa7d3faf000]
   java.lang.Thread.State: RUNNABLE
    at org.htmlparser.lexer.InputStreamSource.fill(InputStreamSource.java:337)
    at org.htmlparser.lexer.InputStreamSource.read(InputStreamSource.java:396)
    at org.htmlparser.lexer.Page.getCharacter(Page.java:705)
    at org.htmlparser.lexer.Lexer.parseString(Lexer.java:735)
    at org.htmlparser.lexer.Lexer.nextNode(Lexer.java:398)
    at org.archive.wayback.util.htmllex.ContextAwareLexer.nextNode(ContextAwareLexer.java:72)
    at org.archive.wayback.resourcestore.indexer.HTTPRecordAnnotater.annotateHTMLContent(HTTPRecordAnnotater.java:156)
    at org.archive.wayback.resourcestore.indexer.HTTPRecordAnnotater.annotateHTTPContent(HTTPRecordAnnotater.java:141)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptWARCHTTPResponse(WARCRecordToSearchResultAdapter.java:303)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptInner(WARCRecordToSearchResultAdapter.java:114)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adapt(WARCRecordToSearchResultAdapter.java:79)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adapt(WARCRecordToSearchResultAdapter.java:53)
    at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:57)
    at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
    at org.archive.wayback.resourcestore.indexer.IndexWorker.main(IndexWorker.java:216)

Just noting this related issue: #162

The indexer only does this parsing to poke around for robots.txt assertions that most (any?) of us don't make use of. Maybe we should modify things so it's optional/off-by-default?

openwayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/indexer/HTTPRecordAnnotater.java

Lines 137 to 143 in c49f8e7

    
    // Now the sticky part: If it looks like an HTML document, look for
    // robot meta tags:
    if(isHTML(mimeType)) {
        String fileContext = result.getFile() + ":" + result.getOffset();
        annotateHTMLContent(is, encoding, fileContext, result);
    }
    robotFlags.apply(result);
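To make the optional/off-by-default idea concrete, here is a rough sketch of what such a switch could look like. It is not the actual OpenWayback code and not necessarily how any later change implemented it: the class OptionalRobotMetaAnnotater, the HtmlScanner interface, the parseRobotMeta flag and the simplified isHTML check are all assumptions made for illustration. Only the guarded call mirrors the isHTML / annotateHTMLContent snippet quoted above.

```java
import java.util.Locale;

/**
 * Hypothetical sketch, not the OpenWayback class: shows an off-by-default
 * switch around the HTML scan for robot meta tags.
 */
final class OptionalRobotMetaAnnotater {

    /** Stand-in for the expensive lexer pass (annotateHTMLContent in the quoted snippet). */
    interface HtmlScanner {
        void scan(String fileContext);
    }

    private final HtmlScanner scanner;

    // Assumed flag name; false by default so plain indexing never enters the lexer.
    private boolean parseRobotMeta = false;

    OptionalRobotMetaAnnotater(HtmlScanner scanner) {
        this.scanner = scanner;
    }

    void setParseRobotMeta(boolean parseRobotMeta) {
        this.parseRobotMeta = parseRobotMeta;
    }

    /** Runs the HTML scan only when enabled; returns true if the scan was run. */
    boolean annotate(String mimeType, String file, long offset) {
        if (parseRobotMeta && isHTML(mimeType)) {
            String fileContext = file + ":" + offset;
            scanner.scan(fileContext);
            return true;
        }
        return false;
    }

    /** Simplified stand-in for the real isHTML check, whose rules are not shown here. */
    static boolean isHTML(String mimeType) {
        return mimeType != null
                && mimeType.toLowerCase(Locale.ROOT).startsWith("text/html");
    }
}
```

With the flag left at its default, records are indexed without touching the HTML lexer at all, which matches the behaviour observed when the parsing step is skipped. Anyone who relies on robot meta tags would have to enable the flag explicitly and would still be exposed to the lexer problem on affected WARCs.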

Just for testing I put a return; at the front of annotateHTTPContent, before the robotFlags.reset(); call (openwayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/indexer/HTTPRecordAnnotater.java, line 88 in c49f8e7), and now it finishes as expected: no endless loop, no memory leak.

I am closing this issue now that #403 has been merged, which allows the endless loop to be avoided. #162 remains open to address the remaining potential for infinite loops on some WARCs, and it references this issue.
