CDX indexer endless loop and memleak for certain warc files
vitezg opened this issue · 5 comments
We have a few problematic WARC files, created by Heritrix 3, on which the CDX indexer gets stuck: CPU usage goes up to 100%, memory usage climbs to the -Xmx limit (16GB), and the indexer stops producing output (for testing we just let it write to standard output, and once stuck it stops writing). The last lines of output are consistent: the indexer always gets stuck at the same place for the same file.
The catch is that the problematic WARC file is about 9 GB and should not be made public, so I'm only authorized to send its URL by email.
Any chance you could take a look?
A thread dump might give a rough indication of where the problem is. Try pressing Ctrl+\ (Linux/Mac) or Ctrl+Break (Windows) in the terminal running the indexer. If it is stuck trying to allocate, though, the dump will not necessarily point to the source of the leak.
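If the key combination is awkward to use (for example, no attached terminal), the JDK's jstack tool run against the process ID gives the same kind of dump from outside. From inside a JVM, roughly the same information is available through Thread.getAllStackTraces(). A minimal sketch of the latter follows; the class name ThreadDumpSketch is made up for illustration and is not part of the indexer.

```java
import java.util.Map;

/** Illustrative helper, not part of wayback: formats a dump of all live threads. */
public class ThreadDumpSketch {

    /** Returns one block per live thread: name, state, then its stack frames. */
    public static String dumpAllThreads() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            sb.append('"').append(t.getName()).append("\" state=")
              .append(t.getState()).append('\n');
            for (StackTraceElement frame : e.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dumpAllThreads());
    }
}
```

Taking two or three dumps a few seconds apart and comparing the top frames is usually more telling than a single one.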
The first thread dump is from right after the indexer got stuck; the second is from a bit later. Both show the main thread only. If you need the data for the GC and other threads, I can send those too.
"main" prio=10 tid=0x00007fa7cc00c000 nid=0x5208 runnable [0x00007fa7d3faf000]
   java.lang.Thread.State: RUNNABLE
    at org.htmlparser.nodes.TagNode.getTagName(TagNode.java:398)
    at org.archive.wayback.util.htmllex.NodeUtils.isCloseTagNodeNamed(NodeUtils.java:72)
    at org.archive.wayback.util.htmllex.ContextAwareLexer.nextNode(ContextAwareLexer.java:87)
    at org.archive.wayback.resourcestore.indexer.HTTPRecordAnnotater.annotateHTMLContent(HTTPRecordAnnotater.java:156)
    at org.archive.wayback.resourcestore.indexer.HTTPRecordAnnotater.annotateHTTPContent(HTTPRecordAnnotater.java:141)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptWARCHTTPResponse(WARCRecordToSearchResultAdapter.java:303)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptInner(WARCRecordToSearchResultAdapter.java:114)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adapt(WARCRecordToSearchResultAdapter.java:79)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adapt(WARCRecordToSearchResultAdapter.java:53)
    at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:57)
    at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
    at org.archive.wayback.resourcestore.indexer.IndexWorker.main(IndexWorker.java:216)
Second one:
"main" prio=10 tid=0x00007fa7cc00c000 nid=0x5208 runnable [0x00007fa7d3faf000]
   java.lang.Thread.State: RUNNABLE
    at org.htmlparser.lexer.InputStreamSource.fill(InputStreamSource.java:337)
    at org.htmlparser.lexer.InputStreamSource.read(InputStreamSource.java:396)
    at org.htmlparser.lexer.Page.getCharacter(Page.java:705)
    at org.htmlparser.lexer.Lexer.parseString(Lexer.java:735)
    at org.htmlparser.lexer.Lexer.nextNode(Lexer.java:398)
    at org.archive.wayback.util.htmllex.ContextAwareLexer.nextNode(ContextAwareLexer.java:72)
    at org.archive.wayback.resourcestore.indexer.HTTPRecordAnnotater.annotateHTMLContent(HTTPRecordAnnotater.java:156)
    at org.archive.wayback.resourcestore.indexer.HTTPRecordAnnotater.annotateHTTPContent(HTTPRecordAnnotater.java:141)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptWARCHTTPResponse(WARCRecordToSearchResultAdapter.java:303)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adaptInner(WARCRecordToSearchResultAdapter.java:114)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adapt(WARCRecordToSearchResultAdapter.java:79)
    at org.archive.wayback.resourcestore.indexer.WARCRecordToSearchResultAdapter.adapt(WARCRecordToSearchResultAdapter.java:53)
    at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:57)
    at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
    at org.archive.wayback.resourcestore.indexer.IndexWorker.main(IndexWorker.java:216)
Just noting this related issue: #162
The indexer only does this parsing to poke around for robots.txt assertions that most (any?) of us don't make use of. Maybe we should modify things so it's optional/off-by-default?
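To make the "optional/off-by-default" idea concrete, here is a rough sketch of what such a switch could look like. Everything in it is a stand-in: the class OptionalRobotAnnotation, the parseRobotFlags flag, the HtmlScanner interface and the "-" placeholder are all assumptions for illustration, not wayback's actual classes or configuration. Only the method name annotateHTTPContent is borrowed from the stack traces.

```java
/** Illustrative only: a hypothetical off-by-default switch around the HTML scan. */
public class OptionalRobotAnnotation {

    /** Stand-in for the index record; "-" is an assumed placeholder for "no flags". */
    public static final class Result {
        public String robotFlags = "-";
    }

    /** Stand-in for the expensive HTML lexing pass seen in the thread dumps. */
    public interface HtmlScanner {
        String scanForRobotFlags(String html);
    }

    public static final class Annotater {
        private final boolean parseRobotFlags; // hypothetical flag, false by default
        private final HtmlScanner scanner;

        public Annotater(boolean parseRobotFlags, HtmlScanner scanner) {
            this.parseRobotFlags = parseRobotFlags;
            this.scanner = scanner;
        }

        public void annotateHTTPContent(Result result, String html) {
            if (!parseRobotFlags) {
                return; // skip the lexer entirely; the record keeps its placeholder
            }
            result.robotFlags = scanner.scanForRobotFlags(html);
        }
    }

    public static void main(String[] args) {
        HtmlScanner scanner = html -> "A"; // dummy scanner returning a fixed flag
        Result skipped = new Result();
        new Annotater(false, scanner).annotateHTTPContent(skipped, "<html></html>");
        Result scanned = new Result();
        new Annotater(true, scanner).annotateHTTPContent(scanned, "<html></html>");
        System.out.println(skipped.robotFlags + " " + scanned.robotFlags);
    }
}
```

With the flag off the HTML is never handed to the lexer, so a record that trips up the parser would no longer be able to stall the indexer, at the cost of losing the robot flags in the index.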
Just for testing I put a return; at the front of annotateHTTPContent, before robotFlags.reset();
(