Findwise/Hydra

Writing duplicate id into oldDocuments causes Hydra core to crash due to unhandled exception

Closed this issue · 4 comments

As provided by the stacktrace
Affected versions 0.4.0

2014-06-03 11:11:43,543 [I/O dispatcher 5] ERROR com.findwise.hydra.net.HttpRESTHandler - Unhandled exception occurred
com.mongodb.MongoException: duplicate key insert for unique index of capped collection
    at com.mongodb.CommandResult.getException(CommandResult.java:100) ~[hydra-core.jar:na]
    at com.mongodb.CommandResult.throwOnError(CommandResult.java:134) ~[hydra-core.jar:na]
    at com.mongodb.DBTCPConnector._checkWriteError(DBTCPConnector.java:142) ~[hydra-core.jar:na]
    at com.mongodb.DBTCPConnector.say(DBTCPConnector.java:183) ~[hydra-core.jar:na]
    at com.mongodb.DBTCPConnector.say(DBTCPConnector.java:155) ~[hydra-core.jar:na]
    at com.mongodb.DBApiLayer$MyCollection.insert(DBApiLayer.java:270) ~[hydra-core.jar:na]
    at com.mongodb.DBApiLayer$MyCollection.insert(DBApiLayer.java:226) ~[hydra-core.jar:na]
    at com.mongodb.DBCollection.insert(DBCollection.java:75) ~[hydra-core.jar:na]
    at com.mongodb.DBCollection.insert(DBCollection.java:59) ~[hydra-core.jar:na]
    at com.mongodb.DBCollection.insert(DBCollection.java:104) ~[hydra-core.jar:na]
    at com.findwise.hydra.mongodb.MongoDocumentIO.writeToOldDocuments(MongoDocumentIO.java:547) ~[hydra-core.jar:na]
    at com.findwise.hydra.mongodb.MongoDocumentIO.markDone(MongoDocumentIO.java:522) ~[hydra-core.jar:na]
    at com.findwise.hydra.mongodb.MongoDocumentIO.markProcessed(MongoDocumentIO.java:577) ~[hydra-core.jar:na]
    at com.findwise.hydra.CachingDocumentNIO.markProcessed(CachingDocumentNIO.java:127) ~[hydra-core.jar:na]
    at com.findwise.hydra.net.MarkHandler.mark(MarkHandler.java:108) ~[hydra-core.jar:na]
    at com.findwise.hydra.net.MarkHandler.handle(MarkHandler.java:74) ~[hydra-core.jar:na]
    at com.findwise.hydra.net.HttpRESTHandler.dispatch(HttpRESTHandler.java:89) ~[hydra-core.jar:na]
    at com.findwise.hydra.net.HttpRESTHandler.handle(HttpRESTHandler.java:110) ~[hydra-core.jar:na]
    at org.apache.http.nio.protocol.BasicAsyncRequestHandler.handle(BasicAsyncRequestHandler.java:68) [hydra-core.jar:na]
    at org.apache.http.nio.protocol.BasicAsyncRequestHandler.handle(BasicAsyncRequestHandler.java:47) [hydra-core.jar:na]
    at org.apache.http.nio.protocol.HttpAsyncService.processRequest(HttpAsyncService.java:521) [hydra-core.jar:na]
    at org.apache.http.nio.protocol.HttpAsyncService.inputReady(HttpAsyncService.java:288) [hydra-core.jar:na]
    at org.apache.http.impl.nio.DefaultNHttpServerConnection.consumeInput(DefaultNHttpServerConnection.java:199) [hydra-core.jar:na]
    at org.apache.http.impl.nio.DefaultHttpServerIODispatch.onInputReady(DefaultHttpServerIODispatch.java:117) [hydra-core.jar:na]
    at org.apache.http.impl.nio.DefaultHttpServerIODispatch.onInputReady(DefaultHttpServerIODispatch.java:50) [hydra-core.jar:na]
    at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:112) [hydra-core.jar:na]
    at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:160) [hydra-core.jar:na]
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:342) [hydra-core.jar:na]
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:320) [hydra-core.jar:na]
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:280) [hydra-core.jar:na]
    at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:106) [hydra-core.jar:na]
    at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:604) [hydra-core.jar:na]
    at java.lang.Thread.run(Thread.java:662) [na:1.6.0_30]
2014-06-03 11:11:43,543 [I/O dispatcher 5] ERROR com.findwise.hydra.net.HttpResponseWriter - Printing Unhandled Exception
2014-06-03 11:11:43,558 [Thread-94] INFO ch.qos.logback.classic.net.SocketNode - Caught java.net.SocketException closing connection.
2014-06-03 11:11:43,558 [Thread-128] INFO ch.qos.logback.classic.net.SocketNode - Caught java.net.SocketException closing connection.
2014-06-03 11:11:43,558 [Thread-52] INFO ch.qos.logback.classic.net.SocketNode - Caught java.net.SocketException closing connection.
2014-06-03 11:11:43,558 [Thread-125] INFO ch.qos.logback.classic.net.SocketNode - Caught java.net.SocketException closing connection.

Should be easy to add an exception handler for MongoException to the oldDocuments.insert() call.
However, how did the duplicate ID happen? Manual modification of oldDocuments, or something else?

It's not quite clear to me atm.
We're experiencing that documents are randomly not being released by some stages. (That might be a bigger issue though)

To mitigate this we're resetting their states by modifying their state in mongo if they have been stuck too long. This is done in a rather naive way, so unfortunate reset can lead to this, but I'm not sure.

We're experiencing that documents are randomly not being released by some stages. (That might be a bigger issue though)

Sounds bigger! What do you mean 'released'? Not marked as touched? Depending on the queries for your stages, stages are free to mark documents as processed, discarded or failed even if another stage has fetched-not-touched. But if you don't have such concurrent stages, it shouldn't happen.

To mitigate this we're resetting their states by modifying their state in mongo if they have been stuck too long. This is done in a rather naive way, so unfortunate reset can lead to this, but I'm not sure.

Documents are removed from the documents collection before being inserted into oldDocuments, IIRC. How are you modifying the documents? Make sure the modification is atomic.

I've seen this concurrency issue when Hydra was under heavy load and all documents were picked up by two mutually exclusive stages; one stage processed and output the document, the other stage wanted to fail the document.

Fix incoming.