Findwise/Hydra

Stages discarding documents in a parallel pipeline makes other stages log errors


When multiple stages are doing work on a document and one of them discards the document, other stages working on the same document will attempt to persist it and fail.

Ideally, stages would know if another stage has discarded the working document, and be able to act on that (perhaps by simply ignoring the document). Documents would need to remain in the documents collection for that to work, I think, and no new stages should be able to fetch the document.
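
A minimal sketch of that idea, under stated assumptions: `SketchDocumentStore`, `SaveResult` and every method name below are hypothetical stand-ins, not Hydra's actual core or database API. It only illustrates keeping a discarded document in the collection, hiding it from new fetches, and letting a late save tell "discarded" apart from "not found".

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical in-memory stand-in for the documents collection (not Hydra code).
class SketchDocumentStore {
    enum SaveResult { SAVED, IGNORED_DISCARDED, NOT_FOUND }

    private static final class Entry {
        String content;
        boolean discarded;
        Entry(String content) { this.content = content; }
    }

    private final Map<String, Entry> documents = new HashMap<>();

    void insert(String id, String content) {
        documents.put(id, new Entry(content));
    }

    // Marks the document as discarded but keeps it in the collection.
    void discard(String id) {
        Entry entry = documents.get(id);
        if (entry != null) {
            entry.discarded = true;
        }
    }

    // New fetches never hand out a discarded document.
    Optional<String> fetch(String id) {
        Entry entry = documents.get(id);
        if (entry == null || entry.discarded) {
            return Optional.empty();
        }
        return Optional.of(entry.content);
    }

    // A stage that was already working on the document can see that it was
    // discarded and skip the save quietly, instead of hitting a 404.
    SaveResult save(String id, String content) {
        Entry entry = documents.get(id);
        if (entry == null) {
            return SaveResult.NOT_FOUND;
        }
        if (entry.discarded) {
            return SaveResult.IGNORED_DISCARDED;
        }
        entry.content = content;
        return SaveResult.SAVED;
    }
}
```

With a result like `IGNORED_DISCARDED`, a stage could log at debug level and move on, reserving error logging for the genuine `NOT_FOUND` case.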

The current behaviour yields logs filled with:

2013-07-05 13:48:53 : INFO   : Thread-188 : com.findwise.hydra.StreamLogger : Received message from stage systest-strip-html (stdin): DEBUG Saving document to RemotePipeline..
2013-07-05 13:48:53 : INFO   : Thread-188 : com.findwise.hydra.StreamLogger : Received message from stage systest-strip-html (stdin): ERROR Node gave an unexpected response: HTTP/1.1 404 Not Found
2013-07-05 13:48:53 : INFO   : Thread-188 : com.findwise.hydra.StreamLogger : Received message from stage systest-strip-html (stdin): ERROR Message: No document found matching your query
2013-07-05 13:48:53 : INFO   : Thread-188 : com.findwise.hydra.StreamLogger : Received message from stage systest-strip-html (stdin): ERROR Node gave an unexpected response: HTTP/1.1 404 Not Found
2013-07-05 13:48:53 : INFO   : Thread-188 : com.findwise.hydra.StreamLogger : Received message from stage systest-strip-html (stdin): ERROR Message: No document found matching your query
2013-07-05 13:48:53 : INFO   : Thread-188 : com.findwise.hydra.StreamLogger : Received message from stage systest-strip-html (stdin): $STACKTRACE$ 
2013-07-05 13:48:53 : INFO   : Thread-188 : com.findwise.hydra.StreamLogger : Received message from stage systest-strip-html (stdin): ERROR Unable to persist an error to the database
2013-07-05 13:48:53 : INFO   : Thread-188 : com.findwise.hydra.StreamLogger : Received message from stage systest-strip-html (stdin): java.io.IOException: Unable to save changes to core
2013-07-05 13:48:53 : INFO   : Thread-188 : com.findwise.hydra.StreamLogger : Received message from stage systest-strip-html (stdin):   at com.findwise.hydra.stage.AbstractProcessStage.run(AbstractProcessStage.java:114)
2013-07-05 13:48:53 : INFO   : Thread-188 : com.findwise.hydra.StreamLogger : Received message from stage systest-strip-html (stdin): 
2013-07-05 13:48:53 : INFO   : Thread-188 : com.findwise.hydra.StreamLogger : Received message from stage systest-strip-html (stdin): $STACKTRACE$ 

Yeah, would be good to fix this, but a bit tricky to see how. Stages that
discard documents are very similar to output stages, in the sense that once
you discard/accept a document, it is dead. I think I am correct (but I am
unsure) that you cannot have multiple output stages, since they would
conflict?

Also, I don't like the fact that it is logging to stdout (mistakenly tagged
as "stdin", fixed directly in master: 350c16c).

/Petter

(In reply to Olof Nilsson's report of Fri, Jul 5, 2013, 2:01 PM.)

I think I am correct (but I am unsure) that you cannot have multiple output stages, since they would conflict?

That is true. Both discarded and processed are extra states that a document can get, and either one lets it be moved to oldDocuments. The routing logic in Hydra isn't very flexible...

Also, I don't like the fact that it is logging to stdout

Agreed, it's an old stage that should be updated to work with the new logging, but I haven't gotten around to it yet.

Any thoughts on solving this in the API rewrite, @remen?

The issue "AbstractProcessStage will attempt to persist discarded documents" actually had a more accurate name.

I suggested just overriding the persist method in the implementation as a quick fix. That wouldn't introduce any strange bugs, would it..?

I'm not sure I understand the problem since I don't really understand the behavior of (or the use-case for) parallel pipelines very well.

First of all, if you have a linear pipeline, is it still a problem? In that case the solution strategy is simple (though the implementation may not be): just don't give a stage a document if it is discarded (a change in the core).

If the problem is only when you have concurrent stages then I think the pipeline is broken. If a stage depends on whether a document is discarded or not, then certainly it must come after the stage that discards it. Or can you show me a use case where that doesn't make sense?

@ebbesson No, not as such, since this has more to do with how parallel pipelines and the core interact. However, I am thinking about how to let a stage discard a document and stop processing when it doesn't have access to the RemotePipeline class, and my current best solution is for it to throw a DiscardDocumentException. I'd like to have a bigger discussion on this and some other ideas I have, but I would like a more efficient way than over GitHub. Maybe a tech talk or something when I get back to work in February?
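
A rough sketch of how the exception-based discard could look. Only the name `DiscardDocumentException` comes from the comment above; `SketchStage`, `SketchStageRunner` and the use of `StringBuilder` as a document are assumptions made for illustration, not the proposed API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical exception type; only the name is taken from the discussion.
class DiscardDocumentException extends Exception {
    DiscardDocumentException(String reason) {
        super(reason);
    }
}

// Hypothetical stage contract: the stage sees only the document,
// with no access to RemotePipeline.
interface SketchStage {
    void process(StringBuilder doc) throws DiscardDocumentException;
}

// Hypothetical runner that owns the save/discard decision.
class SketchStageRunner {
    final List<String> saved = new ArrayList<>();
    final List<String> discarded = new ArrayList<>();

    void run(SketchStage stage, String content) {
        StringBuilder doc = new StringBuilder(content);
        try {
            stage.process(doc);
            // Normal return: persist the (possibly modified) document.
            saved.add(doc.toString());
        } catch (DiscardDocumentException e) {
            // Discard requested: skip the save entirely, so there is
            // no second write against an already-discarded document.
            discarded.add(content);
        }
    }
}
```

Because the runner decides between saving and discarding in one place, a stage cannot discard a document and then fall through to a commit.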

Or is the problem that even this will fail (since it first sets discarded, then returns and therefore commits):

public void process(LocalDocument doc) throws ... {
    getRemotePipeline().discard(doc);
}

If so, this is fixed by my proposed API changes.

[UPDATE] Yes, this bug is #146 😄

@simonstenstrom Sounds like a good workaround. Store an "isDiscarded" boolean in the stage and override persist to look at it. Should work. But then again, the error itself ain't that bad, so maybe it's not worth it?
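
Roughly, the workaround might look like the sketch below. `SketchProcessStage` is a hypothetical stand-in for the stage base class; the real AbstractProcessStage may have different method names and signatures, and the discard rule here is arbitrary.

```java
// Hypothetical stand-in for a stage base class (not Hydra's AbstractProcessStage).
abstract class SketchProcessStage {
    int persistCalls = 0;

    abstract void process(String doc);

    // Stands in for the save that currently fails once the document is gone.
    protected void persist(String doc) {
        persistCalls++;
    }

    // Process first, then persist: the ordering that causes the problem.
    final void run(String doc) {
        process(doc);
        persist(doc);
    }
}

class DiscardingStage extends SketchProcessStage {
    // Reset on every document, so one discard does not block later saves.
    private boolean isDiscarded;

    @Override
    void process(String doc) {
        // Arbitrary rule for the sketch: empty documents are discarded.
        isDiscarded = doc.isEmpty();
    }

    @Override
    protected void persist(String doc) {
        if (isDiscarded) {
            return; // this stage discarded the document, so skip the save
        }
        super.persist(doc);
    }
}
```

The flag only covers documents discarded by the stage itself and assumes the stage handles one document at a time; it does not help a stage whose document was discarded by another stage running in parallel.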

It will, however, spam the logs at error level for every document that is discarded. I guess it could be resolved by turning off that logger in logback.