openzipkin/zipkin-gcp

Timeout in StackdriverSender kills the flushThread of AsyncReporter

pverkest opened this issue · 4 comments

We're seeing the following exception, which causes our zipkin-reporter-service to stop sending traces to Stackdriver:

2019-05-17 06:42:10.486  WARN ... z.r.AsyncReporter$BoundedAsyncReporter   : Unexpected error flushing spans

java.lang.IllegalStateException: timeout waiting for onClose. timeoutMs=5000, resultSet=false
        at zipkin2.reporter.stackdriver.internal.AwaitableUnaryClientCallListener.await(AwaitableUnaryClientCallListener.java:45)
        at zipkin2.reporter.stackdriver.internal.UnaryClientCall.doExecute(UnaryClientCall.java:46)
        at zipkin2.Call$Base.execute(Call.java:379)
        at zipkin2.Call$Mapping.doExecute(Call.java:237)
        at zipkin2.Call$Base.execute(Call.java:379)
        at zipkin2.reporter.AsyncReporter$BoundedAsyncReporter.flush(AsyncReporter.java:286)
        at zipkin2.reporter.AsyncReporter$Builder$1.run(AsyncReporter.java:190)

Exception in thread "AsyncReporter{StackdriverSender{parts-prod}}" java.lang.IllegalStateException: timeout waiting for onClose. timeoutMs=5000, resultSet=false
        at zipkin2.reporter.stackdriver.internal.AwaitableUnaryClientCallListener.await(AwaitableUnaryClientCallListener.java:45)
        at zipkin2.reporter.stackdriver.internal.UnaryClientCall.doExecute(UnaryClientCall.java:46)
        at zipkin2.Call$Base.execute(Call.java:379)
        at zipkin2.Call$Mapping.doExecute(Call.java:237)
        at zipkin2.Call$Base.execute(Call.java:379)
        at zipkin2.reporter.AsyncReporter$BoundedAsyncReporter.flush(AsyncReporter.java:286)
        at zipkin2.reporter.AsyncReporter$Builder$1.run(AsyncReporter.java:190)

The timeout in AwaitableUnaryClientCallListener throws an IllegalStateException, which causes the AsyncReporter flushThread to stop sending spans.
Is it possible to use another exception type and to ignore the spans that cause the timeout instead of aborting the thread?

IOException is probably appropriate.
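A minimal sketch of that idea, in plain Java with no zipkin dependency: wrap the blocking call and rethrow the timeout as an IOException, so a caller that already tolerates IOException drops the batch instead of dying. `BlockingCall` and `executeTranslatingTimeout` are hypothetical names, not part of zipkin-gcp or zipkin-reporter-java, and matching on the message text is brittle; it is used here only because the log message above is the one thing that identifies the timeout.

```java
import java.io.IOException;

class TimeoutAsIOException {
  /** Hypothetical stand-in for a blocking send call; not a zipkin type. */
  interface BlockingCall<V> {
    V execute() throws IOException;
  }

  /**
   * Runs the call and converts the "timeout waiting for onClose" failure
   * into an IOException. Any other IllegalStateException is rethrown as-is.
   */
  static <V> V executeTranslatingTimeout(BlockingCall<V> call) throws IOException {
    try {
      return call.execute();
    } catch (IllegalStateException e) {
      String message = e.getMessage();
      // Assumption: the timeout is identified by the message prefix seen in the log.
      if (message != null && message.startsWith("timeout waiting for onClose")) {
        throw new IOException(message, e);
      }
      throw e;
    }
  }

  public static void main(String[] args) {
    try {
      executeTranslatingTimeout(() -> {
        throw new IllegalStateException(
            "timeout waiting for onClose. timeoutMs=5000, resultSet=false");
      });
    } catch (IOException e) {
      System.out.println("translated to IOException: " + e.getMessage());
    }
  }
}
```

The original exception is kept as the cause, so the stack trace pointing at the await call is not lost.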

We're also seeing this issue (Spring Boot 2.1.5, spring-cloud-gcp-starter-trace project, OpenJDK 11). Increasing the timeout to 60s has no effect.

Are there any known workarounds?

Thanks

@adriancole IllegalStateException is a runtime exception, though, and both IOException and RuntimeException seem to be handled in zipkin-reporter-java's AsyncReporter. However, IllegalStateException is then handled specially and rethrown.

Would you recommend special handling in zipkin-gcp to convert IllegalStateException to another type when sending spans? Alternatively, perhaps AsyncReporter does not need to rethrow every IllegalStateException?
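To make the two options concrete, here is a simplified model of the behavior described above. It is not the actual AsyncReporter code: `FlushLoopSketch`, `Flusher`, and the `rethrowIllegalState` flag are hypothetical names used only for illustration. The loop drops a batch on IOException or RuntimeException, and the flag controls whether an IllegalStateException additionally escapes and ends the loop, as it ends the flush thread in the report above.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class FlushLoopSketch {
  /** Hypothetical stand-in for one flush of a batch of spans. */
  interface Flusher {
    void flush(int batch) throws IOException;
  }

  /** Returns the batch numbers that were attempted before the loop ended. */
  static List<Integer> run(Flusher flusher, int batches, boolean rethrowIllegalState) {
    List<Integer> attempted = new ArrayList<>();
    try {
      for (int batch = 0; batch < batches; batch++) {
        attempted.add(batch);
        try {
          flusher.flush(batch);
        } catch (IOException | RuntimeException e) {
          // The failed batch is dropped in either mode.
          if (rethrowIllegalState && e instanceof IllegalStateException) {
            throw (IllegalStateException) e; // escapes the loop
          }
        }
      }
    } catch (IllegalStateException e) {
      // In a real thread this would propagate out of run() and end the thread;
      // it is swallowed here only so the sketch can report what was attempted.
    }
    return attempted;
  }

  public static void main(String[] args) {
    // Batch 1 times out; every other batch succeeds.
    Flusher flaky = batch -> {
      if (batch == 1) throw new IllegalStateException("timeout waiting for onClose");
    };
    System.out.println("rethrowing: " + run(flaky, 4, true));
    System.out.println("not rethrowing: " + run(flaky, 4, false));
  }
}
```

With the flag set, batches after the timed-out one are never attempted; with it cleared, only the timed-out batch is lost and the loop continues.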

Agreed that AsyncReporter shouldn't rethrow every IllegalStateException. Sent openzipkin/zipkin-reporter-java#166 to fix this.