bitrich-info/xchange-stream

Kraken WebSocket reconnection leading to out of memory exception

Closed this issue · 29 comments

Hello,

yesterday (14 July 2019) Kraken had an official outage.
The websocket service was disconnected and xchange-stream continuously
tried to reconnect at short intervals.

A huge number of reconnection attempts led to the following exception:

2019-07-14 21:08:46 | ERROR | [nioEventLoopGroup-675-1] i.b.x.s.n.WebSocketClientHandler:66 - WebSocket Client failed to connect. Invalid handshake response getStatus: 503 Service Temporarily Unavailable
2019-07-14 21:08:47 | WARN | [nioEventLoopGroup-675-1] i.b.x.k.KrakenStreamingService:211 - Problem with connection: class io.netty.handler.codec.http.websocketx.WebSocketHandshakeException - Invalid handshake response getStatus: 503 Service Temporarily Unavailable
2019-07-14 21:08:47 | WARN | [nioEventLoopGroup-681-1] i.b.x.k.KrakenStreamingService:213 - Problem with connection
java.lang.IllegalStateException: executor not accepting a task
at io.netty.resolver.AddressResolverGroup.getResolver(AddressResolverGroup.java:60)
at io.netty.bootstrap.Bootstrap.doResolveAndConnect0(Bootstrap.java:200)
at io.netty.bootstrap.Bootstrap.access$000(Bootstrap.java:49)
at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:188)
at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:174)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:502)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:476)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:415)
at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:540)
at io.netty.util.concurrent.DefaultPromise.setSuccess0(DefaultPromise.java:529)
at io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:101)
at io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:84)
at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetSuccess(AbstractChannel.java:1009)
at io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:517)
at io.netty.channel.AbstractChannel$AbstractUnsafe.access$200(AbstractChannel.java:428)
at io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:487)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:405)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)

After two hours the process ran out of memory and exited with:
2019-07-14 21:11:25 | ERROR | [Thread-5] c.e.w.CandleWriter:67 - Stack Trace: java.lang.OutOfMemoryError: Java heap space

Am I missing something? Maybe some disconnect or something?
Could there be something in xchange-stream?

How could I continue the investigation?

Thank you and regards,
Jure

#269 added support for adding a RateLimiter or TimedSemaphore to limit the reconnection rate (this, like enforcing rate limits for REST calls, is the caller's responsibility, so xchange-stream only provides a hook). You can use this now.

I believe #340 should prevent reconnections leaking resources. This hasn't been reviewed yet.

I'm not so familiar with xchange-stream yet.
OK, so you suggest using this for the time being:

static final TimedSemaphore limiter = new TimedSemaphore(1, MINUTES, 15);

ExchangeSpecification spec = exchange.getDefaultExchangeSpecification();
spec.setExchangeSpecificParameters(ImmutableMap.of(
        StreamingParameters.ON_BEFORE_CONNECT,
        () -> limiter.acquire()
));

and when #340 is merged, it should be OK without that as well?

I am not seeing the string constant StreamingParameters.ON_BEFORE_CONNECT anywhere.
What am I missing?

There is a common Bitrich issue related to the Netty implementation. Sometimes Netty objects leak. I took a heap dump on OutOfMemory and found a lot of leaked Netty objects in it. This always happened only when the remote socket server went down. There is some issue in the failed-resubscribe logic in NettyStreamingService, but I did not understand how it happened. I saw OutOfMemory on the Bitstamp and Bitfinex connectors in my production, and each time it was Netty.

From my perspective Netty has some memory leak issue. Calling eventLoopGroup.shutdownGracefully() causes connections to close, but not quickly enough. Maybe there is some case in NettyStreamingService where Netty connections are restarted too quickly, but I didn't find it. Sorry.

@jurepetrovic Did you do any heap dump analysis? Which objects are leaked in your case?

I saw this recently with Bitfinex too when they had an outage.

I changed the way disconnect/reconnect works relatively recently so that the same EventLoopGroup is reused on reconnection (see here). There's no shutdownGracefully() unless you explicitly disconnect(), so I don't think it's related to shutdownGracefully().

This issue could probably be simulated in testing using a proxy set up to start throwing HTTP errors after a certain time has passed. At least then we could debug what's going on.
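As a rough starting point for that simulation, here is a minimal sketch of a local HTTP server that rejects every request, including websocket upgrade handshakes, with 503, the same status Kraken returned during the outage. This is illustrative only: the class name and port handling are hypothetical, and a real proxy would sit in front of the exchange rather than replace it. Pointing the streaming client at ws://localhost:&lt;port&gt; should trigger the same reconnect loop seen in the logs above.

```java
import com.sun.net.httpserver.HttpServer;

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;

public class OutageSimulator {

    // Start a local server that rejects every request with 503, mimicking
    // the "Service Temporarily Unavailable" responses seen during the outage.
    static HttpServer start(int port) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/", exchange -> {
            byte[] body = "Service Temporarily Unavailable".getBytes();
            exchange.sendResponseHeaders(503, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws Exception {
        HttpServer server = start(0); // port 0 = pick a free ephemeral port
        int port = server.getAddress().getPort();
        // A streaming client pointed at ws://localhost:<port> would now fail
        // its handshake with 503 on every reconnect attempt. Here we just
        // verify the behaviour with a plain HTTP request.
        HttpURLConnection conn =
                (HttpURLConnection) new URL("http://localhost:" + port + "/").openConnection();
        System.out.println(conn.getResponseCode()); // prints 503
        server.stop(0);
    }
}
```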

Has anyone tried to simulate the error like @badgerwithagun suggested?

We caught this issue again on the 14th of September. I have analyzed the heap dump and see the same picture: a huge number of leaked EventLoopGroup objects.

Same here. And I haven't yet had the time to try the solution with the semaphores. Is anyone using it and happy with it?

Thanks,
Jure

We're not leaking it in an obvious way - we don't even create a new one unless the existing one has already been shut down.

I suspect what's more likely going on here is that there are a load of other objects maintaining references to the old, shut-down EventLoopGroups. That is... we're leaking something else.

Are there lots of other Netty objects on the heap, @pchertalev? That heap dump is probably the best source of evidence right now.

I know how to reproduce this issue 100% of the time - I changed the Kraken URL from
private static final String API_URI = "wss://ws.kraken.com";
to
private static final String API_URI = "wss://api.bitfinex.com";
And I got a lot of leaking NioEventLoop objects

I see a lot of reconnects within a single second:

17:22:31.956 [nioEventLoopGroup-2-2] INFO  info.bitrich.xchangestream.service.netty.WebSocketClientHandler - WebSocket Client disconnected!
17:22:31.956 [nioEventLoopGroup-2-2] INFO  info.bitrich.xchangestream.kraken.KrakenStreamingService - Reopening websocket because it was closed
17:22:31.956 [nioEventLoopGroup-2-2] INFO  info.bitrich.xchangestream.kraken.KrakenStreamingService - Scheduling reconnection
17:22:31.956 [nioEventLoopGroup-2-2] INFO  info.bitrich.xchangestream.service.netty.WebSocketClientHandler - WebSocket Client disconnected!
17:22:31.956 [nioEventLoopGroup-2-2] INFO  info.bitrich.xchangestream.kraken.KrakenStreamingService - Reopening websocket because it was closed
17:22:31.956 [nioEventLoopGroup-2-2] INFO  info.bitrich.xchangestream.kraken.KrakenStreamingService - Scheduling reconnection
17:22:31.956 [nioEventLoopGroup-2-2] INFO  info.bitrich.xchangestream.service.netty.WebSocketClientHandler - WebSocket Client disconnected!
17:22:31.956 [nioEventLoopGroup-2-2] INFO  info.bitrich.xchangestream.kraken.KrakenStreamingService - Reopening websocket because it was closed
17:22:31.956 [nioEventLoopGroup-2-2] INFO  info.bitrich.xchangestream.kraken.KrakenStreamingService - Scheduling reconnection
17:22:31.956 [nioEventLoopGroup-2-2] INFO  info.bitrich.xchangestream.service.netty.WebSocketClientHandler - WebSocket Client disconnected!
17:22:31.956 [nioEventLoopGroup-2-2] INFO  info.bitrich.xchangestream.kraken.KrakenStreamingService - Reopening websocket because it was closed
17:22:31.956 [nioEventLoopGroup-2-2] INFO  info.bitrich.xchangestream.kraken.KrakenStreamingService - Scheduling reconnection
17:22:31.957 [nioEventLoopGroup-2-2] INFO  info.bitrich.xchangestream.service.netty.WebSocketClientHandler - WebSocket Client disconnected!
17:22:31.957 [nioEventLoopGroup-2-2] INFO  info.bitrich.xchangestream.kraken.KrakenStreamingService - Reopening websocket because it was closed
17:22:31.957 [nioEventLoopGroup-2-2] INFO  info.bitrich.xchangestream.kraken.KrakenStreamingService - Scheduling reconnection
17:22:31.957 [nioEventLoopGroup-2-2] INFO  info.bitrich.xchangestream.service.netty.WebSocketClientHandler - WebSocket Client disconnected!
17:22:31.957 [nioEventLoopGroup-2-2] INFO  info.bitrich.xchangestream.kraken.KrakenStreamingService - Reopening websocket because it was closed
17:22:31.957 [nioEventLoopGroup-2-2] INFO  info.bitrich.xchangestream.kraken.KrakenStreamingService - Scheduling reconnection
17:22:31.957 [nioEventLoopGroup-2-2] INFO  info.bitrich.xchangestream.service.netty.WebSocketClientHandler - WebSocket Client disconnected!
17:22:31.957 [nioEventLoopGroup-2-2] INFO  info.bitrich.xchangestream.kraken.KrakenStreamingService - Reopening websocket because it was closed
17:22:31.957 [nioEventLoopGroup-2-2] INFO  info.bitrich.xchangestream.kraken.KrakenStreamingService - Scheduling reconnection
17:22:31.957 [nioEventLoopGroup-2-2] INFO  info.bitrich.xchangestream.service.netty.WebSocketClientHandler - WebSocket Client disconnected!
17:22:31.957 [nioEventLoopGroup-2-2] INFO  info.bitrich.xchangestream.kraken.KrakenStreamingService - Reopening websocket because it was closed
17:22:31.957 [nioEventLoopGroup-2-2] INFO  info.bitrich.xchangestream.kraken.KrakenStreamingService - Scheduling reconnection
17:22:31.957 [nioEventLoopGroup-2-2] INFO  info.bitrich.xchangestream.service.netty.WebSocketClientHandler - WebSocket Client disconnected!
17:22:31.957 [nioEventLoopGroup-2-2] INFO  info.bitrich.xchangestream.kraken.KrakenStreamingService - Reopening websocket because it was closed

There is no pause between reconnects, and the number of reconnects keeps growing! Something exploded in bitrich + Netty!
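For what it's worth, one common way to avoid a zero-delay reconnect storm like this is exponential backoff between attempts. The following is only a library-free sketch of the idea, not xchange-stream's actual reconnect logic; the method name and constants are illustrative:

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

public class ReconnectBackoff {

    // Delay before the given reconnect attempt: base * 2^attempt, capped at max.
    static Duration delayFor(int attempt, Duration base, Duration max) {
        long millis = base.toMillis() << Math.min(attempt, 20); // clamp shift to avoid overflow
        return millis > max.toMillis() ? max : Duration.ofMillis(millis);
    }

    public static void main(String[] args) {
        List<Long> seconds = new ArrayList<>();
        for (int attempt = 0; attempt < 8; attempt++) {
            seconds.add(delayFor(attempt, Duration.ofSeconds(1), Duration.ofSeconds(30)).getSeconds());
        }
        System.out.println(seconds); // prints [1, 2, 4, 8, 16, 30, 30, 30]
    }
}
```

With a schedule like this, even a multi-hour outage produces one attempt every 30 seconds instead of thousands per second.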

I have found where the problem is :)

I have created PR with OutOfMemory fix

I believe we can close this issue? @jurepetrovic

Ok :-)
Thank you for your cooperation, gentlemen!

We're drop dead busy with some client software right now, so I hope to integrate it sometime next week...

Hello,

@pchertalev @badgerwithagun
I'm really sorry to write about this, but it happened again.
Is it only me this time?
I got 502 Bad Gateway and then reconnecting continued until Out of Memory...
Attaching log snippet here

@jurepetrovic did you use the fixed bitrich version?

I see a lot of new nioEventLoopGroup-* threads in your log. I can't reproduce such behavior for a single open socket; I see the same two nioEventLoopGroup-2-1/2 threads in the log the whole time.
How many Kraken sockets do you have open in parallel?
Did you have a chance to look at a heap dump?

I can't find any line in the code with the text "Reopening websocket because it was closed by the host" (from your log snippet). Are you sure you are using the correct version?

Hello,

I used the version from 18 September, where you deleted the scheduleReconnect() call.
I didn't get a heap dump, as it was running on a production server...


Nobody else experienced this issue again?
I guess I'll have to re-check versions and merges and packaging...

Rgds,
Jure

I didn't get heap dump, as it was running on production server...

I recommend using the -XX:+HeapDumpOnOutOfMemoryError and -XX:HeapDumpPath options in production; from my perspective it's a mandatory option for production. :)
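For example (the dump path and jar name below are placeholders; substitute your own):

```shell
java -XX:+HeapDumpOnOutOfMemoryError \
     -XX:HeapDumpPath=/var/dumps/myapp \
     -jar my-trading-app.jar
```

With these flags the JVM writes a .hprof file to the given path when an OutOfMemoryError occurs, which can then be opened in a heap analyzer.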

Do you use a forked branch in production? https://github.com/jurepetrovic/xchange-stream
Could you provide the commit id of your production version?

OK, I will add these options.
Let's call it pre-production phase ;-)

Yes - I used this commit:
jurepetrovic@8fd1bfe

But I need to verify anyway whether something happened during packaging.

I have checked out your branch at that revision, and the code doesn't match your log snippet.
There is no such line in the code:
2019-09-25 13:37:07 | INFO | [nioEventLoopGroup-17-1] i.b.x.k.KrakenStreamingService:417 - Reopening websocket because it was closed by the host

Jeez!! :(
Seems like I forgot to update the version in build.gradle and then compiled
against the old version. Sorry, gentlemen, for troubling you.
Your help and support is much appreciated.