[PROD] [CUSTOMER] Healthcheck for IMAP
chibenwa opened this issue · 4 comments
Why?
reactor.core.Exceptions$ErrorCallbackNotImplemented: org.apache.james.imapserver.netty.ReactiveThrottler$RejectedException: The IMAP server has reached its maximum capacity (concurrent requests: 200, queue size: 4096)
Caused by: org.apache.james.imapserver.netty.ReactiveThrottler$RejectedException: The IMAP server has reached its maximum capacity (concurrent requests: 200, queue size: 4096)
at org.apache.james.imapserver.netty.ReactiveThrottler.throttle(ReactiveThrottler.java:81)
at org.apache.james.imapserver.netty.ImapChannelUpstreamHandler.channelRead(ImapChannelUpstreamHandler.java:373)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1475)
at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1338)
at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1387)
at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:529)
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:468)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:93)
at org.apache.james.imapserver.netty.HAProxyMessageHandler.channelRead(HAProxyMessageHandler.java:85)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at java.base/java.lang.Thread.run(Unknown Source)
One pod was stuck like this, so traffic was partially degraded (1 pod failing). It does not happen often (once in 3 months).
While we should seriously investigate the root cause (#5246!), the topic is complex and I would like an operational workaround in the meantime.
The idea: have a healthcheck that detects this state and that we could aggregate into the liveness probe (cf. #5244) until we actually fix the issue!
What
Add a healthcheck that, for all IMAP servers, ensures the reactive throttlers are not full.
Not full -> OK
one full -> degraded
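The mapping above could be sketched as follows. This is a minimal, self-contained Java illustration, not the James HealthCheck API: `ImapThrottlerHealth`, `ThrottlerSnapshot`, and the definition of "full" (queued requests reaching the maximum queue size, as the rejection message suggests) are assumptions made for the example.

```java
import java.util.List;

public class ImapThrottlerHealth {
    public enum Status { HEALTHY, DEGRADED }

    // Hypothetical snapshot of one IMAP server's reactive throttler.
    public static final class ThrottlerSnapshot {
        final int queuedRequests;
        final int maxQueueSize;

        public ThrottlerSnapshot(int queuedRequests, int maxQueueSize) {
            this.queuedRequests = queuedRequests;
            this.maxQueueSize = maxQueueSize;
        }

        // Assumption: a throttler is "full" once its queue has reached its maximum size.
        boolean isFull() {
            return queuedRequests >= maxQueueSize;
        }
    }

    // No throttler full -> HEALTHY; at least one throttler full -> DEGRADED.
    public static Status check(List<ThrottlerSnapshot> throttlers) {
        boolean anyFull = throttlers.stream().anyMatch(ThrottlerSnapshot::isFull);
        return anyFull ? Status.DEGRADED : Status.HEALTHY;
    }
}
```

A real implementation would read the live counters from each IMAP server's throttler rather than from a snapshot object.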
FYI, k8s considers a pod unhealthy when the probe receives a response code >= 400.
James healthcheck response code:
200: All checks have answered with a Healthy or Degraded status. James services can still be used.
503: At least one check has answered with an Unhealthy status.
So degraded with a 200 code won't trigger a k8s pod restart.
Should we return unhealthy instead then? Not sure whether that is a bit harsh. Anyway, Docker and k8s liveness checks allow a number of failures (failureThreshold defaults to 3) before restarting, so some tolerance for genuinely high IMAP load is built in, which may make this acceptable.
We could add a flag as a query parameter to treat degraded as failed. E.g.
GET 127.0.0.1:8000/healthcheck/checks/ImapCheck?strict
would return a 503 response code when unhealthy or degraded, while
GET 127.0.0.1:8000/healthcheck/checks/ImapCheck
would return 503 when unhealthy and 200 when degraded.
We would need to implement GET 127.0.0.1:8000/healthcheck?strict too.
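To make the proposed semantics concrete, here is a rough Java sketch of the status-to-response-code mapping with and without the flag. `StrictHealthcheckStatus`, `isStrict` and `responseCode` are hypothetical names for illustration only; how the actual WebAdmin route would read the query parameter is not covered here.

```java
public class StrictHealthcheckStatus {
    public enum Status { HEALTHY, DEGRADED, UNHEALTHY }

    // True when the raw query string (the part after '?') carries a `strict` parameter,
    // with or without a value (e.g. "strict" or "strict=true").
    public static boolean isStrict(String rawQuery) {
        if (rawQuery == null) {
            return false;
        }
        for (String param : rawQuery.split("&")) {
            String name = param.split("=", 2)[0];
            if (name.equals("strict")) {
                return true;
            }
        }
        return false;
    }

    // Unhealthy always maps to 503; degraded maps to 503 only in strict mode.
    public static int responseCode(Status status, boolean strict) {
        switch (status) {
            case UNHEALTHY:
                return 503;
            case DEGRADED:
                return strict ? 503 : 200;
            default:
                return 200;
        }
    }
}
```

With this mapping, a liveness probe pointed at the `?strict` URL would count a degraded IMAP check as a failure, while existing callers of the plain URL would keep the current behaviour.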
Would this solve your concern, @quantranhong1999? We would get the best of both worlds...
Maybe this shall be a separate issue? Do you want to open it @quantranhong1999 ?
pr apache#2401