apache/celeborn

[BUG] Celeborn on K8s takes more time to complete

Closed · 1 comment

What is the bug (with logs or screenshots)?

I have deployed Celeborn on K8s. However, a job that completes in 2 hours without Celeborn takes 7 hours with it.

Celeborn worker logs

24/12/04 18:18:29,537 ERROR [fetch-server-11-50] FetchHandler: Sending ChunkFetchSuccess operation failed, chunk StreamChunkSlice[streamId=24095417247,chunkIndex=5,offset=0,len=2147483647]
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
    at sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:428)
    at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:493)
    at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:605)
    at io.netty.channel.DefaultFileRegion.transferTo(DefaultFileRegion.java:130)
    at org.apache.celeborn.common.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:119)
    at io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:369)
    at io.netty.channel.nio.AbstractNioByteChannel.doWriteInternal(AbstractNioByteChannel.java:238)
    at io.netty.channel.nio.AbstractNioByteChannel.doWrite0(AbstractNioByteChannel.java:212)
    at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:407)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:931)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.forceFlush(AbstractNioChannel.java:366)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:782)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:750)

24/12/05 04:48:14,935 ERROR [push-timeout-checker-1] PushDataHandler: PushData replication failed for partitionLocation: PartitionLocation[
  id-epoch:2361-0
  host-rpcPort-pushPort-fetchPort-replicatePort:10.189.190.60-35529-43703-35825-43867
  mode:PRIMARY
  peer:(host-rpcPort-pushPort-fetchPort-replicatePort:10.186.111.100-34325-35069-34817-34751)
  storage hint:StorageInfo{type=MEMORY, mountPoint='/spark-local2/data', finalResult=false, filePath=}
  mapIdBitMap:null]
org.apache.celeborn.common.exception.CelebornIOException: PUSH_DATA_TIMEOUT_REPLICA
    at org.apache.celeborn.common.network.client.TransportResponseHandler.failExpiredPushRequest(TransportResponseHandler.java:145)
    at org.apache.celeborn.common.network.client.TransportResponseHandler.lambda$new$0(TransportResponseHandler.java:113)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750