uber/RemoteShuffleService

Corrupted block detected during decompression

YutingWang98 opened this issue · 0 comments

Hi, we have recently been seeing zstd corruption errors during shuffle reads.

org.apache.spark.SparkException: Job aborted due to stage failure: Task 300 in stage 7.0 failed 4 times, most recent failure: Lost task 300.3 in stage 7.0 (TID 5866) (100.65.134.162 executor 200): com.github.luben.zstd.ZstdException: Corrupted block detected
	at com.github.luben.zstd.ZstdDecompressCtx.decompressByteArray(ZstdDecompressCtx.java:216)
	at com.github.luben.zstd.Zstd.decompressByteArray(Zstd.java:409)
	at org.apache.spark.shuffle.rss.BlockDownloaderPartitionRecordIterator.fetchNextDeserializationIterator(BlockDownloaderPartitionRecordIterator.scala:178)

It does not seem related to the input files, since the Spark job succeeded after a retry. Any ideas what might cause this, and whether it is related to the RSS client or server? Thanks!
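One way to narrow this down (a debugging sketch, not something from the report above) is to temporarily switch the Spark shuffle compression codec away from zstd and check whether the corruption still appears. `spark.io.compression.codec` is a standard Spark setting; the choice of `lz4` here is just the Spark default codec, used as a comparison point.

```
# Hypothetical spark-defaults.conf fragment for isolating the issue:
# if reads succeed consistently with lz4 but fail with zstd, the problem
# is more likely in the zstd (de)compression path than in the RSS transport.
spark.io.compression.codec    lz4
```

If the failure also occurs with lz4, that would point toward block corruption in transit or on the RSS server rather than a codec-specific issue.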