uber/RemoteShuffleService

what may cause RssInvalidServerVersionException?

Opened this issue · 2 comments

Hi, I am wondering:

Q1. Will RssInvalidServerVersionException occur when an RSS server (RSS-i) crashes for some reason and is immediately restarted by a shell script while some applications are still using it? The clients still store the former RSS-i version, but the version of the newly registered RSS-i has already changed.

Could the following exception also have the same cause?
org.apache.spark.shuffle.FetchFailedException: Detected server restart, current server: Server{rss04.xxx:12203, 1675897753258, rss04xxx:/data/}, previous server: Server{rss04.xxxx:12203, 1675895945858, rss04xxx:/data/}
	at org.apache.spark.shuffle.RssShuffleManager$$anon$2.resolveConnection(RssShuffleManager.scala:220)
	at com.uber.rss.clients.ServerConnectionCacheUpdateRefresher.refreshConnection(ServerConnectionCacheUpdateRefresher.java:49)
	at com.uber.rss.clients.ServerIdAwareSyncWriteClient.connectImpl(ServerIdAwareSyncWriteClient.java:133)
	at ...

Q2. What may cause this exception?

org.apache.spark.shuffle.FetchFailedException: Cannot fetch shuffle 0 partition 362 due to RssAggregateException (RssShuffleStageNotStartedException (Shuffle not started: DataBlockSocketReadClient 274 [/10.2xxx44973 -> /10.20xxx:12212 (1xxxx28)])
com.uber.rss.exceptions.RssShuffleStageNotStartedException: Shuffle not started: DataBlockSocketReadClient 274 [/10.2xxxx:44973 -> /10.2xxx12212 (10.xxxx)]
	at com.uber.rss.clients.ClientBase.checkOKResponseStatus(ClientBase.java:291)
	at com.uber.rss.clients.ClientBase.readResponseStatus(ClientBase.java:275)
	at ...

Q1
You are right. This happens because the server restarted after the client had initially connected to the earlier server instance. Ideally this should not be an issue. Maybe we can remove this check, @hiboyang?
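To make the restart check concrete, here is a minimal sketch of the kind of comparison that could produce "Detected server restart". It is not the actual RSS client code. The class and method names (`RestartCheckSketch`, `ServerInfo`, `isRestarted`) are hypothetical, the host name is a placeholder, and treating the middle number in `Server{...}` as a per-start version is an assumption based only on the message above.

```java
// Hypothetical sketch: these names are illustrative, not the RSS client's real API.
final class RestartCheckSketch {

    // Mirrors the three fields printed in the "Server{...}" part of the exception message.
    // Assumption: the middle number changes every time the server process starts.
    static final class ServerInfo {
        final String hostAndPort;
        final long version;
        final String serverId;

        ServerInfo(String hostAndPort, long version, String serverId) {
            this.hostAndPort = hostAndPort;
            this.version = version;
            this.serverId = serverId;
        }
    }

    // True when the same address now reports a different version than the cached one.
    static boolean isRestarted(ServerInfo previous, ServerInfo current) {
        return previous.hostAndPort.equals(current.hostAndPort)
                && previous.version != current.version;
    }

    public static void main(String[] args) {
        // Placeholder host; the two versions are the values from the exception above.
        ServerInfo previous = new ServerInfo("rss-host:12203", 1675895945858L, "rss-host:/data/");
        ServerInfo current = new ServerInfo("rss-host:12203", 1675897753258L, "rss-host:/data/");

        System.out.println(isRestarted(previous, current));  // prints true
        System.out.println(isRestarted(previous, previous)); // prints false
    }
}
```

Under these assumptions, a client that cached `previous` before the crash and then sees `current` at the same address would fail the check, which matches the scenario described in Q1.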

Q2
That basically means the server you are trying to connect to has not yet received the shuffle data for the corresponding partition (identified by appId, appAttemptId, shuffleId). Is this also happening when the server restarts?
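As a rough illustration of that condition, the sketch below tracks which shuffle stages a server has seen writes for, keyed by (appId, appAttemptId, shuffleId), and reports a stage as not started otherwise. All names here (`ShuffleStageRegistrySketch`, `StageKey`, `markStarted`, `isStarted`) and the sample IDs are hypothetical. Whether the real server keeps this state only in memory is not stated in this thread, so the "fresh registry after restart" part is an assumption.

```java
import java.util.Objects;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: not the RSS server's real bookkeeping.
final class ShuffleStageRegistrySketch {

    // Identifies one shuffle stage the way the reply above describes it.
    static final class StageKey {
        final String appId;
        final String appAttemptId;
        final int shuffleId;

        StageKey(String appId, String appAttemptId, int shuffleId) {
            this.appId = appId;
            this.appAttemptId = appAttemptId;
            this.shuffleId = shuffleId;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof StageKey)) {
                return false;
            }
            StageKey other = (StageKey) o;
            return shuffleId == other.shuffleId
                    && appId.equals(other.appId)
                    && appAttemptId.equals(other.appAttemptId);
        }

        @Override
        public int hashCode() {
            return Objects.hash(appId, appAttemptId, shuffleId);
        }
    }

    // Stages for which this server instance has received at least one write.
    private final Set<StageKey> startedStages = ConcurrentHashMap.newKeySet();

    void markStarted(StageKey key) {
        startedStages.add(key);
    }

    // A reader asking for a stage that is not in the set would get "Shuffle not started".
    boolean isStarted(StageKey key) {
        return startedStages.contains(key);
    }

    public static void main(String[] args) {
        StageKey key = new StageKey("app-1", "attempt-1", 0);

        ShuffleStageRegistrySketch server = new ShuffleStageRegistrySketch();
        server.markStarted(key);
        System.out.println(server.isStarted(key)); // prints true

        // Assumption: a restarted server begins with empty in-memory state.
        ShuffleStageRegistrySketch restarted = new ShuffleStageRegistrySketch();
        System.out.println(restarted.isStarted(key)); // prints false
    }
}
```

If the server does lose this state on restart, a reader that connects to the new instance would see the stage as not started even though data was written to the old one.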

Previously RSS did not handle server restarts well, which is why those checks were added. I feel we could remove them.