akkadotnet/Hyperion

Transport failure when running cluster for prolonged periods

bravo0x0 opened this issue · 7 comments

Seeing the following error when a cluster is running for a long period of time:

2021-06-04 19:39:00.079 +00:00 [ERR] AssociationError [akka.tcp://MyActorSystem@172.31.5.166:8093] -> akka.tcp://MyActorSystem@172.31.6.202:8092: Error [Failed to write message to the transport] []
2021-06-04 19:39:00.153 +00:00 [ERR] Failed to write message to the transport
Akka.Remote.EndpointException: Failed to write message to the transport
---> System.InvalidOperationException: Collection was modified; enumeration operation may not execute.
at System.Collections.Generic.Dictionary`2.Enumerator.MoveNext()
at Hyperion.SerializerFactories.DefaultDictionarySerializerFactory.<>c__DisplayClass3_0.<BuildSerializer>b__1(Stream stream, Object obj, SerializerSession session)
at Hyperion.Extensions.StreamEx.WriteObject(Stream stream, Object value, Type valueType, ValueSerializer valueSerializer, Boolean preserveObjectReferences, SerializerSession session)
at lambda_method(Closure , Stream , Object , SerializerSession )
at Akka.Serialization.HyperionSerializer.ToBinary(Object obj)
at Akka.Remote.MessageSerializer.Serialize(ExtendedActorSystem system, Address address, Object message)
at Akka.Remote.EndpointWriter.WriteSend(Send send)
--- End of inner exception stack trace ---
at Akka.Remote.EndpointWriter.PublishAndThrow(Exception reason, LogLevel level, Boolean needToThrow)
at Akka.Remote.EndpointWriter.WriteSend(Send send)
at Akka.Remote.EndpointWriter.<Writing>b__27_0(Send s)
at lambda_method(Closure , Object , Action`1 , Action`1 , Action`1 )
at Akka.Actor.ReceiveActor.ExecutePartialMessageHandler(Object message, PartialAction`1 partialAction)
at Akka.Actor.ActorCell.<>c__DisplayClass114_0.<Akka.Actor.IUntypedActorContext.Become>b__0(Object m)
at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
at Akka.Actor.ActorCell.ReceiveMessage(Object message)
at Akka.Actor.ActorCell.Invoke(Envelope envelope)

  • The cluster was started at 2021-06-04 ~8:25 AM UTC
  • At 2021-06-04 ~18:41 UTC, this error was first seen
  • The same error was seen ~8 to 10 times over a 48-hour run.
  • The cluster has one seed node (port 8091), one node (172.31.6.202:8092) in one role, and 4 nodes (172.31.5.166:8093, plus 3 others) in another role
  • Each of the xx.8093 nodes reported the above error at a different point in time.
  • The xx.8092 node didn't appear to have any significant changes in memory or CPU usage over the entire time.
    On the xx.8092 node, the following error was seen:
    ===> 2021-06-04 18:41:53.288 +00:00 [Error] Disassociated
    Akka.Remote.EndpointDisassociatedException: Disassociated
    at Akka.Remote.EndpointWriter.PublishAndThrow(Exception reason, LogLevel level, Boolean needToThrow)
    at Akka.Remote.EndpointWriter.Unhandled(Object message)
    at Akka.Actor.UntypedActor.Receive(Object message)
    at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
    at Akka.Actor.ActorCell.ReceiveMessage(Object message)
    at Akka.Actor.ActorCell.ReceivedTerminated(Terminated t)
    at Akka.Actor.ActorCell.AutoReceiveMessage(Envelope envelope)
    at Akka.Actor.ActorCell.Invoke(Envelope envelope)

Akka versions are as below and consistent across all nodes:
Akka: 1.4.14
Akka.Cluster: 1.4.14
Akka.Remote: 1.4.14
Akka.Serialization.Hyperion: 1.4.14
Akka.Logger.Serilog: 1.4.11

Dispatchers are fork-join dispatchers; the cluster configuration is as below and is consistent across all nodes:

remote
{
    command-ack-timeout = 60s
    handshake-timeout = 60s
    dot-netty.tcp
    {
        port = 8092
        connection-timeout = 60s
    }
}

cluster
{
    gossip-interval = 10s
    seed-node-timeout = 30s
    use-dispatcher = cluster-dispatcher

    failure-detector
    {
        threshold = 12
        heartbeat-interval = 20s
        acceptable-heartbeat-pause = 60s
        expected-response-after = 120s
    }
}

cluster-dispatcher
{
    type = "Dispatcher"
    executor = "fork-join-executor"
    fork-join-executor
    {
        parallelism-min = 8
        parallelism-max = 64
    }
}

Is this error transient, and are there settings to change to mitigate this?

Thanks.

This is a Hyperion error. Which version of it are you using?

Hi,

The versions are:
Akka: 1.4.14
Akka.Cluster: 1.4.14
Akka.Remote: 1.4.14
Akka.Serialization.Hyperion: 1.4.14
Akka.Logger.Serilog: 1.4.11
All nodes are running the same versions.
Thanks!

I thought this issue might be a duplicate of akkadotnet/akka.net#4218, which was resolved via #165 and released in Hyperion v0.9.15.... But the version you're using is v0.9.16 - and in the original akkadotnet/akka.net#4218 issue it's a List<T> being modified whereas you're using a Dictionary<TKey, TValue>.

I'll assume that some of the new handling we added for Dictionary support is what's barfing here. We'll see about getting this fixed.
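
For context, the exception itself is .NET's standard concurrent-modification guard: Hyperion enumerates the dictionary while serializing it on the remoting write path, and if anything mutates the dictionary mid-enumeration, the next MoveNext() throws. A minimal repro outside Akka (purely illustrative):

using System;
using System.Collections.Generic;

var dict = new Dictionary<int, string> { [1] = "a", [2] = "b" };

try
{
    foreach (var kvp in dict)
    {
        // Adding a key while an enumerator is live invalidates the enumerator.
        dict[99] = "added during enumeration";
    }
}
catch (InvalidOperationException ex)
{
    // "Collection was modified; enumeration operation may not execute."
    Console.WriteLine(ex.Message);
}

In the remoting case the mutation comes from application code touching the dictionary after the message has already been handed off for serialization, which is why it only shows up intermittently.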

Thanks. If I understand the thread you referenced for #4218 correctly, the cause could be a dictionary in my code that's modified after the call to send it. I will check for where that's happening, and will send a copy instead.
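
For reference, the change I have in mind looks roughly like this; the message and actor names below are placeholders, not the actual code:

using System.Collections.Generic;
using Akka.Actor;

// Hypothetical message type: carries its own snapshot of the counters.
public sealed class StatusUpdate
{
    public StatusUpdate(Dictionary<string, int> counters) => Counters = counters;
    public Dictionary<string, int> Counters { get; }
}

// Hypothetical sender actor.
public sealed class Publisher : ReceiveActor
{
    private readonly Dictionary<string, int> _counters = new Dictionary<string, int>();

    public Publisher(IActorRef target)
    {
        Receive<string>(key =>
        {
            _counters[key] = _counters.TryGetValue(key, out var n) ? n + 1 : 1;

            // Before: target.Tell(new StatusUpdate(_counters));
            // The live dictionary was shared with the message, so later updates
            // could race with Hyperion enumerating it during serialization.

            // After: snapshot at send time so the serialized object never changes.
            target.Tell(new StatusUpdate(new Dictionary<string, int>(_counters)));
        });
    }
}

An immutable message (e.g. ImmutableDictionary) would make the intent even clearer, but a defensive copy at the send site should be enough to remove the race.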

That would be very helpful @bravo0x0 - I should have mentioned that in the original issue.

Hi, I found one dictionary that I was sending without cloning. I changed it to make a copy before send; it has now been 12+ hours running, and no errors have come up. Shall I close this issue? Thanks!

Yep, go right ahead and close this issue if you don't see the error appear again.