asynkron/protoactor-dotnet

Proto.Remote 0.33.0 can't talk to 1.1.0

DenizPiri opened this issue · 4 comments

When we upgraded a portion of our servers to 1.1.0 from 0.33.0, we noticed that the two can't communicate with each other anymore.

I believe this is because of the request_id field newly added to the PID proto. It should be marked as optional, and IMO it shouldn't even be part of the proto, as it doesn't make sense to serialize it.

As pointed out by Üstün on the slack channel, protobuf3 fields are all optional. So I am not so confident now that it is related to that new field.

[Image: screenshot of the error message and call stack]

This is the call stack I get on the 0.33.0 side. Judging by the error message and stack trace, the failure is in the PID protobuf struct. I assume that when 0.33.0 receives the request_id field, it tries to add it to the UnknownFields set, and something goes wrong there that causes this error.

I managed to spot the problem.

I am very confident it is because of this change:
[Image: screenshot of the change]

This was a part of this commit: af14b11

I guess there won't be any easy way to fix this. The next version will have to break compatibility with either 1.1.0 or the older 0.33.0.

@rogeralsing Are there any plans to pay more attention to backwards compatibility, at least when it comes to Proto.Remote? It should be pretty easy as long as field type-id pairs never change. Because of this issue we are essentially stuck forever with 0.33.0.
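To illustrate the point about field type-id pairs, here is a minimal sketch of a protobuf-style decoder in Python. It is not Proto.Remote code and does not reproduce the actual change from the commit above. The field numbers, the `OLD_SCHEMA` table, and the choice to park a mismatched field among the unknown fields are all assumptions made for the example; a real parser may behave differently.

```python
# Minimal protobuf-style wire codec, for illustration only.
# Wire types handled: 0 = varint, 2 = length-delimited (strings, bytes, messages).

def encode_varint(value):
    """Encode a non-negative int as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(buf, pos):
    """Decode a varint starting at pos; return (value, new_pos)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def encode_field(number, wire_type, value):
    """Encode one field: tag = (field number << 3) | wire type."""
    tag = encode_varint((number << 3) | wire_type)
    if wire_type == 0:
        return tag + encode_varint(value)
    if wire_type == 2:
        return tag + encode_varint(len(value)) + value
    raise ValueError("unsupported wire type in this sketch")

# Hypothetical schema of the OLD reader: field number -> (name, wire type).
OLD_SCHEMA = {1: ("address", 2), 2: ("id", 2)}

def decode(buf, schema):
    """Return (known fields, unknown fields) as seen by a reader using schema."""
    known, unknown = {}, []
    pos = 0
    while pos < len(buf):
        key, pos = decode_varint(buf, pos)
        number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:
            value, pos = decode_varint(buf, pos)
        elif wire_type == 2:
            length, pos = decode_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError("unsupported wire type in this sketch")
        # Assumption: a field counts as known only if BOTH number and wire type match.
        if number in schema and schema[number][1] == wire_type:
            known[schema[number][0]] = value
        else:
            unknown.append((number, wire_type, value))
    return known, unknown

# Case 1: the new sender ADDS field 3; the old reader keeps working.
added = (encode_field(1, 2, b"host:8000") + encode_field(2, 2, b"actor1")
         + encode_field(3, 0, 42))
known_added, unknown_added = decode(added, OLD_SCHEMA)

# Case 2: the new sender CHANGES the type of field 2; the old reader loses "id".
changed = encode_field(1, 2, b"host:8000") + encode_field(2, 0, 7)
known_changed, unknown_changed = decode(changed, OLD_SCHEMA)
```

In the first case the old reader still sees `address` and `id` and simply sets the extra field aside. In the second case `id` is missing from the known fields, which is the kind of break that a stable (number, type) pairing avoids.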

Do you have any suggestions on how to describe this to the community in a more clear way?

I know this is not the answer you want, but this is my reasoning up to 1.0:
Do not expect compatibility between alpha/beta versions; the framework design was not set in stone and was still evolving.
Going from 0.x to 1.x is a major version bump and signals that the API is not compatible.
Between the start of the project and 1.0 there have been a lot of protocol changes, e.g. the structure of the remote protocol, request-id added to PID, changes in the cluster gossip protocol, etc.

Going forward, API and wire protocol should be compatible for the life span of 1.x.

> Because of this issue we are essentially stuck forever with 0.33.0.

Is this due to your deployment model? e.g. are there multiple micro-services that rely on Proto.Remote as the means of communication?

I had incorrectly assumed that Proto.Remote was already stable.
I think simply having a changelog that includes a "breaking changes" section would greatly help.

Yes, the issue is with our deployment model. We have a group of servers that communicate with another group of servers directly via Proto.Remote. If those two groups of servers can't communicate, the product doesn't work. We can't just restart the first group of servers, as it has stateful connections that last up to 30 minutes. So, we do a rolling deployment for those. While at 0.33.0, we could switch to another communication system between those servers, update them all to 1.1.0, and then switch back to Proto.Remote.
For the second group, we use Proto.Cluster, and we shut those down all at once and bring the new versions back up, so there is no need for rolling deploys there. However, I can imagine many businesses wanting to do rolling deploys for Proto.Cluster instances too.

Maybe making sure that sequential minor versions are compatible with each other when it comes to Proto.Remote would be nice. Endpoints could exchange their version codes upon connection and then send the struct that the receiving end understands. I guess the same guarantee could be extended to Proto.Cluster too with a bit more effort.
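A rough sketch of what such a version exchange could look like, in Python for brevity. Nothing here is an existing Proto.Remote API: `Endpoint`, `negotiate`, `serialize_pid`, the version numbers, and the per-version field lists are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical wire-format versions and the PID fields each one understands.
FIELDS_BY_VERSION = {
    1: ("address", "id"),
    2: ("address", "id", "request_id"),
}

@dataclass
class Endpoint:
    name: str
    min_version: int  # oldest wire format this endpoint can still speak
    max_version: int  # newest wire format this endpoint can speak

    def supported_versions(self):
        return set(range(self.min_version, self.max_version + 1))

def negotiate(local, remote):
    """Pick the newest wire-format version both endpoints support."""
    common = local.supported_versions() & remote.supported_versions()
    if not common:
        raise ConnectionError(
            f"{local.name} and {remote.name} share no wire-format version"
        )
    return max(common)

def serialize_pid(pid, version):
    """Keep only the fields that the negotiated version defines."""
    return {name: pid[name] for name in FIELDS_BY_VERSION[version] if name in pid}

# A newer node that still speaks the previous format talks to an older node.
old_node = Endpoint("old", min_version=1, max_version=1)
new_node = Endpoint("new", min_version=1, max_version=2)

version = negotiate(new_node, old_node)
pid = {"address": "host:8000", "id": "actor1", "request_id": 5}
payload = serialize_pid(pid, version)
```

Here the newer endpoint drops `request_id` when talking to the older one. Keeping `min_version` one step behind `max_version` would be enough to support rolling deploys between sequential versions.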

Feel free to close this issue, I don't think there is an easy and immediate solution to our problem.