OrleansContrib/Orleans.Clustering.Kubernetes

Different Applications colliding

jonathansant opened this issue · 15 comments

I've followed your samples and have set up an application on our kube cluster using the following [default] settings:

silo:

return siloHost
	.ConfigureEndpoints(
		new Random(1).Next(30001, 30100),
		new Random(1).Next(20001, 20100),
		listenOnAnyHostAddress: true
	)
	.UseKubeMembership(opt =>
	{
		opt.CanCreateResources = true;
		//opt.DropResourcesOnInit = true;
	});

client:

clientBuilder.UseKubeGatewayListProvider();

This works fine. However, when I added the second application, things started getting weird. The apps started behaving differently with every request, sometimes working, sometimes throwing:

Unexpected: Cannot find an implementation class for grain interface 1039391738

and sometimes throwing: Cannot find an implementation class for grain interface: Odin.Gaming.Contracts.Cache.IGameCacheWorkerGrain

Each app comprises 4 pods running a Web API client and 8 pods for the silos. I suspect that the apps are somehow conflicting.

Is this some misconfiguration on my part?
Thanks.

Hey,

Yes, there is a misconfiguration/misconception about how it works. To run multiple clusters under the same Kubernetes namespace (in this case, I see you are using the default one), you need to configure the ServiceId in Orleans (via ClusterOptions on the cluster builder) to a unique value per cluster (in this case, per app). That will make sure you can share the same membership storage (in this case, Kubernetes objects) among multiple services (your apps).

You can also create a new namespace and deploy your new application there. That will give you full isolation.
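The namespace-isolation route can be sketched like this (the namespace and manifest names below are illustrative, not from this thread):

```shell
# Create a dedicated namespace for the second application
kubectl create namespace app-two

# Deploy that app's silos into it; the membership objects (silo,
# clusterversion) will then live under that namespace, fully isolated
# from the first app's objects. silo-deployment.yaml is a hypothetical
# manifest name for your silo Deployment.
kubectl apply -n app-two -f silo-deployment.yaml
```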

Please let me know if that solved your problem.

builder.Configure<ClusterOptions>(options =>
  options.ServiceId = options.ClusterId = "my-service");

Apply this to both the ClientBuilder and the SiloHostBuilder.
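A slightly fuller sketch of what that looks like on both sides, assuming illustrative names (the important part is that silo and client use identical values):

```csharp
// Each application gets its own ServiceId; each deployment/version of
// that application gets its own ClusterId. Values are examples only.
siloHost.Configure<ClusterOptions>(options =>
{
    options.ServiceId = "app-one";            // unique per application
    options.ClusterId = "app-one-1.25.4-dev"; // unique per deployment
});

// The client must use the exact same pair to find its cluster:
clientBuilder.Configure<ClusterOptions>(options =>
{
    options.ServiceId = "app-one";
    options.ClusterId = "app-one-1.25.4-dev";
});
```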

Thanks for your reply. However, we are setting both ClusterId (app version + environment) and ServiceId (app name).

Is it possible that the ports are a factor, since we're setting the same port range for all the apps?

I checked what is being written to the clusterversions and silos resources and noticed that only the ClusterId is being used. We can change our ClusterId to include the app name as well. However, since we have different app versions, the ClusterId is already unique per cluster, so the clusters should not conflict, yet the issue still persists.

kubectl get silo -n demo
NAME                            AGE
100.96.10.155-11112-262789991   23m
100.96.10.156-11112-262789992   23m
100.96.10.157-11111-262789997   23m
100.96.10.158-11111-262789996   23m
100.96.11.180-11112-262789990   23m
100.96.11.181-11112-262789992   23m
100.96.11.182-11111-262789996   23m
100.96.11.183-11111-262789997   23m
100.96.7.121-11112-262789990    23m
100.96.7.122-11111-262789996    23m
100.96.8.94-11112-262789990     23m
100.96.8.95-11111-262789996     23m
100.96.9.145-11112-262789990    23m
100.96.9.146-11112-262789991    23m
100.96.9.147-11111-262789996    23m
100.96.9.148-11111-262789997    23m

kubectl get clusterversions -n demo
NAME              AGE
1.25.4-dev-demo   23m
1.28.5-dev-demo   23m

That is really weird... It is supposed to work with multiple cluster Ids... I'll investigate...

Just FYI, I managed to work around the issue by setting the Group Name to the full name of the app, i.e. appName-version-environment. Although I still think that service discovery should work based on the Cluster and Service Ids.

@galvesribeiro any updates on this? I also noticed that when I scaled my app from 8 silos to 1, the client couldn't connect to the cluster anymore, and I saw logs showing the remaining silo trying to ping the other silos to no avail. It seems like the other silos did not mark themselves dead in the membership table.

Is kubectl get silo -n demo enough to get membership information? It seems like there isn't enough detail there (silo status, etc.). If not, do you know how to get this information using kubectl or otherwise? Thanks.

@jonathansant I'm very sorry for my delay on this. I've been OOO recently and not watching GH much, for personal reasons.

I'll dig into this over the weekend and report back to you my findings.

Is it enough to do kubectl get silo -n demo to get membership information because it seems like there isn't enough information (like silo status, etc...)?

The membership objects are stored in Kubernetes etcd, so that data should be available through kubectl as well. You probably just need to specify which columns to show. I'll get a sample once I have it working.

@jonathansant this just came to mind... You can try kubectl get silo -n yournamespace -o=json. I think that will give you the whole object structure, IIRC. You can use yaml instead if you prefer.
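Putting that together, something like the following should surface the membership details (the custom-columns field paths are an assumption about where the silo entry fields live in the CRD; verify them against the -o json output first):

```shell
# Dump the full object structure (YAML works too, with -o yaml)
kubectl get silo -n demo -o json

# Once you know the field paths, project the interesting columns.
# .spec.status and .spec.address are assumed paths, not confirmed here:
kubectl get silo -n demo \
  -o custom-columns=NAME:.metadata.name,STATUS:.spec.status,ADDRESS:.spec.address
```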

Will dig into it. Thanks.

Just a heads up... I'm going to integrate the new Kube API client we added to Kubeleans into this package ASAP. Sorry, I've been REALLY busy these days.

Will get it up ASAP, and hopefully it will sort out your problem, among others, in that PR.

Thanks for the patience. Will keep you posted here.

@galvesribeiro Just to check on the progress of the next release? :)

Hello @jonathansant

Don't lose faith :) I just got back from an emergency trip. I'll get to it by the weekend.

Thanks for the patience.

@jonathansant I talked with @galvesribeiro yesterday about possible causes here, and since you have also worked your way through several configuration options, at this point it is hard to see the "real" problem.

From the membership provider side we don't see an issue; we're compatible with existing membership providers in terms of what we store and manage about an Orleans cluster.

From a deployment perspective, I think the problems you hit when you deployed multiple applications would repro with an SQL membership provider too, so this is perhaps not a bug but a misconfigured deployment (based on the available info).

When you hit the "grain implementation not found" error, that's for sure a mixup of silos on the same subnet, with clients/silos somehow connecting to incompatible ones (carrying a different set of grain DLLs).

To be able to resolve this (and to see clearly WHAT the issue is), could you please provide repro configs and minimal code for the several issues you were hitting?

Thanks!

@jonathansant Hey, can you please provide the repro that @attilah asked for?

Thanks!

I've released 1.0.19, which makes the client consider the ClusterId. Please try it out and see if your conflicts persist.
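For anyone following along, picking up the new release is the standard package update (project path is illustrative):

```shell
# Update the membership provider package to the release mentioned above
dotnet add MySiloProject.csproj package Orleans.Clustering.Kubernetes --version 1.0.19
```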

I'm closing the issue as there were no activities in a while.

Please feel free to reply if you have further problems, and we can re-open if necessary.

Thanks