KubeClient logs errors for non-error scenarios
pmsanford opened this issue · 6 comments
GetCustomObject logs an error if the requested object doesn't exist. When a silo starts up and attempts to insert a new row for itself, InsertRow checks whether the silo object already exists by calling GetCustomObject. The "happy path" for a new silo spinning up is that this custom object doesn't exist yet, so every new silo that spins up will log an error.
On further inspection, it looks like the GetCustomObject case is a bug (see PR), but this is a problem with all the KubeClient methods. For example, when a silo starts up, it attempts to initialize the membership table. If it's not the first silo ever to start in the Kube cluster, the creation of those objects will fail with AlreadyExists. The end result is ~4 errors logged every time a new silo starts up normally in an existing, healthy cluster.
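For illustration, here is a minimal sketch, assuming a thin wrapper over the Kubernetes REST API via HttpClient (the class and method names are hypothetical, not the provider's actual KubeClient surface), of how the two responses described above, 404 NotFound on the lookup and 409 Conflict/AlreadyExists on the create, could be treated as normal outcomes of the startup path instead of being logged as errors:

```csharp
// Hypothetical sketch, not the provider's actual KubeClient: distinguish the
// "expected" Kubernetes responses on silo startup from genuine failures.
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public static class CustomObjectClient
{
    // Returns the object's JSON, or null when the API server answers 404 NotFound,
    // which is the normal case for a brand-new silo and shouldn't be logged as an error.
    public static async Task<string> GetCustomObjectOrNullAsync(HttpClient http, string uri)
    {
        var response = await http.GetAsync(uri);
        if (response.StatusCode == HttpStatusCode.NotFound)
            return null;

        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();
    }

    // Returns true when the object was created, false on 409 Conflict (AlreadyExists),
    // which just means another silo already initialized the shared objects.
    public static async Task<bool> TryCreateObjectAsync(HttpClient http, string uri, HttpContent body)
    {
        var response = await http.PostAsync(uri, body);
        if (response.StatusCode == HttpStatusCode.Conflict)
            return false;

        response.EnsureSuccessStatusCode();
        return true;
    }
}
```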
Thanks for the issue and the PR @pmsanford.
However, I would like to clarify why it was made that way.
Kubernetes API objects are stored in the internal etcd service. That means it takes a while for a created object to become available.
At the time I wrote this package, it was meant to be an initial, fast and easy way to get Orleans running on Kubernetes without the need for external storage for the membership (MBR) provider.
Kubernetes has a way to monitor a resource as it is created and becomes available. The create-object Task completes right after the call is made. However, that doesn't mean the object is available for use right away, because it hasn't been committed to the etcd backend yet.
With that said, the proper way to do it is to use a watch API call that waits for a particular resource to be available before trying to do anything with it.
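As a rough sketch of what such a watch could look like, assuming direct access to the Kubernetes REST API through an HttpClient already configured with the API server address and credentials (the class name is hypothetical, and the apiextensions.k8s.io group version may differ depending on the cluster): open a watch stream on the CRD and only return once the API server reports its Established condition as True.

```csharp
// Hypothetical sketch of a watch call: block until a CRD is actually usable
// instead of assuming it is available as soon as the create call returns.
using System.IO;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

public static class CrdWatcher
{
    public static async Task WaitUntilEstablishedAsync(HttpClient http, string crdName)
    {
        // watch=true keeps the response open and streams one JSON event per line.
        var uri = "/apis/apiextensions.k8s.io/v1/customresourcedefinitions" +
                  $"?watch=true&fieldSelector=metadata.name={crdName}";

        using var response = await http.GetAsync(uri, HttpCompletionOption.ResponseHeadersRead);
        response.EnsureSuccessStatusCode();

        using var reader = new StreamReader(await response.Content.ReadAsStreamAsync());
        string line;
        while ((line = await reader.ReadLineAsync()) != null)
        {
            if (string.IsNullOrWhiteSpace(line))
                continue;

            using var evt = JsonDocument.Parse(line);
            var obj = evt.RootElement.GetProperty("object");
            if (!obj.TryGetProperty("status", out var status) ||
                !status.TryGetProperty("conditions", out var conditions))
                continue;

            foreach (var condition in conditions.EnumerateArray())
            {
                // The CRD is safe to use once its "Established" condition is True.
                if (condition.GetProperty("type").GetString() == "Established" &&
                    condition.GetProperty("status").GetString() == "True")
                    return;
            }
        }
    }
}
```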
The way it is today, we will eventually hit a race condition when two silos spin up at the same time and the CRDs haven't been deployed yet.
The reason I left it the way it is, is that in 99% of cases people pre-deploy the CRD objects before starting a silo, so this corner-case race condition would never happen.
In another project I'm working on (http://kubeleans.io/), @attilah is working on a new API client which has support for watch operations, so those kinds of issues would never happen.
Before we merge the PR, I would like to ask @attilah to drop his 2 cents on it.
Thanks for the detailed response. So does that mean this project isn't intended for production use? Or that the new API client would be merged back down to this project once finished?
> So does that mean this project isn't intended for production use?
No, it means:
1. I assumed that you would not give full CRD write access to the runtime principal running Orleans. Meaning that some administrator would deploy the CRDs from the .yaml files before starting the first silo.
2. Assuming 1, the auto-create feature is meant for unit tests and local testing only.
> Or that the new API client would be merged back down to this project once finished?
3. Assuming 1 and 2, and considering what I explained about watch operations: to support the uncommon scenario where people do give full write access to the runtime principal running Orleans, we would allow the CRDs to be created at runtime by properly using watch APIs (a rough sketch of that flow follows below).
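To make item 3 concrete, here is a rough illustration of that runtime flow, reusing the hypothetical CustomObjectClient and CrdWatcher helpers sketched earlier in this thread (an assumption-laden sketch, not the provider's planned implementation): try to create the CRD, treat AlreadyExists as another silo having won the race, and only proceed once the watch reports the CRD as Established.

```csharp
// Sketch only, reusing the hypothetical helpers from the earlier comments.
using System.Net.Http;
using System.Threading.Tasks;

public static class RuntimeCrdBootstrap
{
    public static async Task EnsureCrdAsync(HttpClient http, string crdName, HttpContent crdDefinition)
    {
        var crdListUri = "/apis/apiextensions.k8s.io/v1/customresourcedefinitions";

        // A 409 AlreadyExists here just means another silo created the CRD first.
        await CustomObjectClient.TryCreateObjectAsync(http, crdListUri, crdDefinition);

        // Don't touch the new CRD until the API server reports it Established.
        await CrdWatcher.WaitUntilEstablishedAsync(http, crdName);
    }
}
```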
I'll wait a bit more (on both projects) so @attilah can find some spare time to finish the watch support. If that takes much longer, I'll do it myself.
In other words, if you want to use it in production now, as I mentioned: first deploy the CRDs from the .yaml files, and then start your first silo.
If I just blindly merge your PR, that will make the initialization sequence accept a NotFound response, which is undesirable.
Right, sorry if that came off as offensive or pushy, just making sure I'm using the right library (thought maybe you were pointing me to Kubeleans to use instead). Thanks again for explaining.
Nah, no worries, no offense taken 😄
Kubeleans has a different purpose: running Orleans grains as serverless functions. The reason I linked it is just that it uses the same Kubernetes API with Orleans on top of it. We noticed we need the watch support; it is in the works, and then I will port it back to this provider so people can try it out.
Will update you soon thru this issue when we get there.
#11 was merged
Thanks!