porch: porch-server crash with "concurrent map iteration and map write"
johnbelamaric opened this issue · 14 comments
Expected behavior
No crashing.
Actual behavior
porch-server crashed with "concurrent map iteration and map write"; in the crash output the fatal error line was interleaved with a klog message, appearing as "fatal error: I0512 17:06:29.730904 1 watch.go:238] sending watch event {MODIFIED 0xc000524d20}".
Information
This is during some testing of PackageVariantSet / PackageVariant. In that testing there were a number of problems (incorrect CRDs loaded, missing permissions, etc.) that caused heavy churn of resource creation and deletion and exercised the error-handling code paths heavily.
Logs are attached and also available at https://cloudlogging.app.goo.gl/q4nwxkeYcfeWtMnn6 - let me know if you need permissions; I think some folks already have access to that project.
This is using a dev build of Porch, though there are no changes in Porch Server compared to v0.0.17, only in the controllers.
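For context, the Go runtime aborts the whole process when it detects a map being iterated in one goroutine while another goroutine writes to it, which is why porch-server crashes outright rather than just logging an error. Below is a minimal, self-contained sketch of that failure mode; it is not Porch code, and the labels map is just a stand-in for something like ObjectMeta.Labels shared between goroutines.

package main

import "time"

func main() {
    // Stand-in for a shared ObjectMeta.Labels map.
    labels := map[string]string{"app": "porch"}

    // Writer goroutine: mutates the map with no synchronization.
    go func() {
        for {
            labels["revision"] = "v1"
        }
    }()

    // Reader: ranging over the same map eventually trips the runtime's
    // concurrency check and terminates the process with
    // "fatal error: concurrent map iteration and map write".
    for {
        for k, v := range labels {
            _, _ = k, v
        }
        time.Sleep(time.Millisecond)
    }
}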
I will try to take a look at it. I addressed some concurrency issues in #3900, but clearly there are some remaining issues.
Hi, I have the same issue, but in my scenario it doesn't happen periodically; it only happens once in a while. I am not sure whether it has anything to do with any sysctl limits, but I can provide them if needed.
Here are the porch-server logs.
To reproduce:
I have encountered this problem in Nephio, which uses porch-server.
I have an experimental build that I am trying out. If you also want to try it, you can:
k -n porch-system edit deployment porch-server
change the deployment to image: nephio/porch-server:jbelamaric
and also set imagePullPolicy: Always
so that if I push another one, we just have to restart the deployment.
Perfect, I am trying it now. I will share the output after observing it for a couple of hours.
I am still seeing the same crash in my environment.
So, what I am seeing in the latest crashes is that the API server code is serving up a resource, presumably directly from the cache, since there is no Porch-specific code in the call stack; it is all API machinery code. It is trying to render the ObjectMeta.Labels field and detecting a change in that map during iteration, right here:
This is in my experimental build, which is based on newer libraries than the released code. So I am looking at the other running goroutines to see which of them may have touched metadata.Labels.
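For reference, the usual way to remove this class of race in an aggregated API server is to never let callers mutate an object that the cache (and the watch/serialization path) still holds: hand out deep copies instead. The snippet below is only a sketch of that pattern, not the actual Porch change; the objectCache type and its methods are hypothetical.

package cache

import (
    "sync"

    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// objectCache is a hypothetical cache of shared objects.
type objectCache struct {
    mu    sync.RWMutex
    items map[string]*unstructured.Unstructured
}

// Get returns a deep copy, so a caller that adds or removes labels works on
// its own map and cannot race with a handler that is still iterating the
// cached object's ObjectMeta.Labels during serialization.
func (c *objectCache) Get(key string) *unstructured.Unstructured {
    c.mu.RLock()
    defer c.mu.RUnlock()
    if obj, ok := c.items[key]; ok {
        return obj.DeepCopy()
    }
    return nil
}

// Put stores its own deep copy as well, so later mutation of the caller's
// object cannot reach the copy held by the cache.
func (c *objectCache) Put(key string, obj *unstructured.Unstructured) {
    c.mu.Lock()
    defer c.mu.Unlock()
    if c.items == nil {
        c.items = map[string]*unstructured.Unstructured{}
    }
    c.items[key] = obj.DeepCopy()
}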
OK, a new experimental build is out. If you want to try it, then k -n porch-system rollout restart deployment porch-server
should be sufficient. I have it running in my environment now; I'll let it run for a while and see whether I found the problem.
I don't see the same behavior so far; it has been running for 3 hours now. I think we have different kinds of workloads in our clusters, so it would probably be good to have a similar environment to compare.
Okay, now I have a crash as well. I will try the new image.
So far running 8 hours, no crashes 🤞
ubuntu@nephio-r1-e2e:~/test-infra/e2e/tests$ k -n porch-system get po
NAME                                READY   STATUS    RESTARTS   AGE
function-runner-7c5766c7d6-262hr    1/1     Running   0          9d
function-runner-7c5766c7d6-vw6d8    1/1     Running   0          9d
porch-controllers-c694cd64d-566p7   1/1     Running   0          9d
porch-server-589c488f-q8xj5         1/1     Running   0          8h
ubuntu@nephio-r1-e2e:~/test-infra/e2e/tests$
And a second instance, which ran through the entire Nephio e2e suite, has also been running with no crashes for the last 2+ hours:
ubuntu@nephio-experimental:~$ k get po -n porch-system
NAME                                 READY   STATUS    RESTARTS   AGE
function-runner-7b94ccf78-8dqvd      1/1     Running   0          136m
function-runner-7b94ccf78-s5f6k      1/1     Running   0          136m
porch-controllers-84df9d8f4b-b2n94   1/1     Running   0          136m
porch-server-597dc9cc88-gpx2x        1/1     Running   0          136m
ubuntu@nephio-experimental:~$
I have been running it for 2 days and 11 hours, and now I see a crash, but the logs are a bit different this time.
kubectl get pods -n porch-system
NAME                                READY   STATUS    RESTARTS      AGE
function-runner-7c5766c7d6-jw2dh    1/1     Running   0             2d15h
function-runner-7c5766c7d6-q7n8z    1/1     Running   0             2d15h
porch-controllers-c694cd64d-9896k   1/1     Running   0             2d15h
porch-server-5776796488-kxwwv       1/1     Running   1 (38m ago)   2d11h
I see this in the logs:
E0814 05:14:35.311801 1 apf_controller.go:411] "Unable to derive new concurrency limits" err="impossible: ran out of bounds to consider in bound-constrained problem" plNames=[workload-low catch-all global-default leader-election node-high system workload-high] items=[{target:NaN lowerBound:41 upperBound:845} {target:13 lowerBound:13 upperBound:613} {target:24 lowerBound:24 upperBound:649} {target:25 lowerBound:25 upperBound:625} {target:73 lowerBound:73 upperBound:698} {target:50 lowerBound:50 upperBound:674} {target:49 lowerBound:49 upperBound:698}]
E0814 05:14:45.315947 1 apf_controller.go:411] "Unable to derive new concurrency limits" err="impossible: ran out of bounds to consider in bound-constrained problem" plNames=[system workload-high workload-low catch-all global-default leader-election node-high] items=[{target:50 lowerBound:50 upperBound:674} {target:49 lowerBound:49 upperBound:698} {target:NaN lowerBound:41 upperBound:845} {target:13 lowerBound:13 upperBound:613} {target:24 lowerBound:24 upperBound:649} {target:25 lowerBound:25 upperBound:625} {target:73 lowerBound:73 upperBound:698}]
That is an issue unrelated to the original crash. In my two instances I see the same thing: after about 2.5 days there have been no crashes of the original sort, but each instance logged one of these errors. It seems to be related to some sort of timeout.
Can you create a new issue for that and post the log on it? I will add my logs to it afterward.