kptdev/kpt

porch: porch-server crash with "concurrent map iteration and map write"

johnbelamaric opened this issue · 14 comments

Expected behavior

No crashing.

Actual behavior

porch-server crashed with "fatal error: concurrent map iteration and map write". In the captured output the fatal error is interleaved with a klog line: "I0512 17:06:29.730904 1 watch.go:238] sending watch event {MODIFIED 0xc000524d20}".
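
For context, "concurrent map iteration and map write" is the error the Go runtime throws, unrecoverably, when one goroutine ranges over a map while another goroutine writes to it. A minimal, hypothetical reproduction (not Porch code):

package main

// Hypothetical sketch: one goroutine writes to a map while another iterates
// over it. The runtime detects this and aborts the whole process with
// "fatal error: concurrent map iteration and map write".
func main() {
	labels := map[string]string{"app": "porch"}

	go func() {
		for {
			labels["revision"] = "v1" // unsynchronized write
		}
	}()

	for {
		for range labels { // unsynchronized iteration
		}
	}
}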

Information

This happened during testing of PackageVariantSet / PackageVariant. That testing hit a number of errors (incorrect CRDs loaded, missing permissions, etc.), which caused heavy churn of resource creation and deletion and exercised the error-handling code paths a lot.

Logs attached, also available at https://cloudlogging.app.goo.gl/q4nwxkeYcfeWtMnn6 - let me know if you need permissions; I think some folks already have access to that project.

This is using a dev build of Porch, though there are no changes in Porch Server compared to v0.0.17, only in the controllers.

downloaded-logs-20230512-115730.json.gz

Seems to be happening periodically:

[two screenshots of log charts showing the crash recurring over time]

I will try to take a look at it. I addressed some concurrency issues in #3900, but clearly there are some remaining issues.
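
For anyone following along, a typical remediation for this class of race (a sketch only, not necessarily what #3900 changed) is to guard the shared map with a mutex and hand out copies rather than the internal map itself:

package cache // illustrative package name

import "sync"

// labelStore is a hypothetical example of guarding a shared map: writes take
// the write lock, and readers get a copy instead of the internal map.
type labelStore struct {
	mu     sync.RWMutex
	labels map[string]string
}

func (s *labelStore) Set(key, value string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.labels == nil {
		s.labels = map[string]string{}
	}
	s.labels[key] = value
}

// Snapshot returns a copy, so callers can iterate without holding the lock.
func (s *labelStore) Snapshot() map[string]string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	out := make(map[string]string, len(s.labels))
	for k, v := range s.labels {
		out[k] = v
	}
	return out
}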

Hi, I have the same issue, but in my scenario it doesn't happen periodically; it happens only once in a while. I am not sure whether it is related to any sysctl limits, but I can provide them if needed.

Here are the porch-server logs.

To reproduce:

I have encountered this problem in Nephio, which uses porch-server.

I have an experimental build that I am trying out. If you want to try it as well, you can:

k -n porch-system edit deployment porch-server

Change the deployment to use image: nephio/porch-server:jbelamaric and also set imagePullPolicy: Always, so that if I push another image we only have to restart the deployment.

Perfect, I am trying it now. I will share the output after observing it for a couple of hours.

I am still seeing the same crash in my environment.

So, what I am seeing in the latest crashes is that the API server code is serving up a resource, presumably directly from the cache, since there is no Porch-specific code in the call stack; it is all API machinery. It is trying to render the ObjectMeta.Labels field and detecting a change in that map during iteration, right here:

https://github.com/kubernetes/apimachinery/blob/dd46389197a396927dadb194ae5fc9dbebeb5a31/pkg/apis/meta/v1/generated.pb.go#L2715

This is in my experimental build, which is based on newer libraries than the released code. So I am looking at the other running goroutines to see which of them may have touched metadata.Labels.
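
If that is what is happening, the usual pattern is to make sure nothing mutates an object the cache may still be handing to the API machinery for serialization, i.e. deep-copy before touching ObjectMeta.Labels. A hedged sketch (updateLabels and the label key are hypothetical, not Porch code):

package example

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// updateLabels shows the safe pattern: work on a deep copy so the cached
// object is never written to while another goroutine marshals it.
func updateLabels(cached *unstructured.Unstructured) *unstructured.Unstructured {
	obj := cached.DeepCopy() // the cached object stays untouched
	labels := obj.GetLabels()
	if labels == nil {
		labels = map[string]string{}
	}
	labels["example.com/processed"] = "true" // hypothetical label
	obj.SetLabels(labels)                    // write back onto the private copy only
	return obj
}

The same idea applies to typed objects: mutate a DeepCopy (or at least a copy of the Labels map) rather than the instance that came out of a lister or watch cache.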

OK, a new experimental build is out; if you want to try it, k -n porch-system rollout restart deployment porch-server should be sufficient. I have it running in my environment now; I'll let it run for a while and see if I have found the problem.

I don't see the same behavior so far; it has been running for 3 hours now. I think we have different kinds of workloads in our clusters, so it would probably be good to have a similar environment to compare.

Okay, I have a crash now as well. I will try the new image.

So far running 8 hours, no crashes 🤞

ubuntu@nephio-r1-e2e:~/test-infra/e2e/tests$ k -n porch-system get po
NAME                                READY   STATUS    RESTARTS   AGE
function-runner-7c5766c7d6-262hr    1/1     Running   0          9d
function-runner-7c5766c7d6-vw6d8    1/1     Running   0          9d
porch-controllers-c694cd64d-566p7   1/1     Running   0          9d
porch-server-589c488f-q8xj5         1/1     Running   0          8h
ubuntu@nephio-r1-e2e:~/test-infra/e2e/tests$ 

And a second instance, which ran through the entire Nephio e2e suite, has also been running with no crashes for the last 2+ hours:

ubuntu@nephio-experimental:~$ k get po -n porch-system
NAME                                 READY   STATUS    RESTARTS   AGE
function-runner-7b94ccf78-8dqvd      1/1     Running   0          136m
function-runner-7b94ccf78-s5f6k      1/1     Running   0          136m
porch-controllers-84df9d8f4b-b2n94   1/1     Running   0          136m
porch-server-597dc9cc88-gpx2x        1/1     Running   0          136m
ubuntu@nephio-experimental:~$ 

I have been running it for 2 days and 11 hours, and now I see a crash, but the logs are a bit different this time.

kubectl get pods -n porch-system 
NAME                                READY   STATUS    RESTARTS      AGE
function-runner-7c5766c7d6-jw2dh    1/1     Running   0             2d15h
function-runner-7c5766c7d6-q7n8z    1/1     Running   0             2d15h
porch-controllers-c694cd64d-9896k   1/1     Running   0             2d15h
porch-server-5776796488-kxwwv       1/1     Running   1 (38m ago)   2d11h

I see this in the logs:

E0814 05:14:35.311801       1 apf_controller.go:411] "Unable to derive new concurrency limits" err="impossible: ran out of bounds to consider in bound-constrained problem" plNames=[workload-low catch-all global-default leader-election node-high system workload-high] items=[{target:NaN lowerBound:41 upperBound:845} {target:13 lowerBound:13 upperBound:613} {target:24 lowerBound:24 upperBound:649} {target:25 lowerBound:25 upperBound:625} {target:73 lowerBound:73 upperBound:698} {target:50 lowerBound:50 upperBound:674} {target:49 lowerBound:49 upperBound:698}]
E0814 05:14:45.315947       1 apf_controller.go:411] "Unable to derive new concurrency limits" err="impossible: ran out of bounds to consider in bound-constrained problem" plNames=[system workload-high workload-low catch-all global-default leader-election node-high] items=[{target:50 lowerBound:50 upperBound:674} {target:49 lowerBound:49 upperBound:698} {target:NaN lowerBound:41 upperBound:845} {target:13 lowerBound:13 upperBound:613} {target:24 lowerBound:24 upperBound:649} {target:25 lowerBound:25 upperBound:625} {target:73 lowerBound:73 upperBound:698}]

That is an issue unrelated to the original crash. In my two instances I see the same thing: no crashes of the original sort after about 2.5 days, but one of these errors in each. It seems to be related to some sort of timeout.

Can you create a new issue for that and post the log on it? I will add my logs to it afterward.