palantir/k8s-spark-scheduler

pods go in Pending state intermittently, scheduler restart solves the issue

hunny-garg opened this issue · 1 comments

We are facing an issue in our environment where Spark pods intermittently get stuck in the Pending state. Restarting the Spark scheduler pods resolves it.
We are seeing the errors below in the spark-scheduler-extender logs, but we are not sure whether they are related to the issue.
Any pointers that would explain this behaviour are appreciated.

k8s version: v1.23
spark-scheduler version: v0.58.0

"stacktrace": "error when looking for already bound reservations\nfailed to get resource reservations podName:agg-spark-350zvn28en0u-b29f74875b02ba23-exec-1, podNamespace:prod01\n\ngithub.com/palantir/k8s-spark-scheduler/internal/extender.(*ResourceReservationManager).FindAlreadyBoundReservationNode\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/internal/extender/resourcereservations.go:141\ngithub.com/palantir/k8s-spark-scheduler/internal/extender.(*SparkSchedulerExtender).selectExecutorNode\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/internal/extender/resource.go:382\ngithub.com/palantir/k8s-spark-scheduler/internal/extender.(*SparkSchedulerExtender).selectNode\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/internal/extender/resource.go:210\ngithub.com/palantir/k8s-spark-scheduler/internal/extender.(*SparkSchedulerExtender).Predicate\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/internal/extender/resource.go:151\ngithub.com/palantir/k8s-spark-scheduler/cmd.registerExtenderEndpoints.func1\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/cmd/endpoints.go:36\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2109\ngithub.com/palantir/witchcraft-go-server/wrouter.(*rootRouter).Register.func1.1\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/wrouter/router_root.go:136\ngithub.com/palantir/witchcraft-go-server/witchcraft/internal/middleware.NewRouteLogTraceSpan.func1\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/witchcraft/internal/middleware/route.go:107\ngithub.com/palantir/witchcraft-go-server/wrouter.(*routeRequestHandlerWithNext).HandleRequest\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/wrouter/router_root.go:150\ngithub.com/palantir/witchcraft-go-server/witchcraft/internal/middleware.NewRouteRequestLog.func1\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/witchcraft/internal/middleware/route.go:32\ngithub.com/palantir/witchcraft-go-server/wrouter.(*routeRequestHandlerWithNext).HandleRequest\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/wrouter/router_root.go:150\ngithub.com/palantir/witchcraft-go-server/witchcraft/internal/middleware.NewRequestMetricRequestMeter.func1\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/witchcraft/internal/middleware/request.go:168\ngithub.com/palantir/witchcraft-go-server/wrouter.(*routeRequestHandlerWithNext).HandleRequest\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/wrouter/router_root.go:150\ngithub.com/palantir/witchcraft-go-server/wrouter.(*rootRouter).Register.func1\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/wrouter/router_root.go:139\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2109\ngithub.com/julienschmidt/httprouter.(*Router).Handler.func1\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/julienschmidt/httprouter/router.go:275\ngithub.com/julienschmidt/httprouter.(*Router).ServeHTTP\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/julienschmidt/httprouter/router.go:387\ngithub.com/palantir/witchcraft-go-server/wrouter/whttprouter.(*router).ServeHTTP\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/wrouter/whttprouter/routerimpl.go:71\ngithub.com/palantir/witchcraft-go-server/witchcraft/internal/middleware.NewRequestExtractIDs.func1\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/witchcraft/internal/middleware/request.go:139\ngithub.com/palantir/witchcraft-go-server/wrouter.(*requestHandlerWithNext).ServeHTTP\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/wrouter/router_root.go:250\ngithub.com/palantir/witchcraft-go-server/witchcraft/internal/middleware.NewRequestContextLoggers.func1\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/witchcraft/internal/middleware/request.go:73\ngithub.com/palantir/witchcraft-go-server/wrouter.(*requestHandlerWithNext).ServeHTTP\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/wrouter/router_root.go:250\ngithub.com/palantir/witchcraft-go-server/witchcraft/internal/middleware.NewRequestContextMetricsRegistry.func1\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/witchcraft/internal/middleware/request.go:84\ngithub.com/palantir/witchcraft-go-server/wrouter.(*requestHandlerWithNext).ServeHTTP\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/wrouter/router_root.go:250\ngithub.com/palantir/witchcraft-go-server/witchcraft/internal/middleware.NewRequestPanicRecovery.func1.1\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/witchcraft/internal/middleware/request.go:42\ngithub.com/palantir/witchcraft-go-server/witchcraft/internal/negroni.(*Recovery).ServeHTTP\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/witchcraft/internal/negroni/recovery.go:193\ngithub.com/palantir/witchcraft-go-server/witchcraft/internal/middleware.NewRequestPanicRecovery.func1\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/witchcraft/internal/middleware/request.go:41\ngithub.com/palantir/witchcraft-go-server/wrouter.(*requestHandlerWithNext).ServeHTTP\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/wrouter/router_root.go:250\ngithub.com/palantir/witchcraft-go-server/wrouter.(*rootRouter).ServeHTTP\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/wrouter/router_root.go:103\nnet/http.serverHandler.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2947\nnet/http.initALPNRequest.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:3556\nnet/http.(*http2serverConn).runHandler\n\t/usr/local/go/src/net/http/h2_bundle.go:5910",

We also see the warning below in the spark-scheduler-extender container logs when this issue starts occurring.

{"type":"service.1","time":"2023-04-08T02:39:45.830415574Z","level":"WARN","origin":"github.com/palantir/k8s-spark-scheduler","message":"found unexplained cache size difference","params":{"rrs":0,"rrsCached":109}}
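
To pin down when the cache starts drifting, the warning above can be filtered out of the extender logs. The following is only a minimal sketch: it assumes nothing beyond the JSON fields visible in that log line (`message`, `params.rrs`, `params.rrsCached`), and the names `parseCacheWarning`, `logLine`, and the `drift` label are hypothetical helpers, not part of k8s-spark-scheduler.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
)

// logLine mirrors only the fields seen in the warning quoted above.
// Pointers let us tell "field missing" apart from a real value of 0.
type logLine struct {
	Message string `json:"message"`
	Params  struct {
		RRs       *int `json:"rrs"`
		RRsCached *int `json:"rrsCached"`
	} `json:"params"`
}

const cacheWarning = "found unexplained cache size difference"

// parseCacheWarning returns both counts when line is the cache-size warning.
// ok is false for any other line, including lines that are not valid JSON.
func parseCacheWarning(line string) (rrs, cached int, ok bool) {
	var l logLine
	if err := json.Unmarshal([]byte(line), &l); err != nil {
		return 0, 0, false
	}
	if l.Message != cacheWarning || l.Params.RRs == nil || l.Params.RRsCached == nil {
		return 0, 0, false
	}
	return *l.Params.RRs, *l.Params.RRsCached, true
}

func main() {
	sc := bufio.NewScanner(os.Stdin)
	// Stack-trace lines are long, so raise the scanner's per-line limit to 1 MiB.
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024)
	for sc.Scan() {
		if rrs, cached, ok := parseCacheWarning(sc.Text()); ok {
			fmt.Printf("rrs=%d rrsCached=%d drift=%d\n", rrs, cached, cached-rrs)
		}
	}
	if err := sc.Err(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

It could be fed from the container logs, for example `kubectl logs <scheduler-pod> -c spark-scheduler-extender | go run main.go`, where `<scheduler-pod>` is a placeholder for the actual pod name.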