kubeflow/spark-operator

[BUG] Spark Operator: "Lock identity is empty" when running in HA mode (replicaCount > 1)

tankim opened this issue · 3 comments

Description

Please provide a clear and concise description of the issue you are encountering, and a reproduction of your configuration.

If your request is for a new feature, please use the Feature request template.

  • [v] ✋ I have searched the open/closed issues and my issue is not listed.

Reproduction Code [Required]

Steps to reproduce the behavior:

  • Just set replicaCount to a value higher than 1, as in the values sketch below:
replicaCount: 2
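
For reference, a minimal values.yaml sketch of the configuration that triggers the crash. The leaderElection keys below are inferred from the flags visible in the log output (leader-election-lock-name / leader-election-lock-namespace); verify the exact key names against the values.yaml of the chart version in use:

# values.yaml (sketch)
replicaCount: 2                              # more than one replica -> leader election is required

leaderElection:
  lockName: spark-operator-lock              # name of the lock resource the replicas contend for
  lockNamespace: dataplatform-common-dev     # namespace in which the lock is created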

Expected behavior

  • The additional operator pod launches and both replicas run normally

Actual behavior

  • Both operator pods crash with the following error

Terminal Output Screenshot(s)

+ uidentry=root:x:0:0:root:/root:/bin/bash
+ set -e
+ echo 0
+ echo 0
+ echo root:x:0:0:root:/root:/bin/bash
0
0
root:x:0:0:root:/root:/bin/bash
+ [[ -z root:x:0:0:root:/root:/bin/bash ]]
+ exec /usr/bin/tini -s -- /usr/bin/spark-operator -v=4 -logtostderr -namespace= -enable-ui-service=true -ingress-url-format= -controller-threads=600 -resync-interval=30 -enable-batch-scheduler=false -label-selector-filter= -enable-metrics=true -metrics-labels=app_type -metrics-port=10254 -metrics-endpoint=/metrics -metrics-prefix= -enable-webhook=true -webhook-svc-namespace=dataplatform-common-dev -webhook-port=8080 -webhook-timeout=30 -webhook-svc-name=spark-operator-webhook -webhook-config-name=spark-operator-webhook-config -webhook-namespace-selector=spark-webhook-enabled=true -enable-resource-quota-enforcement=false -leader-election=true -leader-election-lock-namespace=dataplatform-common-dev -leader-election-lock-name=spark-operator-lock
F0615 02:58:37.044201      10 main.go:146] Lock identity is empty

goroutine 1 [running]:
github.com/golang/glog.Fatal(...)
	/go/pkg/mod/github.com/golang/glog@v1.2.1/glog.go:664
main.main()
	/workspace/main.go:146 +0x1418

SIGABRT: abort
PC=0x40708e m=2 sigcode=18446744073709551610

goroutine 1 gp=0xc0000061c0 m=2 mp=0xc000092808 [running, locked to thread]:
runtime/internal/syscall.Syscall6()
	/usr/local/go/src/runtime/internal/syscall/asm_linux_amd64.s:36 +0xe fp=0xc0004cba88 sp=0xc0004cba80 pc=0x40708e
syscall.RawSyscall6(0xc00034e038?, 0xc0006a0120?, 0xc00060c060?, 0x2be5440?, 0x548220?, 0x2be54d8?, 0xc0004cbaf0?)
	/usr/local/go/src/runtime/internal/syscall/syscall_linux.go:38 +0xd fp=0xc0004cbad0 sp=0xc0004cba88 pc=0x40706d
syscall.RawSyscall(0x2be54d8?, 0x0?, 0xc0004cbb70?, 0xc0004cbb50?)
	/usr/local/go/src/syscall/syscall_linux.go:62 +0x15 fp=0xc0004cbb18 sp=0xc0004cbad0 pc=0x48a8f5
syscall.Tgkill(0xba?, 0x0?, 0x0?)
	/usr/local/go/src/syscall/zsyscall_linux_amd64.go:894 +0x25 fp=0xc0004cbb48 sp=0xc0004cbb18 pc=0x488aa5
github.com/golang/glog.abortProcess()
	/go/pkg/mod/github.com/golang/glog@v1.2.1/glog_file_linux.go:35 +0x87 fp=0xc0004cbb90 sp=0xc0004cbb48 pc=0x548387
github.com/golang/glog.ctxfatalf({0x0?, 0x0?}, 0xc000280110?, {0x1b8f1eb?, 0x411d65?}, {0xc000280110?, 0x185ca80?, 0xc000328201?})
	/go/pkg/mod/github.com/golang/glog@v1.2.1/glog.go:647 +0x6a fp=0xc0004cbbf8 sp=0xc0004cbb90 pc=0x54606a
github.com/golang/glog.fatalf(...)
	/go/pkg/mod/github.com/golang/glog@v1.2.1/glog.go:657
github.com/golang/glog.FatalDepth(0x1, {0xc000280110, 0x1, 0x1})
	/go/pkg/mod/github.com/golang/glog@v1.2.1/glog.go:670 +0x57 fp=0xc0004cbc48 sp=0xc0004cbbf8 pc=0x5461f7
github.com/golang/glog.Fatal(...)
	/go/pkg/mod/github.com/golang/glog@v1.2.1/glog.go:664
main.main()
	/workspace/main.go:146 +0x1418 fp=0xc0004cbf50 sp=0xc0004cbc48 pc=0x172f418
runtime.main()
	/usr/local/go/src/runtime/proc.go:271 +0x29d fp=0xc0004cbfe0 sp=0xc0004cbf50 pc=0x4404fd
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc0004cbfe8 sp=0xc0004cbfe0 pc=0x473721

Environment & Versions

  • Spark Operator App version: v1beta2-1.4.6-3.5.0
  • Helm Chart Version: 1.2.15
  • Kubernetes Version: 1.28
  • Apache Spark version: 3.5.0

Additional context

Honestly, I don't see a need to run multiple replicas for HA purposes. The Kubernetes Deployment controller essentially provides the HA feature out of the box.

I fixed this by upgrading to Helm chart version 1.4.0.
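
For anyone hitting this on older chart versions: the "Lock identity is empty" message is raised by client-go's leader-election setup, which refuses to create an elector whose resource lock has an empty Identity string. The operator normally uses its own pod name as that identity, injected into the container via the downward API. Below is a minimal sketch of that injection, assuming the binary reads the identity from a POD_NAME environment variable (the exact variable name is an assumption here; check the deployment template of your chart version):

# deployment spec snippet (sketch) -- container env
env:
  - name: POD_NAME                 # assumed variable name; used as the leader-election lock identity
    valueFrom:
      fieldRef:
        fieldPath: metadata.name   # downward API: injects the pod's own name

If the deployment template does not set this value while -leader-election=true is enabled, every replica starts with an empty identity and fails exactly as in the log above, which is consistent with the chart upgrade resolving the issue.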

> Honestly, I don't see a need to run multiple replicas for HA purposes. The Kubernetes Deployment controller essentially provides the HA feature out of the box.

In our current workload, tens to hundreds of Spark applications are triggered simultaneously, and this number may grow into the thousands. If the operator pod becomes unstable under that load, we believe an HA setup is necessary to ensure stable operation (aiming for zero downtime). The exact requirements may vary depending on the specific issues we are facing.