Operator has invalid memory address error on specific pytorchjob spec
ca-scribner opened this issue · 1 comment
ca-scribner commented
When running the following YAML,
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: my-pytorchjob
  namespace: my-namespace
spec:
  activeDeadlineSeconds: -1
  cleanPodPolicy: Running
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - args:
                - --backend
                - gloo
              image: pytorch-dist-mnist # (from examples folder)
              name: pytorch
          # imagePullSecrets:
          #   - name: image-pull-secret
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - args:
                - --backend
                - gloo
              image: pytorch-dist-mnist # (from examples folder)
              name: pytorch
          # imagePullSecrets:
          #   - name: image-pull-secret
  ttlSecondsAfterFinished: -1
I encounter an invalid memory address / nil pointer dereference that puts the operator into an infinite crash loop:
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc001c09c70, 0x3b9aca00, 0x0, 0x1, 0xc0000c2180)
/go/pkg/mod/k8s.io/apimachinery@v0.15.10-beta.0/pkg/util/wait/wait.go:153 +0xf8
k8s.io/apimachinery/pkg/util/wait.Until(0xc001c09c70, 0x3b9aca00, 0xc0000c2180)
/go/pkg/mod/k8s.io/apimachinery@v0.15.10-beta.0/pkg/util/wait/wait.go:88 +0x4d
created by github.com/kubeflow/pytorch-operator/pkg/controller.v1/pytorch.(*PyTorchController).Run
/go/src/github.com/kubeflow/pytorch-operator/pkg/controller.v1/pytorch/controller.go:202 +0x2c4
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x1275e83]
goroutine 210 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
/go/pkg/mod/k8s.io/apimachinery@v0.15.10-beta.0/pkg/util/runtime/runtime.go:58 +0x105
panic(0x13f3ea0, 0x2213c70)
/usr/local/go/src/runtime/panic.go:679 +0x1b2
github.com/kubeflow/pytorch-operator/pkg/controller.v1/pytorch.(*PyTorchController).cleanupPyTorchJob(0xc000149040, 0xc00028fc80, 0x0, 0x0)
/go/src/github.com/kubeflow/pytorch-operator/pkg/controller.v1/pytorch/job.go:194 +0x73
github.com/kubeflow/pytorch-operator/pkg/controller.v1/pytorch.(*PyTorchController).reconcilePyTorchJobs(0xc000149040, 0xc00028fc80, 0xc00028fc80, 0xc00014a210)
/go/src/github.com/kubeflow/pytorch-operator/pkg/controller.v1/pytorch/controller.go:434 +0x1265
github.com/kubeflow/pytorch-operator/pkg/controller.v1/pytorch.(*PyTorchController).syncPyTorchJob(0xc000149040, 0xc00014a200, 0x39, 0x0, 0x0, 0x0)
/go/src/github.com/kubeflow/pytorch-operator/pkg/controller.v1/pytorch/controller.go:324 +0x4a2
github.com/kubeflow/pytorch-operator/pkg/controller.v1/pytorch.(*PyTorchController).processNextWorkItem(0xc000149040, 0x0)
/go/src/github.com/kubeflow/pytorch-operator/pkg/controller.v1/pytorch/controller.go:262 +0x55f
github.com/kubeflow/pytorch-operator/pkg/controller.v1/pytorch.(*PyTorchController).runWorker(0xc000149040)
/go/src/github.com/kubeflow/pytorch-operator/pkg/controller.v1/pytorch/controller.go:216 +0x2b
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc001c09c70)
/go/pkg/mod/k8s.io/apimachinery@v0.15.10-beta.0/pkg/util/wait/wait.go:152 +0x5e
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc001c09c70, 0x3b9aca00, 0x0, 0x1, 0xc0000c2180)
/go/pkg/mod/k8s.io/apimachinery@v0.15.10-beta.0/pkg/util/wait/wait.go:153 +0xf8
k8s.io/apimachinery/pkg/util/wait.Until(0xc001c09c70, 0x3b9aca00, 0xc0000c2180)
/go/pkg/mod/k8s.io/apimachinery@v0.15.10-beta.0/pkg/util/wait/wait.go:88 +0x4d
created by github.com/kubeflow/pytorch-operator/pkg/controller.v1/pytorch.(*PyTorchController).Run
/go/src/github.com/kubeflow/pytorch-operator/pkg/controller.v1/pytorch/controller.go:202 +0x2c4
As far as I can tell, this only happens if I include all three of ttlSecondsAfterFinished: -1, activeDeadlineSeconds: -1, and cleanPodPolicy: Running. I'm not sure whether the -1 values are valid inputs, but either way I was surprised that they crashed the operator rather than being rejected at spec validation.
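From the stack trace, the crash originates in cleanupPyTorchJob (job.go:194), which looks like TTL-based cleanup logic dereferencing a pointer that is nil while the job has not finished. Below is a minimal, self-contained Go sketch of that failure pattern; it is not the operator's actual code, and the jobSpec/jobStatus types and field names are assumptions chosen to mirror the spec above.

package main

import (
	"fmt"
	"time"
)

// Illustrative stand-ins for the PyTorchJob spec/status fields involved in TTL
// cleanup. These are assumptions for the sketch, not the operator's real types.
type jobSpec struct {
	TTLSecondsAfterFinished *int32 // set to -1 in the spec above, so non-nil
}

type jobStatus struct {
	CompletionTime *time.Time // nil while the job has not finished
}

// cleanupAfterTTL sketches the unsafe pattern suggested by the stack trace:
// the TTL pointer is checked, but the completion time is dereferenced without
// a nil check, which panics for a job that is still running.
func cleanupAfterTTL(spec jobSpec, status jobStatus) bool {
	if spec.TTLSecondsAfterFinished == nil {
		return false
	}
	ttl := time.Duration(*spec.TTLSecondsAfterFinished) * time.Second
	// BUG: status.CompletionTime can still be nil here, so calling Add on it
	// triggers "invalid memory address or nil pointer dereference".
	expiry := status.CompletionTime.Add(ttl)
	return time.Now().After(expiry)
}

func main() {
	defer func() {
		if r := recover(); r != nil {
			// Reproduces the same class of error the operator logs.
			fmt.Println("panic:", r)
		}
	}()

	ttl := int32(-1)
	cleanupAfterTTL(
		jobSpec{TTLSecondsAfterFinished: &ttl},
		jobStatus{CompletionTime: nil}, // job still running, no completion time yet
	)
}

If that is indeed what is happening, a nil check on the completion time (or rejecting the -1 values at admission) would presumably avoid the crash loop.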
gaocegege commented
I think it is related to kubeflow/training-operator#1223