kubedl-io/kubedl

[BUG] sharedMPIJob's kind and apiVersion are empty

Closed this issue · 4 comments

What happened:

When I run an mpijob using the latest kubedl docker image, I get the following error log:

time="2021-09-22T14:47:37Z" level=info msg="Reconciling for job mpi-demo"
time="2021-09-22T14:47:37Z" level=info msg="gang schedule enabled, start to syncing for job training-job/mpi-demo"
time="2021-09-22T14:47:37Z" level=error msg="failed to create gang schedule entity, gang scheduler: volcano, err: PodGroup.scheduling.volcano.sh \"mpi-demo\" is invalid: [metadata.ownerReferences.apiVersion: Invalid value: \"\": version must not be empty, metadata.ownerReferences.kind: Invalid value: \"\": kind must not be empty]"
2021-09-22T14:47:37.736Z        ERROR   mpi-controller  mpi job reconcile failed        {"error": "PodGroup.scheduling.volcano.sh \"mpi-demo\" is invalid: [metadata.ownerReferences.apiVersion: Invalid value: \"\": version must not be empty, metadata.ownerReferences.kind: Invalid value: \"\": kind must not be empty]"}
github.com/go-logr/zapr.(*zapLogger).Error
        /go/pkg/mod/github.com/go-logr/zapr@v0.4.1-0.20210423233217-9f3e0b1ce51b/zapr.go:132
github.com/alibaba/kubedl/controllers/mpi.(*MPIJobReconciler).Reconcile
        /workspace/controllers/mpi/mpijob_controller.go:156
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.5/pkg/internal/controller/controller.go:244
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.5/pkg/internal/controller/controller.go:218
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.5/pkg/internal/controller/controller.go:197
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
        /go/pkg/mod/k8s.io/apimachinery@v0.20.7/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
        /go/pkg/mod/k8s.io/apimachinery@v0.20.7/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
        /go/pkg/mod/k8s.io/apimachinery@v0.20.7/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
        /go/pkg/mod/k8s.io/apimachinery@v0.20.7/pkg/util/wait/wait.go:90

I debugged and found that apiVersion and kind are empty here, which leads to the error log above:

apiVersion := accessor.GetAPIVersion()
kind := accessor.GetKind()
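
For illustration only (the helper name and surrounding code here are mine, not the actual kubedl source), this is roughly how those two values end up in the PodGroup's ownerReference; when the job's TypeMeta is empty, the apiserver rejects the PodGroup exactly as in the log above:

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Illustrative sketch: build the controller ownerReference for the PodGroup
// from the job's TypeMeta/ObjectMeta. When the job was fetched with an empty
// TypeMeta, APIVersion and Kind are "" and PodGroup creation fails validation.
func ownerRefFromJob(typeMeta metav1.TypeMeta, objMeta metav1.ObjectMeta) metav1.OwnerReference {
	isController := true
	return metav1.OwnerReference{
		APIVersion: typeMeta.APIVersion, // "" when TypeMeta was stripped
		Kind:       typeMeta.Kind,       // "" when TypeMeta was stripped
		Name:       objMeta.Name,
		UID:        objMeta.UID,
		Controller: &isController,
	}
}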

I finally found that the root cause of the empty apiVersion and kind is here: the sharedMPIJob fetched by the latest commit does not have the kind and apiVersion values set. This code was changed compared to v0.3.0, where the fetched sharedMPIJob did have the kind and apiVersion values. If I change this line back to the v0.3.0 version, the problem goes away.

I want to know why this code was changed, and what its function is.

(a) latest

err = util.GetObjectByPassCache(r.Client, req.NamespacedName, sharedMPIJob)

(b) v0.3.0

err = r.Get(context.Background(), req.NamespacedName, sharedMPIJob)

What you expected to happen:
No error log.

How to reproduce it:
Use the latest kubedl docker image, enable gang scheduling (volcano), and run an mpijob.

Anything else we need to know?:
I only looked at mpijob, but the other workloads' operators use the same code as here, so they should have the same problem.

err = util.GetObjectByPassCache(r.Client, req.NamespacedName, sharedMPIJob)

Environment:

  • KubeDL version:
  • Kubernetes version (use kubectl version):
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

@HeGaoYuan hi GaoYuan, GetObjectByPassCache extracts the ClientReader from the client interface; the ClientReader reads directly from the apiserver, bypassing the local informer cache, to avoid getting stale data. Here is the issue we recorded: #107
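
For context, a minimal sketch of the bypass-cache idea (the function name and signature here are illustrative, not necessarily kubedl's exact implementation):

import (
	"context"

	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Sketch: read via a non-cached reader (e.g. mgr.GetAPIReader()) so the Get
// goes straight to the apiserver instead of the local informer cache,
// avoiding stale objects at the cost of an extra apiserver round trip.
func getObjectBypassCache(ctx context.Context, apiReader client.Reader, key client.ObjectKey, obj runtime.Object) error {
	return apiReader.Get(ctx, key, obj)
}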

Anyway, it reproduced in our cluster. May we inject TypeMeta in the defaulter functions as a preventive solution?

func SetDefaults_MPIJob(mpiJob *MPIJob) {
	if mpiJob.Kind == "" {
		mpiJob.Kind = Kind
	}
	if mpiJob.APIVersion == "" {
		mpiJob.APIVersion = SchemeGroupVersion.String()
	}
	// ...
}
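
For example (sketch only; the exact call site and error handling are illustrative), the defaulter could be applied right after the bypass-cache read, so everything downstream, including the PodGroup ownerReference, sees a populated TypeMeta:

err = util.GetObjectByPassCache(r.Client, req.NamespacedName, sharedMPIJob)
if err != nil {
	return reconcile.Result{}, err
}
// Re-apply defaults because a direct apiserver read returns the object
// without TypeMeta populated.
SetDefaults_MPIJob(sharedMPIJob)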

Yes, I am also trying to use a method like this to solve the problem. But I found that the missing kind and apiVersion when using GetObjectByPassCache is not the only problem; there are others, for example an error when calling jc.Controller.UpdateJobStatusInApiServer(job, &jobStatus):

MPIJob.training.kubedl.io "mpi-demo" is invalid: status.conditions: Invalid value: "null": status.conditions in body must be of type array: "null"

It may take us some time to fix this problem.

It occurs because status.conditions is a required field; the error recovers after the first worker is running. However, it could be made an omitempty field.
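
A minimal sketch of that change, assuming the status type looks roughly like kubedl's shared JobStatus (field and type names here are illustrative):

// Adding `omitempty` lets the first status update be serialized without a
// conditions array, so the CRD schema no longer sees "null" for the field.
type JobStatus struct {
	// Conditions is an array of the currently observed job conditions.
	// +optional
	Conditions []JobCondition `json:"conditions,omitempty"`

	// ... other status fields ...
}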