[BUG] sharedMPIJob's kind and apiVersion are empty
Closed this issue · 4 comments
What happened:
When I run an MPIJob using the latest KubeDL Docker image, I get the following error log:
time="2021-09-22T14:47:37Z" level=info msg="Reconciling for job mpi-demo"
time="2021-09-22T14:47:37Z" level=info msg="gang schedule enabled, start to syncing for job training-job/mpi-demo"
time="2021-09-22T14:47:37Z" level=error msg="failed to create gang schedule entity, gang scheduler: volcano, err: PodGroup.scheduling.volcano.sh \"mpi-demo\" is invalid: [metadata.ownerReferences.apiVersion: Invalid value: \"\": version must not be empty, metadata.ownerReferences.kind: Invalid value: \"\": kind must not be empty]"
2021-09-22T14:47:37.736Z ERROR mpi-controller mpi job reconcile failed {"error": "PodGroup.scheduling.volcano.sh \"mpi-demo\" is invalid: [metadata.ownerReferences.apiVersion: Invalid value: \"\": version must not be empty, metadata.ownerReferences.kind: Invalid value: \"\": kind must not be empty]"}
github.com/go-logr/zapr.(*zapLogger).Error
/go/pkg/mod/github.com/go-logr/zapr@v0.4.1-0.20210423233217-9f3e0b1ce51b/zapr.go:132
github.com/alibaba/kubedl/controllers/mpi.(*MPIJobReconciler).Reconcile
/workspace/controllers/mpi/mpijob_controller.go:156
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.5/pkg/internal/controller/controller.go:244
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.5/pkg/internal/controller/controller.go:218
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.5/pkg/internal/controller/controller.go:197
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
/go/pkg/mod/k8s.io/apimachinery@v0.20.7/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
/go/pkg/mod/k8s.io/apimachinery@v0.20.7/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
/go/pkg/mod/k8s.io/apimachinery@v0.20.7/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
/go/pkg/mod/k8s.io/apimachinery@v0.20.7/pkg/util/wait/wait.go:90
I debugged and found that apiVersion and kind are empty here, which leads to this error log:
kubedl/pkg/gang_schedule/volcano_scheduler/scheduler.go
Lines 58 to 59 in 8e231e8
And I finally found the root cause of the empty apiVersion and kind: the sharedMPIJob in the latest commit does not carry the kind and apiVersion values. This code was changed compared with v0.3.0, where sharedMPIJob did have kind and apiVersion set. If I revert this code to the v0.3.0 version, the problem disappears.
I would like to know why this code was changed and what its function is.
(a) latest
kubedl/controllers/mpi/mpijob_controller.go
Line 114 in 8e231e8
(b) v0.3.0
kubedl/controllers/mpi/mpijob_controller.go
Line 111 in d77aef3
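To illustrate why the missing TypeMeta surfaces as the ownerReferences validation error above: the owner reference is built by copying Kind and APIVersion straight from the owner object, so if a client Get stripped TypeMeta, both fields end up empty. The sketch below uses minimal stand-in structs (the real types live in k8s.io/apimachinery, and the apiVersion string is illustrative), mimicking what metav1.NewControllerRef does:

```go
package main

import "fmt"

// Minimal stand-ins for the Kubernetes API types involved; the real ones
// live in k8s.io/apimachinery. This is an illustrative sketch, not KubeDL's code.
type TypeMeta struct {
	Kind       string
	APIVersion string
}

type OwnerReference struct {
	APIVersion string
	Kind       string
	Name       string
}

// newControllerRef mirrors what metav1.NewControllerRef does: it copies the
// owner's Kind and APIVersion verbatim into the OwnerReference. If the owner
// was fetched by a client that strips TypeMeta, both fields are empty and the
// apiserver rejects the PodGroup's metadata.ownerReferences.
func newControllerRef(meta TypeMeta, name string) OwnerReference {
	return OwnerReference{APIVersion: meta.APIVersion, Kind: meta.Kind, Name: name}
}

func main() {
	// Object as returned by a Get that drops TypeMeta: both fields blank.
	fetched := TypeMeta{}
	ref := newControllerRef(fetched, "mpi-demo")
	fmt.Printf("empty owner ref: apiVersion=%q kind=%q\n", ref.APIVersion, ref.Kind)

	// After defaulting restores TypeMeta (apiVersion shown is illustrative):
	defaulted := TypeMeta{Kind: "MPIJob", APIVersion: "training.kubedl.io/v1alpha1"}
	ref = newControllerRef(defaulted, "mpi-demo")
	fmt.Printf("defaulted owner ref: apiVersion=%q kind=%q\n", ref.APIVersion, ref.Kind)
}
```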
What you expected to happen:
No error log.
How to reproduce it:
Run an MPIJob using the latest KubeDL Docker image with gang scheduling (Volcano) enabled.
Anything else we need to know?:
I focused only on MPIJob, but the other workload operators use the same code as here, so they should have the same problem.
kubedl/controllers/mpi/mpijob_controller.go
Line 114 in 8e231e8
Environment:
- KubeDL version:
- Kubernetes version (use kubectl version):
- OS (e.g: cat /etc/os-release):
- Kernel (e.g. uname -a):
- Install tools:
- Others:
@HeGaoYuan hi GaoYuan, GetObjectByPassCache extracts a ClientReader from the client interface. Since ClientReader reads directly from the apiserver, bypassing the local informer cache, it avoids getting stale data. Here is the issue we recorded: #107
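The cache-bypass rationale can be sketched abstractly: an informer-style cache may lag behind the authoritative store, so a reader that goes straight to the store never observes stale data. A minimal, non-Kubernetes sketch of the idea (all names here are illustrative, not controller-runtime's API):

```go
package main

import "fmt"

// store is the authoritative source of truth (standing in for the apiserver).
type store struct{ data map[string]string }

// cachedReader serves reads from a local snapshot (standing in for the
// informer cache); the snapshot may lag behind the store.
type cachedReader struct{ snapshot map[string]string }

// directReader bypasses the snapshot and reads from the store itself, which
// is what a ClientReader-style client does against the apiserver.
type directReader struct{ src *store }

func (c cachedReader) get(k string) string { return c.snapshot[k] }
func (d directReader) get(k string) string { return d.src.data[k] }

func main() {
	s := &store{data: map[string]string{"mpi-demo": "v1"}}
	cache := cachedReader{snapshot: map[string]string{"mpi-demo": "v1"}}
	direct := directReader{src: s}

	// The object is updated in the store, but the cache has not synced yet.
	s.data["mpi-demo"] = "v2"

	fmt.Println("cached read:", cache.get("mpi-demo")) // stale: v1
	fmt.Println("direct read:", direct.get("mpi-demo")) // fresh: v2
}
```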
Anyway, it reproduced in our cluster as well. May we inject TypeMeta in the defaulter functions as a preventive fix?
func SetDefaults_MPIJob(mpiJob *MPIJob) {
if mpiJob.Kind == "" {
mpiJob.Kind = Kind
}
if mpiJob.APIVersion == "" {
mpiJob.APIVersion = SchemeGroupVersion.String()
}
// ...
}
Yes, I am also trying a method like this to solve the problem. However, the missing kind and apiVersion when using GetObjectByPassCache is not the only issue; there are others, such as an error when calling jc.Controller.UpdateJobStatusInApiServer(job, &jobStatus):
MPIJob.training.kubedl.io \"mpi-demo\" is invalid: status.conditions: Invalid value: \"null\": status.conditions in body must be of type array: \"null\"
We may need to spend some time fixing this problem.
It occurs because status.conditions is a required field; it recovers after the first worker starts running. However, it could be made an omitempty field.