Fail to migrate pod
Sorry to disturb you, but I have some trouble when I migrate a pod.
This is the environment:
podyaml
apiVersion: v1
kind: Pod
metadata:
  name: netserver-pod
  labels:
    app: netserver-pod
spec:
  containers:
  - name: netserver-container
    image: zhanglongyao/netserver:v1
    ports:
    - containerPort: 12345
      protocol: UDP
    volumeMounts:
    - name: nfs-volume
      mountPath: /app
  volumes:
  - name: nfs-volume
    hostPath:
      path: /nfs/data/01
The path is an NFS share mounted on both the master and the nodes, and there is a Service that exposes the pod. Its YAML file looks like this:
serviceyaml
apiVersion: v1
kind: Service
metadata:
  name: netserver-service
spec:
  type: NodePort  # change the service type to NodePort
  selector:
    app: netserver-pod
  ports:
  - protocol: UDP
    port: 12345
    targetPort: 12345
    nodePort: 30123  # pick an unused port as the NodePort
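Since /nfs/data/01 is an NFS export shared between the master and the nodes, the same data could also be mounted as a native nfs volume instead of a hostPath. A minimal sketch, where the NFS server address is a placeholder that does not come from this thread:
apiVersion: v1
kind: Pod
metadata:
  name: netserver-pod
  labels:
    app: netserver-pod
spec:
  containers:
  - name: netserver-container
    image: zhanglongyao/netserver:v1
    volumeMounts:
    - name: nfs-volume
      mountPath: /app
  volumes:
  - name: nfs-volume
    nfs:
      server: 192.168.31.10   # placeholder: address of the NFS server
      path: /nfs/data/01      # assumes the export path matches the mount point used above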
But when I run kubectl migrate netserver-pod agent1 (agent1 is another node), it fails:
problem
And when I run kubectl describe pod netserver-pod-migration-33, it shows:
What does this mean: "failed to start containerd task "15f034b308e847e000cb26288e4cf1c875606a3d388dbea7b6c62396d476e784": OCI runtime restore failed: criu failed: type NOTIFY errno 0
log file: /var/lib/containerd/io.containerd.runtime.v1.linux/k8s.io/15f034b308e847e000cb26288e4cf1c875606a3d388dbea7b6c62396d476e784/restore.log: unknown", and how do I solve it? Please help me, thank you very much! @vutuong
@120L020314 Could you please give me more logs:
- What are the logs generated from podmigration_controller?
- Please give me a capture of the newly created pod (the restored pod) at the target node, e.g. "kubectl edit pod netserver-pod-migration-33"
- Please open a new tab and run "watch ls /var/lib/kubelet/migration/kkk", and monitor it while you run kubectl migrate
- Please make sure the NFS shared folder is synced between the two nodes.
podmigration_controller logs
capture of the newly created pod
watch ls /var/lib/kubelet/migration/kkk
shared file sync
logs of pod
Thank you for your answer. Please teach me how to solve this problem, thank you very much!
@vutuong
@120L020314 Ah sorry, please provide the data from the folder /var/lib/kubelet/migration/kkk/netserver
during the migration process. Verify whether it matches the data generated by the kubectl checkpoint
command.
The data from the folder /var/lib/kubelet/migration/kkk/netserver (created by the migration):
The data from the command kubectl checkpoint netserver-pod /var/lib/kubelet/migration/kkk/netserverck:
I tried to find a difference between the checkpoint created by kubectl checkpoint and the one created by kubectl migrate, but they are the same, so I don't know why the migration fails. Thank you @vutuong
Or could my image be the reason the restore with CRIU fails? @vutuong
However, when I use docker + CRIU directly, I can restore that container successfully.
@120L020314 To confirm whether it is a problem with the CRIU restore, please check the log reported by that command:
kubectl describe pod netserver-pod-migration-33
It points to the log at /var/lib/containerd/.... Please capture it here.
Also, please give the full YAML file of the newly created pod.
On the other side, could you please change the controller code at controllers/podmigration_controller.go?
Uncomment the sleep timer at lines 155-156:
log.Info("", "Live-migration", "Step 3 - Wait until checkpoint info are created - completed")
// time.Sleep(10)
Then, please try to increase the sleep time to 500 seconds to make sure that all the checkpoint data is saved to the folder and synced between the two nodes. Rerun the controller after modifying the source code, then retest.
The result of the command kubectl describe pod netserver-pod-migration-XX:
Name: netserver-pod-migration-7
Namespace: default
Priority: 0
Node: agent2/192.168.31.49
Start Time: Wed, 20 Mar 2024 13:50:20 +0800
Labels: app=netserver-pod
Annotations: snapshotPath: /var/lib/kubelet/migration/kkk/netserver
snapshotPolicy: restore
sourcePod: netserver-pod
Status: Running
IP: 10.244.2.73
IPs:
IP: 10.244.2.73
Controlled By: Podmigration/netserver-pod-migration-controller-40
Containers:
netserver-container:
Container ID: containerd://ccfdec5ba99103ebf04af8fc957b59861379f708780570799eddfe0b09d3b1dc
Image: zhanglongyao/netserver:v1
Image ID: docker.io/zhanglongyao/netserver@sha256:dc4c32a455518ad5138fc690511f96481861c271a928433eee2aac9dc9d09c73
Port: 12345/UDP
Host Port: 0/UDP
State: Terminated
Reason: StartError
Message: failed to start containerd task "ccfdec5ba99103ebf04af8fc957b59861379f708780570799eddfe0b09d3b1dc": OCI runtime restore failed: criu failed: type NOTIFY errno 0
log file: /var/lib/containerd/io.containerd.runtime.v1.linux/k8s.io/ccfdec5ba99103ebf04af8fc957b59861379f708780570799eddfe0b09d3b1dc/restore.log: unknown
Exit Code: 128
Started: Thu, 01 Jan 1970 08:00:00 +0800
Finished: Wed, 20 Mar 2024 13:50:46 +0800
Last State: Terminated
Reason: StartError
Message: failed to start containerd task "97ed88f3d49f516139b63eaca05e8f8ee2892dffaf810f12c78c106f57a9115f": OCI runtime restore failed: criu failed: type NOTIFY errno 0
log file: /var/lib/containerd/io.containerd.runtime.v1.linux/k8s.io/97ed88f3d49f516139b63eaca05e8f8ee2892dffaf810f12c78c106f57a9115f/restore.log: unknown
Exit Code: 128
Started: Thu, 01 Jan 1970 08:00:00 +0800
Finished: Wed, 20 Mar 2024 13:50:45 +0800
Ready: False
Restart Count: 25
Environment:
Mounts:
/app from nfs-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-68rlh (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
nfs-volume:
Type: HostPath (bare host directory volume)
Path: /nfs/data/01
HostPathType:
default-token-68rlh:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-68rlh
Optional: false
QoS Class: BestEffort
Node-Selectors: kubernetes.io/hostname=agent2
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
Normal Scheduled 27s default-scheduler Successfully assigned default/netserver-pod-migration-7 to agent2
Normal Created 19s (x8 over 26s) kubelet, agent2 Created container netserver-container
Normal Started 19s (x8 over 26s) kubelet, agent2 Restored container netserver-container from checkpoint /var/lib/kubelet/migration/kkk/netserver/netserver-container
Normal Pulled 18s (x9 over 26s) kubelet, agent2 Container image "zhanglongyao/netserver:v1" already present on machine
It seems to fail at /var/lib/containerd/io.containerd.runtime.v1.linux/k8s.io/97ed88f3d49f516139b63eaca05e8f8ee2892dffaf810f12c78c106f57a9115f/restore.log, so I went to look at that path:
path: /var/lib/containerd/io.containerd.runtime.v1.linux/k8s.io/97ed88f3d49f516139b63eaca05e8f8ee2892dffaf810f12c78c106f57a9115f/restore.log
I cannot find 97ed88f3d49f516139b63eaca05e8f8ee2892dffaf810f12c78c106f57a9115f there, so I think the fault is here, but I don't know why or how to solve it.
The full YAML file of the newly created pod:
apiVersion: v1
kind: Pod
metadata:
annotations:
kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"labels":{"app":"netserver-pod"},"name":"netserver-pod","namespace":"default"},"spec":{"containers":[{"image":"zhanglongyao/netserver:v1","name":"netserver-container","ports":[{"containerPort":12345,"protocol":"UDP"}],"volumeMounts":[{"mountPath":"/app","name":"nfs-volume"}]}],"nodeSelector":{"kubernetes.io/hostname":"agent1"},"volumes":[{"hostPath":{"path":"/nfs/data/01"},"name":"nfs-volume"}]}}
snapshotPath: /var/lib/kubelet/migration/kkk/netserver
snapshotPolicy: restore
sourcePod: netserver-pod
creationTimestamp: "2024-03-20T05:50:20Z"
generateName: netserver-pod-migration-controller-40-
labels:
app: netserver-pod
managedFields:
- apiVersion: v1
fieldsType: FieldsV1
fieldsV1:
f:metadata:
f:annotations:
.: {}
f:kubectl.kubernetes.io/last-applied-configuration: {}
f:snapshotPath: {}
f:snapshotPolicy: {}
f:sourcePod: {}
f:generateName: {}
f:labels:
.: {}
f:app: {}
f:ownerReferences:
.: {}
k:{"uid":"a370d37b-0b6e-4cba-bc03-eb6af90a3944"}:
.: {}
f:apiVersion: {}
f:blockOwnerDeletion: {}
f:controller: {}
f:kind: {}
f:name: {}
f:uid: {}
f:spec:
f:containers:
k:{"name":"netserver-container"}:
.: {}
f:image: {}
f:imagePullPolicy: {}
f:name: {}
f:ports:
.: {}
k:{"containerPort":12345,"protocol":"UDP"}:
.: {}
f:containerPort: {}
f:protocol: {}
f:resources: {}
f:terminationMessagePath: {}
f:terminationMessagePolicy: {}
f:volumeMounts:
.: {}
k:{"mountPath":"/app"}:
.: {}
f:mountPath: {}
f:name: {}
k:{"mountPath":"/var/run/secrets/kubernetes.io/serviceaccount"}:
.: {}
f:mountPath: {}
f:name: {}
f:readOnly: {}
f:dnsPolicy: {}
f:enableServiceLinks: {}
f:nodeSelector:
.: {}
f:kubernetes.io/hostname: {}
f:restartPolicy: {}
f:schedulerName: {}
f:securityContext: {}
f:terminationGracePeriodSeconds: {}
f:volumes:
.: {}
k:{"name":"default-token-68rlh"}:
.: {}
f:name: {}
f:secret:
.: {}
f:defaultMode: {}
f:secretName: {}
k:{"name":"nfs-volume"}:
.: {}
f:hostPath:
.: {}
f:path: {}
f:type: {}
f:name: {}
manager: main
operation: Update
time: "2024-03-20T05:50:20Z"
- apiVersion: v1
fieldsType: FieldsV1
fieldsV1:
f:status:
f:conditions:
k:{"type":"ContainersReady"}:
.: {}
f:lastProbeTime: {}
f:lastTransitionTime: {}
f:message: {}
f:reason: {}
f:status: {}
f:type: {}
k:{"type":"Initialized"}:
.: {}
f:lastProbeTime: {}
f:lastTransitionTime: {}
f:status: {}
f:type: {}
k:{"type":"Ready"}:
.: {}
f:lastProbeTime: {}
f:lastTransitionTime: {}
f:message: {}
f:reason: {}
f:status: {}
f:type: {}
f:containerStatuses: {}
f:hostIP: {}
f:phase: {}
f:podIP: {}
f:podIPs:
.: {}
k:{"ip":"10.244.2.73"}:
.: {}
f:ip: {}
f:startTime: {}
manager: kubelet
operation: Update
time: "2024-03-20T05:50:22Z"
name: netserver-pod-migration-7
namespace: default
ownerReferences:
- apiVersion: podmig.dcn.ssu.ac.kr/v1
blockOwnerDeletion: true
controller: true
kind: Podmigration
name: netserver-pod-migration-controller-40
uid: a370d37b-0b6e-4cba-bc03-eb6af90a3944
resourceVersion: "336198"
selfLink: /api/v1/namespaces/default/pods/netserver-pod-migration-7
uid: 5421fed8-7ffc-4276-b605-a4e2257a759b
spec:
containers:
- image: zhanglongyao/netserver:v1
imagePullPolicy: IfNotPresent
name: netserver-container
ports:
- containerPort: 12345
protocol: UDP
resources: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /app
  name: nfs-volume
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: default-token-68rlh
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
nodeName: agent2
nodeSelector:
kubernetes.io/hostname: agent2
preemptionPolicy: PreemptLowerPriority
priority: 0
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: default
serviceAccountName: default
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoExecute
  key: node.kubernetes.io/not-ready
  operator: Exists
  tolerationSeconds: 300
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists
  tolerationSeconds: 300
volumes:
- hostPath:
    path: /nfs/data/01
    type: ""
  name: nfs-volume
- name: default-token-68rlh
secret:
defaultMode: 420
secretName: default-token-68rlh
status:
conditions:
- lastProbeTime: null
  lastTransitionTime: "2024-03-20T05:50:20Z"
  status: "True"
  type: Initialized
- lastProbeTime: null
  lastTransitionTime: "2024-03-20T05:50:20Z"
  message: 'containers with unready status: [netserver-container]'
  reason: ContainersNotReady
  status: "False"
  type: Ready
- lastProbeTime: null
  lastTransitionTime: "2024-03-20T05:50:20Z"
  message: 'containers with unready status: [netserver-container]'
  reason: ContainersNotReady
  status: "False"
  type: ContainersReady
- lastProbeTime: null
  lastTransitionTime: "2024-03-20T05:50:20Z"
  status: "True"
  type: PodScheduled
containerStatuses:
- containerID: containerd://6bd1f6d89c84f75a4b516d12ffc1282b89422c569dd3c88cc1e8ef199a46d1ba
image: docker.io/zhanglongyao/netserver:v1
imageID: docker.io/zhanglongyao/netserver@sha256:dc4c32a455518ad5138fc690511f96481861c271a928433eee2aac9dc9d09c73
lastState:
terminated:
containerID: containerd://6bd1f6d89c84f75a4b516d12ffc1282b89422c569dd3c88cc1e8ef199a46d1ba
exitCode: 128
finishedAt: "2024-03-20T05:50:47Z"
message: |-
failed to start containerd task "6bd1f6d89c84f75a4b516d12ffc1282b89422c569dd3c88cc1e8ef199a46d1ba": OCI runtime restore failed: criu failed: type NOTIFY errno 0
log file: /var/lib/containerd/io.containerd.runtime.v1.linux/k8s.io/6bd1f6d89c84f75a4b516d12ffc1282b89422c569dd3c88cc1e8ef199a46d1ba/restore.log: unknown
reason: StartError
startedAt: "1970-01-01T00:00:00Z"
name: netserver-container
ready: false
restartCount: 26
started: false
state:
waiting:
message: 'failed to reserve container name "netserver-container_netserver-pod-migration-7_default_5421fed8-7ffc-4276-b605-a4e2257a759b_27":
name "netserver-container_netserver-pod-migration-7_default_5421fed8-7ffc-4276-b605-a4e2257a759b_27"
is reserved for "16b5e46878c1407dd7b4ddc6dc6b81a149a45e9ebf0c5a49aecf85941163e2e8"'
reason: CreateContainerError
hostIP: 192.168.31.49
phase: Running
podIP: 10.244.2.73
podIPs:
- ip: 10.244.2.73
qosClass: BestEffort
startTime: "2024-03-20T05:50:20Z"
And I changed the source code to "sleep(500)", but it doesn't work. @vutuong, sorry to disturb you.
@120L020314 You are welcome. Maybe there really is a bug here.
One question: does this setup work well with other apps and only fail with your app?
One more step to check the problem: I wrote an example YAML file for deploying a new pod from checkpoint data here:
https://github.com/SSU-DCN/podmigration-operator/blob/main/config/samples/podmig_v1_restore.yaml
Could you please try to modify it based on your own pod manifest YAML file, then apply it?
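A rough sketch of what such a restore manifest could look like when adapted to this pod, assuming it uses the same snapshotPolicy/snapshotPath/sourcePod annotations that appear on the restored pod above (the linked sample may be structured differently):
apiVersion: v1
kind: Pod
metadata:
  name: netserver-pod-restore             # hypothetical name for the restored pod
  labels:
    app: netserver-pod
  annotations:
    snapshotPolicy: restore                                   # restore instead of checkpoint
    snapshotPath: /var/lib/kubelet/migration/kkk/netserver    # folder holding the checkpoint data
    sourcePod: netserver-pod                                  # pod the checkpoint was taken from
spec:
  nodeSelector:
    kubernetes.io/hostname: agent2         # target node for the restore
  containers:
  - name: netserver-container
    image: zhanglongyao/netserver:v1
    ports:
    - containerPort: 12345
      protocol: UDP
    volumeMounts:
    - name: nfs-volume
      mountPath: /app
  volumes:
  - name: nfs-volume
    hostPath:
      path: /nfs/data/01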
Oh, sorry, I went to the agent2 node and found the restore.log. Its contents look like this:
(00.070390) 1: Error (criu/files-reg.c:1831): Can't open file tmp/hsperfdata_root/1 on restore: No such file or directory
(00.070395) 1: Error (criu/files-reg.c:1767): Can't open file tmp/hsperfdata_root/1: No such file or directory
(00.070397) 1: Error (criu/mem.c:1383): `- Can't open vma
(00.103003) mnt: Switching to new ns to clean ghosts
(00.103362) Error (criu/cr-restore.c:2397): Restoring FAILED.
I think this is the reason why the restore failed, but I don't know what those files are. Do you know anything about them? Thank you very much. @vutuong
Thank you for your prompt and patient response! I solved my problem by changing my image. The problem is that CRIU cannot restore the container because of the file tmp/hsperfdata_root/1, so I changed my image so that it does not create that file (it doesn't matter for my app). Then I tried the migration again and it works. Thank you for your teaching; I learned how to read the containerd logs and CRIU's restore.log!
@vutuong
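For anyone who hits the same error: /tmp/hsperfdata_<user>/<pid> files are created by the HotSpot JVM's performance-data feature. Assuming the netserver image runs a JVM (suggested by the hsperfdata directory in the restore.log, but not confirmed in this thread), the file can also be suppressed without rebuilding the application by turning that feature off, for example:
apiVersion: v1
kind: Pod
metadata:
  name: netserver-pod
  labels:
    app: netserver-pod
spec:
  containers:
  - name: netserver-container
    image: zhanglongyao/netserver:v1
    env:
    - name: JAVA_TOOL_OPTIONS      # read automatically by the JVM at startup
      value: "-XX:-UsePerfData"    # stop the JVM from creating /tmp/hsperfdata_* files
    ports:
    - containerPort: 12345
      protocol: UDP
    volumeMounts:
    - name: nfs-volume
      mountPath: /app
  volumes:
  - name: nfs-volume
    hostPath:
      path: /nfs/data/01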