SSU-DCN/podmigration-operator

Fail to migrate pod

Closed this issue · 10 comments

Sorry to disturb you, but I have some trouble when I migrate a pod.

This is the environment:

[screenshot]

And the pod's YAML file is:

podyaml

apiVersion: v1
kind: Pod
metadata:
  name: netserver-pod
  labels:
    app: netserver-pod
spec:
  containers:
  - name: netserver-container
    image: zhanglongyao/netserver:v1
    ports:
    - containerPort: 12345
      protocol: UDP
    volumeMounts:
    - name: nfs-volume
      mountPath: /app
  volumes:
  - name: nfs-volume
    hostPath:
      path: /nfs/data/01
The path is an NFS share mounted on the master and the worker nodes. There is also a Service exposing the pod; its YAML file is:

serviceyaml

apiVersion: v1
kind: Service
metadata:
  name: netserver-service
spec:
  type: NodePort  # change the Service type to NodePort
  selector:
    app: netserver-pod
  ports:
  - protocol: UDP
    port: 12345
    targetPort: 12345
    nodePort: 30123  # pick an unused port as the NodePort
But when I run "kubectl migrate netserver-pod agent1" (agent1 is another node), it results in:

problem

[screenshot]

And when I run "kubectl describe pod netserver-pod-migration-33", it shows:

[screenshot]

What does this mean: "failed to start containerd task "15f034b308e847e000cb26288e4cf1c875606a3d388dbea7b6c62396d476e784": OCI runtime restore failed: criu failed: type NOTIFY errno 0
log file: /var/lib/containerd/io.containerd.runtime.v1.linux/k8s.io/15f034b308e847e000cb26288e4cf1c875606a3d388dbea7b6c62396d476e784/restore.log: unknown", and how do I solve it? Please help me, thank you very much! @vutuong

@120L020314 Could you please give me more logs:

  • What are the logs generated by podmigration_controller?
  • Please give me a capture of the newly created pod (the restored pod) at the target node, e.g. "kubectl edit pod netserver-pod-migration-33".
  • Please open a new tab and run "watch ls /var/lib/kubelet/migration/kkk", and monitor it while you run kubectl migrate.
  • Please make sure the NFS shared folder is in sync between the two nodes (a command sketch follows below this list).
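For the last three items, a rough sketch of the commands (the pod name and paths are the ones already used in this thread; the controller logs depend on how you run the controller, so they are not covered here):

# On the master node, while the migration is in progress:
kubectl get pod netserver-pod-migration-33 -o yaml   # full spec/status of the restored pod
watch ls /var/lib/kubelet/migration/kkk              # checkpoint folder used by the migration

# On both the source and the target node, to compare the shared checkpoint folder:
ls -lR /var/lib/kubelet/migration/kkk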

podmigration_controller logs

[screenshots]

capture of the newly created pod

[screenshots]

watch ls /var/lib/kubelet/migration/kkk

[screenshot]

shared file sync

[screenshots]

logs of pod

[screenshot]

Thank you for your answer. Please teach me how to solve this problem, thank you very much!
@vutuong

@120L020314 Ah sorry, please provide the data from the folder /var/lib/kubelet/migration/kkk/netserver during the migration process, and verify whether it matches the data generated by the kubectl checkpoint command.
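For example, something along these lines on the source node (the netserverck destination path is only an example):

# Take a manual checkpoint of the running pod for comparison.
kubectl checkpoint netserver-pod /var/lib/kubelet/migration/kkk/netserverck

# Compare it with the checkpoint data written during kubectl migrate.
diff -qr /var/lib/kubelet/migration/kkk/netserver /var/lib/kubelet/migration/kkk/netserverck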

the data from the folder /var/lib/kubelet/migration/kkk/netserver (created by the migration)

[screenshot]

the data from the command: kubectl checkpoint netserver-pod /var/lib/kubelet/migration/kkk/netserverck

[screenshot]

I tried to find differences between the checkpoint from kubectl checkpoint and the one from kubectl migrate, but they are the same. I don't know why the migration fails. Thank you @vutuong

[screenshot]

Or does my image cause the failure when restoring with criu? @vutuong
However, I tried using docker + criu to restore that container, and it succeeded.

@120L020314 To confirm whether it is a problem with the criu restore, please check the log from that command:
kubectl describe pod netserver-pod-migration-33
It points to the log at /var/lib/containerd/.... Please capture it here.
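To locate and capture it, something like the following should work on the target node (the restore is attempted there, so the log is written on that node, and the container ID in the path changes on every retry):

sudo find /var/lib/containerd/io.containerd.runtime.v1.linux/k8s.io -name restore.log
sudo sh -c 'grep -n "Error" /var/lib/containerd/io.containerd.runtime.v1.linux/k8s.io/*/restore.log'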
Also, please give the full YAML file of the newly created pod.
On the other side, could you please change the controller code in controllers/podmigration_controller.go?
Uncomment the sleep timer at lines 155-156:

log.Info("", "Live-migration", "Step 3 - Wait until checkpoint info are created - completed")
// time.Sleep(10)

Then please try increasing the sleep time to 500 seconds to make sure that all the checkpoint data is saved to the folder and synced between the two nodes. Rerun the controller after modifying the source code, then retest.
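For clarity, the modified region could look roughly like this (assuming the "time" package is already imported in podmigration_controller.go; note that time.Sleep takes a time.Duration, so a 500-second wait is written as 500 * time.Second, not a bare 500):

log.Info("", "Live-migration", "Step 3 - Wait until checkpoint info are created - completed")
// Give the checkpoint data time to be fully written and synced to the shared
// folder before the restore starts on the target node.
time.Sleep(500 * time.Second)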

The result of the command: kubectl describe pod netserver-pod-migration-XX

Name:          netserver-pod-migration-7
Namespace:     default
Priority:      0
Node:          agent2/192.168.31.49
Start Time:    Wed, 20 Mar 2024 13:50:20 +0800
Labels:        app=netserver-pod
Annotations:   snapshotPath: /var/lib/kubelet/migration/kkk/netserver
               snapshotPolicy: restore
               sourcePod: netserver-pod
Status:        Running
IP:            10.244.2.73
IPs:
  IP:          10.244.2.73
Controlled By: Podmigration/netserver-pod-migration-controller-40
Containers:
  netserver-container:
    Container ID:  containerd://ccfdec5ba99103ebf04af8fc957b59861379f708780570799eddfe0b09d3b1dc
    Image:         zhanglongyao/netserver:v1
    Image ID:      docker.io/zhanglongyao/netserver@sha256:dc4c32a455518ad5138fc690511f96481861c271a928433eee2aac9dc9d09c73
    Port:          12345/UDP
    Host Port:     0/UDP
    State:         Terminated
      Reason:      StartError
      Message:     failed to start containerd task "ccfdec5ba99103ebf04af8fc957b59861379f708780570799eddfe0b09d3b1dc": OCI runtime restore failed: criu failed: type NOTIFY errno 0
                   log file: /var/lib/containerd/io.containerd.runtime.v1.linux/k8s.io/ccfdec5ba99103ebf04af8fc957b59861379f708780570799eddfe0b09d3b1dc/restore.log: unknown
      Exit Code:   128
      Started:     Thu, 01 Jan 1970 08:00:00 +0800
      Finished:    Wed, 20 Mar 2024 13:50:46 +0800
    Last State:    Terminated
      Reason:      StartError
      Message:     failed to start containerd task "97ed88f3d49f516139b63eaca05e8f8ee2892dffaf810f12c78c106f57a9115f": OCI runtime restore failed: criu failed: type NOTIFY errno 0
                   log file: /var/lib/containerd/io.containerd.runtime.v1.linux/k8s.io/97ed88f3d49f516139b63eaca05e8f8ee2892dffaf810f12c78c106f57a9115f/restore.log: unknown
      Exit Code:   128
      Started:     Thu, 01 Jan 1970 08:00:00 +0800
      Finished:    Wed, 20 Mar 2024 13:50:45 +0800
    Ready:         False
    Restart Count: 25
    Environment:
    Mounts:
      /app from nfs-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-68rlh (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  nfs-volume:
    Type:          HostPath (bare host directory volume)
    Path:          /nfs/data/01
    HostPathType:
  default-token-68rlh:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-68rlh
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  kubernetes.io/hostname=agent2
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age                From               Message
  ----    ------     ----               ----               -------
  Normal  Scheduled  27s                default-scheduler  Successfully assigned default/netserver-pod-migration-7 to agent2
  Normal  Created    19s (x8 over 26s)  kubelet, agent2    Created container netserver-container
  Normal  Started    19s (x8 over 26s)  kubelet, agent2    Restored container netserver-container from checkpoint /var/lib/kubelet/migration/kkk/netserver/netserver-container
  Normal  Pulled     18s (x9 over 26s)  kubelet, agent2    Container image "zhanglongyao/netserver:v1" already present on machine
It seems the fault is reported in /var/lib/containerd/io.containerd.runtime.v1.linux/k8s.io/97ed88f3d49f516139b63eaca05e8f8ee2892dffaf810f12c78c106f57a9115f/restore.log, so I tried to capture that path.

path: /var/lib/containerd/io.containerd.runtime.v1.linux/k8s.io/97ed88f3d49f516139b63eaca05e8f8ee2892dffaf810f12c78c106f57a9115f/restore.log

[screenshot]

I cannot find the file 97ed88f3d49f516139b63eaca05e8f8ee2892dffaf810f12c78c106f57a9115f. I think the fault is here, but I don't know why or how to solve it.

the full YAML file of the newly created pod

apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"labels":{"app":"netserver-pod"},"name":"netserver-pod","namespace":"default"},"spec":{"containers":[{"image":"zhanglongyao/netserver:v1","name":"netserver-container","ports":[{"containerPort":12345,"protocol":"UDP"}],"volumeMounts":[{"mountPath":"/app","name":"nfs-volume"}]}],"nodeSelector":{"kubernetes.io/hostname":"agent1"},"volumes":[{"hostPath":{"path":"/nfs/data/01"},"name":"nfs-volume"}]}}
    snapshotPath: /var/lib/kubelet/migration/kkk/netserver
    snapshotPolicy: restore
    sourcePod: netserver-pod
  creationTimestamp: "2024-03-20T05:50:20Z"
  generateName: netserver-pod-migration-controller-40-
  labels:
    app: netserver-pod
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
          f:snapshotPath: {}
          f:snapshotPolicy: {}
          f:sourcePod: {}
        f:generateName: {}
        f:labels:
          .: {}
          f:app: {}
        f:ownerReferences:
          .: {}
          k:{"uid":"a370d37b-0b6e-4cba-bc03-eb6af90a3944"}:
            .: {}
            f:apiVersion: {}
            f:blockOwnerDeletion: {}
            f:controller: {}
            f:kind: {}
            f:name: {}
            f:uid: {}
      f:spec:
        f:containers:
          k:{"name":"netserver-container"}:
            .: {}
            f:image: {}
            f:imagePullPolicy: {}
            f:name: {}
            f:ports:
              .: {}
              k:{"containerPort":12345,"protocol":"UDP"}:
                .: {}
                f:containerPort: {}
                f:protocol: {}
            f:resources: {}
            f:terminationMessagePath: {}
            f:terminationMessagePolicy: {}
            f:volumeMounts:
              .: {}
              k:{"mountPath":"/app"}:
                .: {}
                f:mountPath: {}
                f:name: {}
              k:{"mountPath":"/var/run/secrets/kubernetes.io/serviceaccount"}:
                .: {}
                f:mountPath: {}
                f:name: {}
                f:readOnly: {}
        f:dnsPolicy: {}
        f:enableServiceLinks: {}
        f:nodeSelector:
          .: {}
          f:kubernetes.io/hostname: {}
        f:restartPolicy: {}
        f:schedulerName: {}
        f:securityContext: {}
        f:terminationGracePeriodSeconds: {}
        f:volumes:
          .: {}
          k:{"name":"default-token-68rlh"}:
            .: {}
            f:name: {}
            f:secret:
              .: {}
              f:defaultMode: {}
              f:secretName: {}
          k:{"name":"nfs-volume"}:
            .: {}
            f:hostPath:
              .: {}
              f:path: {}
              f:type: {}
            f:name: {}
    manager: main
    operation: Update
    time: "2024-03-20T05:50:20Z"
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
          k:{"type":"ContainersReady"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:message: {}
            f:reason: {}
            f:status: {}
            f:type: {}
          k:{"type":"Initialized"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:status: {}
            f:type: {}
          k:{"type":"Ready"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:message: {}
            f:reason: {}
            f:status: {}
            f:type: {}
        f:containerStatuses: {}
        f:hostIP: {}
        f:phase: {}
        f:podIP: {}
        f:podIPs:
          .: {}
          k:{"ip":"10.244.2.73"}:
            .: {}
            f:ip: {}
        f:startTime: {}
    manager: kubelet
    operation: Update
    time: "2024-03-20T05:50:22Z"
  name: netserver-pod-migration-7
  namespace: default
  ownerReferences:
  - apiVersion: podmig.dcn.ssu.ac.kr/v1
    blockOwnerDeletion: true
    controller: true
    kind: Podmigration
    name: netserver-pod-migration-controller-40
    uid: a370d37b-0b6e-4cba-bc03-eb6af90a3944
  resourceVersion: "336198"
  selfLink: /api/v1/namespaces/default/pods/netserver-pod-migration-7
  uid: 5421fed8-7ffc-4276-b605-a4e2257a759b
spec:
  containers:
  - image: zhanglongyao/netserver:v1
    imagePullPolicy: IfNotPresent
    name: netserver-container
    ports:
    - containerPort: 12345
      protocol: UDP
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /app
      name: nfs-volume
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-68rlh
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: agent2
  nodeSelector:
    kubernetes.io/hostname: agent2
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - hostPath:
      path: /nfs/data/01
      type: ""
    name: nfs-volume
  - name: default-token-68rlh
    secret:
      defaultMode: 420
      secretName: default-token-68rlh
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-03-20T05:50:20Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-03-20T05:50:20Z"
    message: 'containers with unready status: [netserver-container]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-03-20T05:50:20Z"
    message: 'containers with unready status: [netserver-container]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-03-20T05:50:20Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://6bd1f6d89c84f75a4b516d12ffc1282b89422c569dd3c88cc1e8ef199a46d1ba
    image: docker.io/zhanglongyao/netserver:v1
    imageID: docker.io/zhanglongyao/netserver@sha256:dc4c32a455518ad5138fc690511f96481861c271a928433eee2aac9dc9d09c73
    lastState:
      terminated:
        containerID: containerd://6bd1f6d89c84f75a4b516d12ffc1282b89422c569dd3c88cc1e8ef199a46d1ba
        exitCode: 128
        finishedAt: "2024-03-20T05:50:47Z"
        message: |-
          failed to start containerd task "6bd1f6d89c84f75a4b516d12ffc1282b89422c569dd3c88cc1e8ef199a46d1ba": OCI runtime restore failed: criu failed: type NOTIFY errno 0
          log file: /var/lib/containerd/io.containerd.runtime.v1.linux/k8s.io/6bd1f6d89c84f75a4b516d12ffc1282b89422c569dd3c88cc1e8ef199a46d1ba/restore.log: unknown
        reason: StartError
        startedAt: "1970-01-01T00:00:00Z"
    name: netserver-container
    ready: false
    restartCount: 26
    started: false
    state:
      waiting:
        message: 'failed to reserve container name "netserver-container_netserver-pod-migration-7_default_5421fed8-7ffc-4276-b605-a4e2257a759b_27":
          name "netserver-container_netserver-pod-migration-7_default_5421fed8-7ffc-4276-b605-a4e2257a759b_27"
          is reserved for "16b5e46878c1407dd7b4ddc6dc6b81a149a45e9ebf0c5a49aecf85941163e2e8"'
        reason: CreateContainerError
  hostIP: 192.168.31.49
  phase: Running
  podIP: 10.244.2.73
  podIPs:
  - ip: 10.244.2.73
  qosClass: BestEffort
  startTime: "2024-03-20T05:50:20Z"

And I changed the source code to sleep(500), but it doesn't work. @vutuong Sorry to disturb you.

@120L020314 You are welcome. Maybe there really is a bug here.
One question: does this setup work well with other apps and only fail with your app?
One more step to check the problem: I wrote an example YAML file to deploy a new pod from checkpoint data here:
https://github.com/SSU-DCN/podmigration-operator/blob/main/config/samples/podmig_v1_restore.yaml
Could you please try to modify it based on your own pod manifest YAML file, then apply it?
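As a rough sketch of how it might be adapted (the annotation names and the checkpoint path are the ones visible on the restored pod earlier in this thread, and the container spec is taken from your original manifest; the pod name is hypothetical, so please align the rest with the structure of podmig_v1_restore.yaml):

apiVersion: v1
kind: Pod
metadata:
  name: netserver-pod-restore   # hypothetical name for the restored copy
  labels:
    app: netserver-pod
  annotations:
    snapshotPolicy: restore
    snapshotPath: /var/lib/kubelet/migration/kkk/netserver
spec:
  nodeSelector:
    kubernetes.io/hostname: agent2
  containers:
  - name: netserver-container
    image: zhanglongyao/netserver:v1
    ports:
    - containerPort: 12345
      protocol: UDP
    volumeMounts:
    - name: nfs-volume
      mountPath: /app
  volumes:
  - name: nfs-volume
    hostPath:
      path: /nfs/data/01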

Oh, sorry, I went to the agent2 node and found the restore.log. Its contents look like this:
(00.070390) 1: Error (criu/files-reg.c:1831): Can't open file tmp/hsperfdata_root/1 on restore: No such file or directory
(00.070395) 1: Error (criu/files-reg.c:1767): Can't open file tmp/hsperfdata_root/1: No such file or directory
(00.070397) 1: Error (criu/mem.c:1383): `- Can't open vma
(00.103003) mnt: Switching to new ns to clean ghosts
(00.103362) Error (criu/cr-restore.c:2397): Restoring FAILED.
I think this is the reason why the restore failed, but I don't know what those files are. Do you know about that? Thank you very much. @vutuong

[screenshot]

Thank you for your prompt and patient response! I solved my problem by changing my image. The problem was that criu could not reopen the file tmp/hsperfdata_root/1 during restore, so I changed my image so that it does not create that file (it doesn't matter for my app). Then I tried to migrate again, and it works. Thank you for your teaching; I learned how to read the containerd logs and criu's restore.log!
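For anyone hitting the same error: the tmp/hsperfdata_<user> files are created by the JVM's performance counters. Assuming the netserver image runs on a JVM (an assumption, not something confirmed in this thread), an alternative to rebuilding the image is to disable those counters via JAVA_TOOL_OPTIONS in the pod spec, for example:

apiVersion: v1
kind: Pod
metadata:
  name: netserver-pod
  labels:
    app: netserver-pod
spec:
  containers:
  - name: netserver-container
    image: zhanglongyao/netserver:v1
    env:
    # Hypothetical tweak, assuming a JVM-based image: -XX:-UsePerfData stops
    # the JVM from creating /tmp/hsperfdata_<user>/<pid>, so criu has no
    # missing per-process temp file to reopen on restore.
    - name: JAVA_TOOL_OPTIONS
      value: "-XX:-UsePerfData"
    ports:
    - containerPort: 12345
      protocol: UDP
    volumeMounts:
    - name: nfs-volume
      mountPath: /app
  volumes:
  - name: nfs-volume
    hostPath:
      path: /nfs/data/01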
[screenshot]
@vutuong