SSU-DCN/podmigration-operator

Fail to migrate pod

Closed this issue · 10 comments

Sorry to disturb you, but I have some trouble when I migrate a pod.

This is the environment:

[screenshot]

And the pod's YAML file is:

podyaml

apiVersion: v1
kind: Pod
metadata:
  name: netserver-pod
  labels:
    app: netserver-pod
spec:
  containers:
  - name: netserver-container
    image: zhanglongyao/netserver:v1
    ports:
    - containerPort: 12345
      protocol: UDP
    volumeMounts:
    - name: nfs-volume
      mountPath: /app
  volumes:
  - name: nfs-volume
    hostPath:
      path: /nfs/data/01
The path is an NFS share mounted on the master and the worker nodes. There is also a Service exposing the pod; its YAML file is:

serviceyaml

apiVersion: v1
kind: Service
metadata:
  name: netserver-service
spec:
  type: NodePort  # change the Service type to NodePort
  selector:
    app: netserver-pod
  ports:
  - protocol: UDP
    port: 12345
    targetPort: 12345
    nodePort: 30123  # pick an unused port as the NodePort
But when I run "kubectl migrate netserver-pod agent1" (agent1 is another node), it results in:

problem

[screenshot]

And when I run "kubectl describe pod netserver-pod-migration-33", it shows:

[screenshot]

What does this mean: "failed to start containerd task "15f034b308e847e000cb26288e4cf1c875606a3d388dbea7b6c62396d476e784": OCI runtime restore failed: criu failed: type NOTIFY errno 0
log file: /var/lib/containerd/io.containerd.runtime.v1.linux/k8s.io/15f034b308e847e000cb26288e4cf1c875606a3d388dbea7b6c62396d476e784/restore.log: unknown", and how do I solve it? Please help me, thank you very much! @vutuong

@120L020314 Could you please give me more logs:

  • What are the logs generated by podmigration_controller?
  • Please give me a capture of the newly created pod (the restored pod) at the target node, e.g. "kubectl edit pod netserver-pod-migration-33".
  • Please open a new tab and run "watch ls /var/lib/kubelet/migration/kkk", and monitor it while you run kubectl migrate.
  • Please make sure the NFS shared folder is in sync between the two nodes (a command sketch follows below this list).
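For the last three items, a rough sketch of the commands (the pod name and paths are the ones already used in this thread; the controller logs depend on how you run the controller, so they are not covered here):

# On the master node, while the migration is in progress:
kubectl get pod netserver-pod-migration-33 -o yaml   # full spec/status of the restored pod
watch ls /var/lib/kubelet/migration/kkk              # checkpoint folder used by the migration

# On both the source and the target node, to compare the shared checkpoint folder:
ls -lR /var/lib/kubelet/migration/kkk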

podmigration_controller logs

[screenshots]

capture of the newly created pod

[screenshots]

watch ls /var/lib/kubelet/migration/kkk

[screenshot]

shared file sync

[screenshots]

logs of pod

[screenshot]

Thank you for your answer. Please teach me how to solve this problem, thank you very much!
@vutuong

@120L020314 Ah sorry, please provide the data from the folder /var/lib/kubelet/migration/kkk/netserver during the migration process, and verify whether it matches the data generated by the kubectl checkpoint command.
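For example, something along these lines on the source node (the netserverck destination path is only an example):

# Take a manual checkpoint of the running pod for comparison.
kubectl checkpoint netserver-pod /var/lib/kubelet/migration/kkk/netserverck

# Compare it with the checkpoint data written during kubectl migrate.
diff -qr /var/lib/kubelet/migration/kkk/netserver /var/lib/kubelet/migration/kkk/netserverck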

the data from the folder /var/lib/kubelet/migration/kkk/netserver (created by the migration)

[screenshot]

the data from the command: kubectl checkpoint netserver-pod /var/lib/kubelet/migration/kkk/netserverck

[screenshot]

I tried to find differences between the checkpoint from kubectl checkpoint and the one from kubectl migrate, but they are the same. I don't know why the migration fails. Thank you @vutuong

[screenshot]

Or does my image cause the failure when restoring with criu? @vutuong
However, I tried using docker + criu to restore that container, and it succeeded.

@120L020314 To confirm whether it is a problem with the criu restore, please check the log from that command:
kubectl describe pod netserver-pod-migration-33
It points to the log at /var/lib/containerd/.... Please capture it here.
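To locate and capture it, something like the following should work on the target node (the restore is attempted there, so the log is written on that node, and the container ID in the path changes on every retry):

sudo find /var/lib/containerd/io.containerd.runtime.v1.linux/k8s.io -name restore.log
sudo sh -c 'grep -n "Error" /var/lib/containerd/io.containerd.runtime.v1.linux/k8s.io/*/restore.log'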
Also, please give the full YAML file of the newly created pod.
On the other side, could you please change the controller code in controllers/podmigration_controller.go?
Uncomment the sleep timer at lines 155-156:

log.Info("", "Live-migration", "Step 3 - Wait until checkpoint info are created - completed")
// time.Sleep(10)

Then please try increasing the sleep time to 500 seconds to make sure that all the checkpoint data is saved to the folder and synced between the two nodes. Rerun the controller after modifying the source code, then retest.
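For clarity, the modified region could look roughly like this (assuming the "time" package is already imported in podmigration_controller.go; note that time.Sleep takes a time.Duration, so a 500-second wait is written as 500 * time.Second, not a bare 500):

log.Info("", "Live-migration", "Step 3 - Wait until checkpoint info are created - completed")
// Give the checkpoint data time to be fully written and synced to the shared
// folder before the restore starts on the target node.
time.Sleep(500 * time.Second)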

The result of the command: kubectl describe pod netserver-pod-migration-XX

Name:          netserver-pod-migration-7
Namespace:     default
Priority:      0
Node:          agent2/192.168.31.49
Start Time:    Wed, 20 Mar 2024 13:50:20 +0800
Labels:        app=netserver-pod
Annotations:   snapshotPath: /var/lib/kubelet/migration/kkk/netserver
               snapshotPolicy: restore
               sourcePod: netserver-pod
Status:        Running
IP:            10.244.2.73
IPs:
  IP:          10.244.2.73
Controlled By: Podmigration/netserver-pod-migration-controller-40
Containers:
  netserver-container:
    Container ID:  containerd://ccfdec5ba99103ebf04af8fc957b59861379f708780570799eddfe0b09d3b1dc
    Image:         zhanglongyao/netserver:v1
    Image ID:      docker.io/zhanglongyao/netserver@sha256:dc4c32a455518ad5138fc690511f96481861c271a928433eee2aac9dc9d09c73
    Port:          12345/UDP
    Host Port:     0/UDP
    State:         Terminated
      Reason:      StartError
      Message:     failed to start containerd task "ccfdec5ba99103ebf04af8fc957b59861379f708780570799eddfe0b09d3b1dc": OCI runtime restore failed: criu failed: type NOTIFY errno 0
                   log file: /var/lib/containerd/io.containerd.runtime.v1.linux/k8s.io/ccfdec5ba99103ebf04af8fc957b59861379f708780570799eddfe0b09d3b1dc/restore.log: unknown
      Exit Code:   128
      Started:     Thu, 01 Jan 1970 08:00:00 +0800
      Finished:    Wed, 20 Mar 2024 13:50:46 +0800
    Last State:    Terminated
      Reason:      StartError
      Message:     failed to start containerd task "97ed88f3d49f516139b63eaca05e8f8ee2892dffaf810f12c78c106f57a9115f": OCI runtime restore failed: criu failed: type NOTIFY errno 0
                   log file: /var/lib/containerd/io.containerd.runtime.v1.linux/k8s.io/97ed88f3d49f516139b63eaca05e8f8ee2892dffaf810f12c78c106f57a9115f/restore.log: unknown
      Exit Code:   128
      Started:     Thu, 01 Jan 1970 08:00:00 +0800
      Finished:    Wed, 20 Mar 2024 13:50:45 +0800
    Ready:         False
    Restart Count: 25
    Environment:
    Mounts:
      /app from nfs-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-68rlh (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  nfs-volume:
    Type:          HostPath (bare host directory volume)
    Path:          /nfs/data/01
    HostPathType:
  default-token-68rlh:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-68rlh
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  kubernetes.io/hostname=agent2
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age                From               Message
  ----    ------     ----               ----               -------
  Normal  Scheduled  27s                default-scheduler  Successfully assigned default/netserver-pod-migration-7 to agent2
  Normal  Created    19s (x8 over 26s)  kubelet, agent2    Created container netserver-container
  Normal  Started    19s (x8 over 26s)  kubelet, agent2    Restored container netserver-container from checkpoint /var/lib/kubelet/migration/kkk/netserver/netserver-container
  Normal  Pulled     18s (x9 over 26s)  kubelet, agent2    Container image "zhanglongyao/netserver:v1" already present on machine
It seems the fault is reported in /var/lib/containerd/io.containerd.runtime.v1.linux/k8s.io/97ed88f3d49f516139b63eaca05e8f8ee2892dffaf810f12c78c106f57a9115f/restore.log, so I tried to capture that path.

path: /var/lib/containerd/io.containerd.runtime.v1.linux/k8s.io/97ed88f3d49f516139b63eaca05e8f8ee2892dffaf810f12c78c106f57a9115f/restore.log

[screenshot]

I cannot find the file 97ed88f3d49f516139b63eaca05e8f8ee2892dffaf810f12c78c106f57a9115f. I think the fault is here, but I don't know why or how to solve it.

the full YAML file of the newly created pod

apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"labels":{"app":"netserver-pod"},"name":"netserver-pod","namespace":"default"},"spec":{"containers":[{"image":"zhanglongyao/netserver:v1","name":"netserver-container","ports":[{"containerPort":12345,"protocol":"UDP"}],"volumeMounts":[{"mountPath":"/app","name":"nfs-volume"}]}],"nodeSelector":{"kubernetes.io/hostname":"agent1"},"volumes":[{"hostPath":{"path":"/nfs/data/01"},"name":"nfs-volume"}]}}
    snapshotPath: /var/lib/kubelet/migration/kkk/netserver
    snapshotPolicy: restore
    sourcePod: netserver-pod
  creationTimestamp: "2024-03-20T05:50:20Z"
  generateName: netserver-pod-migration-controller-40-
  labels:
    app: netserver-pod
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
          f:snapshotPath: {}
          f:snapshotPolicy: {}
          f:sourcePod: {}
        f:generateName: {}
        f:labels:
          .: {}
          f:app: {}
        f:ownerReferences:
          .: {}
          k:{"uid":"a370d37b-0b6e-4cba-bc03-eb6af90a3944"}:
            .: {}
            f:apiVersion: {}
            f:blockOwnerDeletion: {}
            f:controller: {}
            f:kind: {}
            f:name: {}
            f:uid: {}
      f:spec:
        f:containers:
          k:{"name":"netserver-container"}:
            .: {}
            f:image: {}
            f:imagePullPolicy: {}
            f:name: {}
            f:ports:
              .: {}
              k:{"containerPort":12345,"protocol":"UDP"}:
                .: {}
                f:containerPort: {}
                f:protocol: {}
            f:resources: {}
            f:terminationMessagePath: {}
            f:terminationMessagePolicy: {}
            f:volumeMounts:
              .: {}
              k:{"mountPath":"/app"}:
                .: {}
                f:mountPath: {}
                f:name: {}
              k:{"mountPath":"/var/run/secrets/kubernetes.io/serviceaccount"}:
                .: {}
                f:mountPath: {}
                f:name: {}
                f:readOnly: {}
        f:dnsPolicy: {}
        f:enableServiceLinks: {}
        f:nodeSelector:
          .: {}
          f:kubernetes.io/hostname: {}
        f:restartPolicy: {}
        f:schedulerName: {}
        f:securityContext: {}
        f:terminationGracePeriodSeconds: {}
        f:volumes:
          .: {}
          k:{"name":"default-token-68rlh"}:
            .: {}
            f:name: {}
            f:secret:
              .: {}
              f:defaultMode: {}
              f:secretName: {}
          k:{"name":"nfs-volume"}:
            .: {}
            f:hostPath:
              .: {}
              f:path: {}
              f:type: {}
            f:name: {}
    manager: main
    operation: Update
    time: "2024-03-20T05:50:20Z"
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
          k:{"type":"ContainersReady"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:message: {}
            f:reason: {}
            f:status: {}
            f:type: {}
          k:{"type":"Initialized"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:status: {}
            f:type: {}
          k:{"type":"Ready"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:message: {}
            f:reason: {}
            f:status: {}
            f:type: {}
        f:containerStatuses: {}
        f:hostIP: {}
        f:phase: {}
        f:podIP: {}
        f:podIPs:
          .: {}
          k:{"ip":"10.244.2.73"}:
            .: {}
            f:ip: {}
        f:startTime: {}
    manager: kubelet
    operation: Update
    time: "2024-03-20T05:50:22Z"
  name: netserver-pod-migration-7
  namespace: default
  ownerReferences:
  - apiVersion: podmig.dcn.ssu.ac.kr/v1
    blockOwnerDeletion: true
    controller: true
    kind: Podmigration
    name: netserver-pod-migration-controller-40
    uid: a370d37b-0b6e-4cba-bc03-eb6af90a3944
  resourceVersion: "336198"
  selfLink: /api/v1/namespaces/default/pods/netserver-pod-migration-7
  uid: 5421fed8-7ffc-4276-b605-a4e2257a759b
spec:
  containers:
  - image: zhanglongyao/netserver:v1
    imagePullPolicy: IfNotPresent
    name: netserver-container
    ports:
    - containerPort: 12345
      protocol: UDP
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /app
      name: nfs-volume
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-68rlh
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: agent2
  nodeSelector:
    kubernetes.io/hostname: agent2
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - hostPath:
      path: /nfs/data/01
      type: ""
    name: nfs-volume
  - name: default-token-68rlh
    secret:
      defaultMode: 420
      secretName: default-token-68rlh
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-03-20T05:50:20Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-03-20T05:50:20Z"
    message: 'containers with unready status: [netserver-container]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-03-20T05:50:20Z"
    message: 'containers with unready status: [netserver-container]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-03-20T05:50:20Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://6bd1f6d89c84f75a4b516d12ffc1282b89422c569dd3c88cc1e8ef199a46d1ba
    image: docker.io/zhanglongyao/netserver:v1
    imageID: docker.io/zhanglongyao/netserver@sha256:dc4c32a455518ad5138fc690511f96481861c271a928433eee2aac9dc9d09c73
    lastState:
      terminated:
        containerID: containerd://6bd1f6d89c84f75a4b516d12ffc1282b89422c569dd3c88cc1e8ef199a46d1ba
        exitCode: 128
        finishedAt: "2024-03-20T05:50:47Z"
        message: |-
          failed to start containerd task "6bd1f6d89c84f75a4b516d12ffc1282b89422c569dd3c88cc1e8ef199a46d1ba": OCI runtime restore failed: criu failed: type NOTIFY errno 0
          log file: /var/lib/containerd/io.containerd.runtime.v1.linux/k8s.io/6bd1f6d89c84f75a4b516d12ffc1282b89422c569dd3c88cc1e8ef199a46d1ba/restore.log: unknown
        reason: StartError
        startedAt: "1970-01-01T00:00:00Z"
    name: netserver-container
    ready: false
    restartCount: 26
    started: false
    state:
      waiting:
        message: 'failed to reserve container name "netserver-container_netserver-pod-migration-7_default_5421fed8-7ffc-4276-b605-a4e2257a759b_27":
          name "netserver-container_netserver-pod-migration-7_default_5421fed8-7ffc-4276-b605-a4e2257a759b_27"
          is reserved for "16b5e46878c1407dd7b4ddc6dc6b81a149a45e9ebf0c5a49aecf85941163e2e8"'
        reason: CreateContainerError
  hostIP: 192.168.31.49
  phase: Running
  podIP: 10.244.2.73
  podIPs:
  - ip: 10.244.2.73
  qosClass: BestEffort
  startTime: "2024-03-20T05:50:20Z"

And I changed the source code to sleep(500), but it doesn't work. @vutuong Sorry to disturb you.

@120L020314 You are welcome. Maybe there really is a bug here.
One question: does this setup work well with other apps and only fail with your app?
One more step to check the problem: I wrote an example YAML file to deploy a new pod from checkpoint data here:
https://github.com/SSU-DCN/podmigration-operator/blob/main/config/samples/podmig_v1_restore.yaml
Could you please try to modify it based on your own pod manifest YAML file, then apply it?
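As a rough sketch of how it might be adapted (the annotation names and the checkpoint path are the ones visible on the restored pod earlier in this thread, and the container spec is taken from your original manifest; the pod name is hypothetical, so please align the rest with the structure of podmig_v1_restore.yaml):

apiVersion: v1
kind: Pod
metadata:
  name: netserver-pod-restore   # hypothetical name for the restored copy
  labels:
    app: netserver-pod
  annotations:
    snapshotPolicy: restore
    snapshotPath: /var/lib/kubelet/migration/kkk/netserver
spec:
  nodeSelector:
    kubernetes.io/hostname: agent2
  containers:
  - name: netserver-container
    image: zhanglongyao/netserver:v1
    ports:
    - containerPort: 12345
      protocol: UDP
    volumeMounts:
    - name: nfs-volume
      mountPath: /app
  volumes:
  - name: nfs-volume
    hostPath:
      path: /nfs/data/01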

Oh, sorry, I went to the agent2 node and found the restore.log. Its contents look like this:
(00.070390) 1: Error (criu/files-reg.c:1831): Can't open file tmp/hsperfdata_root/1 on restore: No such file or directory
(00.070395) 1: Error (criu/files-reg.c:1767): Can't open file tmp/hsperfdata_root/1: No such file or directory
(00.070397) 1: Error (criu/mem.c:1383): `- Can't open vma
(00.103003) mnt: Switching to new ns to clean ghosts
(00.103362) Error (criu/cr-restore.c:2397): Restoring FAILED.
I think this is the reason why the restore failed, but I don't know what those files are. Do you know about that? Thank you very much. @vutuong

[screenshot]

Thank you for your prompt and patient response! I solved my problem by changing my image. The problem was that criu could not reopen the file tmp/hsperfdata_root/1 during restore, so I changed my image so that it does not create that file (it doesn't matter for my app). Then I tried to migrate again, and it works. Thank you for your teaching; I learned how to read the containerd logs and criu's restore.log!
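For anyone hitting the same error: the tmp/hsperfdata_<user> files are created by the JVM's performance counters. Assuming the netserver image runs on a JVM (an assumption, not something confirmed in this thread), an alternative to rebuilding the image is to disable those counters via JAVA_TOOL_OPTIONS in the pod spec, for example:

apiVersion: v1
kind: Pod
metadata:
  name: netserver-pod
  labels:
    app: netserver-pod
spec:
  containers:
  - name: netserver-container
    image: zhanglongyao/netserver:v1
    env:
    # Hypothetical tweak, assuming a JVM-based image: -XX:-UsePerfData stops
    # the JVM from creating /tmp/hsperfdata_<user>/<pid>, so criu has no
    # missing per-process temp file to reopen on restore.
    - name: JAVA_TOOL_OPTIONS
      value: "-XX:-UsePerfData"
    ports:
    - containerPort: 12345
      protocol: UDP
    volumeMounts:
    - name: nfs-volume
      mountPath: /app
  volumes:
  - name: nfs-volume
    hostPath:
      path: /nfs/data/01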
[screenshot]
@vutuong