SSU-DCN/podmigration-operator

Operator Fails to restore pod

Closed this issue · 10 comments

Hello,

I am trying to use the code you have in this repository. I have followed the setup steps indicated in the guide. But I noticed the following:

  1. live migration does not give the same results as shown in the video (it does not do anything)

  2. the checkpoint only works when annotating through kubectl, not through the plugin

  3. when I restore a pod, the status shows up as ExitCode:0 and nothing seems to be running in the restored container.

I would appreciate any pointers related to these issues: are there any modifications I have to make in the code (changing some hardcoded paths, for example)? I would also appreciate it if you could pinpoint which version of Docker was used in the demo video.

Thank you!

Thank you for using my repo. Can you please share some logs, which would help me check what the problem with your installation is? Please also check this closed issue for some info: #2

P/s: please check the document again, because my implementation is based on a Kubernetes cluster with the containerd runtime, not Docker. You can find how to init a Kubernetes cluster here:
https://github.com/SSU-DCN/podmigration-operator/blob/main/init-cluster-containerd-CRIU.md

@59nezytic Can you please help him fix it?

Ok!
I'll help him as much as I can.

Hello,

Thank you both for your prompt response.

When I use "kubectl checkpoint..." I get the following error:

Operation cannot be fulfilled on pods "petstore": the object has been modified; please apply your changes to the latest version and try again
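
As a side note, this "object has been modified" message is the standard Kubernetes optimistic-concurrency conflict you get when a client updates a pod using a stale resourceVersion. I don't know exactly how the plugin applies the annotation, but a common client-go pattern would be to wrap the update in retry.RetryOnConflict. A rough sketch with placeholder names (the annotation keys below are just examples, not necessarily what the operator expects):

```go
// Sketch only: retry a pod-annotation update on conflict with client-go.
// namespace, podName and snapshotPath are placeholder variables.
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

func annotateCheckpoint(clientset *kubernetes.Clientset, namespace, podName, snapshotPath string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// Re-read the latest version of the pod on every attempt so the
		// update is applied against the current resourceVersion.
		pod, err := clientset.CoreV1().Pods(namespace).Get(context.TODO(), podName, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if pod.Annotations == nil {
			pod.Annotations = map[string]string{}
		}
		pod.Annotations["snapshotPolicy"] = "checkpoint" // placeholder key/value
		pod.Annotations["snapshotPath"] = snapshotPath   // placeholder key
		_, err = clientset.CoreV1().Pods(namespace).Update(context.TODO(), pod, metav1.UpdateOptions{})
		return err
	})
}
```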

kubectl migrate gives me the following, and nothing else happens:

&{petstore-migration-controller-53 k8s-node2 0 &LabelSelector{MatchLabels:map[string]string{podmig: dcn,},MatchExpressions:[]LabelSelectorRequirement{},} live-migration  petstore {{      0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] []  []} {[] [] [] []  <nil> <nil>  map[]   <nil>  false false false <nil> nil []   nil  [] []  <nil> nil [] <nil> <nil> <nil> map[] [] <nil> }} <nil>}
petstore
response Status: 200 OK
{
 "name": "petstore-migration-controller-53",
 "destHost": "k8s-node2",
 "replicas": 0,
 "selector": {
  "matchLabels": {
   "podmig": "dcn"
  }
 },
 "action": "live-migration",
 "snapshotPath": "",
 "sourcePod": "petstore",
 "template": {
  "metadata": {
   "creationTimestamp": null
  },
  "spec": {
   "containers": null
  }
 },
 "status": {
  "state": "",
  "currentRevision": "",
  "activePod": ""
 }
}

I had another issue with kubectl-checkpoint, but I managed to fix it: the path to the kubeconfig is hardcoded to a very specific path on line 88 of the checkpoint plugin (it would be better to default to ~/.kube/config). I had to change that path and rebuild the plugin.
I am wondering whether there are paths in kubelet or containerd that need to be changed as well before rebuilding.
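
In case it helps anyone rebuilding the plugin, here is a minimal sketch, assuming the plugin builds its client with client-go, of resolving the kubeconfig through the standard loading rules (so $KUBECONFIG is honoured and ~/.kube/config is the fallback) instead of a hardcoded path:

```go
// Sketch: build a rest.Config using client-go's default kubeconfig loading
// rules ($KUBECONFIG if set, otherwise ~/.kube/config).
package sketch

import (
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
)

func buildConfig() (*rest.Config, error) {
	rules := clientcmd.NewDefaultClientConfigLoadingRules()
	// rules.ExplicitPath could be wired to a --kubeconfig flag if the plugin adds one.
	return clientcmd.NewNonInteractiveDeferredLoadingClientConfig(
		rules, &clientcmd.ConfigOverrides{}).ClientConfig()
}
```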

One more thing: the checkpoint taken directly with criu looks very different from the one taken by the kubectl plugin. The plugin's checkpoint contains only one small file, while a checkpoint taken by criu directly on a running container contains many .img files.

There may be a problem with your installation process.
Can you share your operating environment?

@ojebbar Firstly, please let @59nezytic help you recheck your installation.
Secondly, in my controller source code I only check whether the descriptors.json file exists (lines 144-153). If the file exists, the loop breaks, and in the next lines the controller triggers the restore process on the destination node.
Different containers have different files in their checkpoint folders; the only file they all have in common is descriptors.json.
Maybe the logic is wrong here: if your container data is too big, the checkpoint information needs more time to be fully written, even though descriptors.json has already been created.
If you have any idea for a better way to check whether the checkpoint info is fully created, please help contribute to our implementation.
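
One possible direction, just a sketch (it assumes the checkpoint files land in a single directory per container, and the polling intervals are arbitrary): instead of returning as soon as descriptors.json exists, poll the checkpoint folder until its file count and total size stop changing between two consecutive polls.

```go
// Sketch of a "is the checkpoint fully written?" heuristic: wait for
// descriptors.json, then wait until the directory contents stop growing.
package sketch

import (
	"os"
	"path/filepath"
	"time"
)

// checkpointStable reports whether dir looks complete: descriptors.json is
// present and the directory has stopped changing between two polls.
func checkpointStable(dir string, interval, maxWait time.Duration) bool {
	deadline := time.Now().Add(maxWait)
	lastCount, lastSize := -1, int64(-1)
	for time.Now().Before(deadline) {
		if _, err := os.Stat(filepath.Join(dir, "descriptors.json")); err != nil {
			time.Sleep(interval)
			continue
		}
		count, size := scanDir(dir)
		if count == lastCount && size == lastSize {
			return true // no change since the previous poll: assume the dump finished
		}
		lastCount, lastSize = count, size
		time.Sleep(interval)
	}
	return false
}

// scanDir returns the number of regular files under dir and their total size.
func scanDir(dir string) (int, int64) {
	count, size := 0, int64(0)
	filepath.Walk(dir, func(_ string, info os.FileInfo, err error) error {
		if err == nil && !info.IsDir() {
			count++
			size += info.Size()
		}
		return nil
	})
	return count, size
}
```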

@59nezytic my setup is Ubuntu 18.04, and I used the same versions of containerd, Kubernetes, and runc as indicated in the guide [1]. I did not notice any issues during the installation, but as @vutuong said, the problem may be that I am using Docker. I will rebuild a fresh cluster this afternoon and give you more updates.

[1] https://github.com/SSU-DCN/podmigration-operator/blob/main/init-cluster-containerd-CRIU.md

Hello,

I am here just to update you on the situation.

First, thank you @vutuong for mentioning that the operator only works with containerd; apparently my problem was that I was using Docker as the runtime. Once I switched to containerd, I managed to reproduce your results.

Second, it seems that when the pods are not created manually but via a Deployment, triggering a migration makes the migration operator create a new pod and the Deployment controller create one as well, so you end up with more endpoints for the service than necessary. If you can give me some hints on how to fix that (where I should look in order to figure out a fix), I will be glad to take care of it.

Thank you again for your help and your assistance.

@ojebbar
Firstly, thank you for pointing out the problem. You are correct in pointing out this limitation. You are welcome to help us contribute a fix.

  • In my implementation, I create a new controller each time a migration process is triggered, to control that process and manage the new pod. The implementation is missing the part that deletes the previous migration controller, since I only wrote the code to delete the old pod.
  • Each pod is under the control of only one controller (Deployment, StatefulSet, ...). So the operator should sit at a higher layer than the controller which creates the pod.
  • You can extend this implementation to check which controller created the pod, create that object with checkpoint/restore annotations, and delete the old object after the migration process finishes (see the sketch after this list). As a result, the migrate command would become something like: kubectl migrate (deployment | statefulset | pod ...) podName target. But in a single cluster, the migration really only makes sense for a deployment with one pod, because the pods are distributed across different nodes. Across multiple clusters, you could migrate the whole deployment.
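
As a starting point, here is a minimal sketch, assuming client-go (the function and variable names are placeholders, not existing operator code), of finding the controller that owns a pod by walking its ownerReferences (Pod -> ReplicaSet -> Deployment):

```go
// Sketch: resolve the workload that controls a pod via ownerReferences.
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ownerKind returns the kind and name of the workload controlling the pod,
// or "Pod"/podName if the pod is not managed by any controller.
func ownerKind(clientset *kubernetes.Clientset, namespace, podName string) (string, string, error) {
	pod, err := clientset.CoreV1().Pods(namespace).Get(context.TODO(), podName, metav1.GetOptions{})
	if err != nil {
		return "", "", err
	}
	owner := metav1.GetControllerOf(pod)
	if owner == nil {
		return "Pod", pod.Name, nil
	}
	// A Deployment owns pods indirectly through a ReplicaSet, so follow one more hop.
	if owner.Kind == "ReplicaSet" {
		rs, err := clientset.AppsV1().ReplicaSets(namespace).Get(context.TODO(), owner.Name, metav1.GetOptions{})
		if err != nil {
			return "", "", err
		}
		if rsOwner := metav1.GetControllerOf(rs); rsOwner != nil {
			return rsOwner.Kind, rsOwner.Name, nil
		}
	}
	return owner.Kind, owner.Name, nil
}
```

The migrate command could then branch on the returned kind to decide whether to recreate a Deployment, a StatefulSet, or a bare Pod on the destination, and delete the old object afterwards.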

@vutuong Thank you for the hints, I will look into it and get back to you.