Extend the migration code
Closed this issue · 7 comments
Hi @vutuong
For our work, we would like to extend your code to implement a pre-dump approach and later on also volume migration. Therefore I tried to understand your changes in the code but a few open questions remain. Perhaps you still know the code and can give me some help.
- There are a lot of things concerning Docker and DockerShim, as well as files with "fake" in their names. Since you don't use Docker (anymore), I assumed that these files are not important to the final solution and can be ignored.
- I created a basic diagram of some of the important files of the migrate command. I have two main open points about it.
- On the right side I don't get where the command is going. I think somewhere should be the CRIU commands, which are an important part that I have to change, but unfortunately, I couldn't locate them. If these are just functions called in containerd, where did you find the API for the CRIU commands?
- In the kubelet part there is also some confusion about which path the command follows and therefore which files are important to me. For example, both "kuberuntime_container.go" and "instrumented_service.go" call internalapi.RuntimeService.CheckPointContainer, if I understand it correctly.
Hi @Minninnewah,
- For the first question: the command does not go directly to CRIU. Here, containerd-cri is just a gRPC server that listens for requests from the client (kubelet). Please check https://kubernetes.io/blog/2018/05/24/kubernetes-containerd-integration-goes-ga/ to understand the difference between containerd-cri and containerd. Basically, after containerd-cri receives a request from kubelet, the request is processed by containerd itself. Container runtimes such as Docker or containerd already have functions that talk to CRIU to request checkpoint and restore, independently of k8s.
- For the second question: the missing piece is that the gRPC connection between the client (kubelet) and containerd-cri has no methods for checkpoint and restore. In this PoC, we use the work from https://github.com/schrej/containerd-cri, which extends the gRPC server side with checkpoint/restore methods that map to the checkpoint/restore tasks in containerd. In addition, we extended the gRPC client side (kubelet) with matching checkpoint/restore methods. You can find them in
pkg/kubelet/cri/remote/remote_runtime.go
- As a result, we didn't care much about how containerd talks to CRIU. If you want to add pre-dump at the container runtime level, you have to look into the source code of containerd itself. Another suggestion: to the best of my knowledge, you can simply request the checkpoint multiple times at the kubelet level, rsync the checkpoint data to the destination node each time, and then restore there.
In this repo, https://github.com/oiasam/ARNAB, the author shows how he implemented pre-dump for his paper https://ieeexplore.ieee.org/document/8907466. In that implementation, they wrote a simple operator/controller as a bash script that requests an LXC container to checkpoint and restore. The same idea applies here: we have k8s send the request to containerd ...
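The "checkpoint several times and rsync each result" suggestion could be sketched roughly as below. This is only an illustration under assumptions: checkpoint_once and sync_checkpoint are hypothetical hooks, not commands from this repo, and the real checkpoint trigger has to be filled in for your setup. It is also not CRIU's own pre-dump; it just repeats full checkpoints and lets rsync transfer only the files that changed between rounds.

```shell
#!/bin/sh
# Hypothetical hook: replace the body with whatever triggers a checkpoint
# in your setup (the thread does not name a concrete command for this).
# $1 = directory the checkpoint data should be written to
checkpoint_once() {
    echo "checkpoint requested into $1" >&2
}

# Hypothetical hook: copy the checkpoint directory to the destination.
# rsync only transfers files that changed since the previous round.
# $1 = local checkpoint dir, $2 = destination (a local path or host:/path)
sync_checkpoint() {
    rsync -a --delete "$1/" "$2/"
}

# Run N checkpoint+sync rounds; the last round is the one to restore from.
# $1 = number of rounds, $2 = local checkpoint dir, $3 = destination
precopy_rounds() {
    rounds=$1
    src=$2
    dest=$3
    i=1
    while [ "$i" -le "$rounds" ]; do
        checkpoint_once "$src" || return 1
        sync_checkpoint "$src" "$dest" || return 1
        echo "round $i synced"
        i=$((i + 1))
    done
}
```

A call would look like precopy_rounds 3 /path/to/checkpoint worker2:/path/to/checkpoint, where both paths and the host name are placeholders. Only the final round needs the container to stay stopped until the restore on the destination node.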
@vutuong Thank you for this extended information.
I checked out the containerd code and tried to replace it with my own version (a fork from the containerd commit that was for the release build 1.3.6).
It looks like this works, but only if I use the precompiled containerd-cri:
cd containerd/ #additional
wget https://k8s-pod-migration.obs.eu-de.otc.t-systems.com/v2/containerd
git clone https://github.com/SSU-DCN/podmigration-operator.git
cd podmigration-operator
tar -vxf binaries.tar.bz2
cd custom-binaries/
chmod +x containerd
sudo mv containerd /bin/
But as soon as I replace it with a binary built using the following commands, an error occurs.
git clone https://github.com/Minninnewah/containerd-cri
cd containerd-cri/
go get github.com/containerd/cri/cmd/containerd
make
#sudo make install
sudo -E env "PATH=$PATH" make install
cd _output/
sudo mv containerd /bin/
The setup and the video pod still run fine, but on the migration command I got the following error from the kubelet on worker1:
Do you know what's wrong?
Additionally, I don't understand the "go get" from the containerd repository. Doesn't the version matter, since there is no specific version defined?
Hi @Minninnewah,
Please check whether the containerd service is still running well after you replace the default binary with your own version. You can run commands such as systemctl status containerd and journalctl -xf -u containerd to get the log information.
Remember to run chmod +x containerd before running sudo mv containerd /bin/
FYI: Our PoC looks out of date, since a CRIU maintainer from Red Hat has made some commits to K8s to achieve the migration goal.
https://www.youtube.com/watch?v=wCb1Rfoy7Fk
https://martinheinz.dev/blog/85?fbclid=IwAR072EnO8jQyxPtwtqdPMRLS7Ngi3XrAmllbpnHbngc90cvr5lB2l8QC5ME
@vutuong
With systemctl status containerd on worker1 I can't see anything wrong:
And journalctl -xf -u containerd also didn't show any errors:
The only hint I can find is the log from the last post when using journalctl -xf -u kubelet on worker1.
And in the podmigration operator (the blocking make run command):
Btw, the containerd-cri GitHub repo I'm using is just a fork of yours, so the problem should not be in the code but rather in the compile or move commands. Did you use the same commands as in your description to build the binaries?
cd $HOME/tmp
git clone https://github.com/Minninnewah/containerd-cri
cd containerd-cri/
go get github.com/containerd/cri/cmd/containerd # Not sure about this step, or whether I need to use my own GitHub repo here
make
#sudo make install
sudo -E env "PATH=$PATH" make install
cd _output/
chmod +x containerd
sudo mv containerd /bin/
@vutuong
I did some further investigation and can probably narrow down the error.
When I compare the binaries of the precompiled and the fresh build, they do not match.
Since different container versions (for installing the dependencies) did not affect the build in various tests, it looks like you use a different code base for building the precompiled containerd-cri. Can you tell me which version you used in the build process?
@Minninnewah
Sorry for the late reply.
So, in your case, the pod migration works with the precompiled binary but not with your own build?
Could you please check the md5sum of the two binary files? At that time I pulled the latest version of the library with go get github.com/containerd/cri/cmd/containerd, as I noted in the document.
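A minimal way to do that checksum comparison is sketched below. compare_binaries is a hypothetical helper name and both paths are supplied by the caller. Note that differing checksums alone do not prove a different code base, since two builds of the same source can also differ.

```shell
#!/bin/sh
# Compare two files by MD5 checksum.
# Prints "match" (exit 0) or "differ" (exit 1); exit 2 if a file is unreadable.
# $1, $2 = paths to the two binaries
compare_binaries() {
    if [ ! -r "$1" ] || [ ! -r "$2" ]; then
        echo "usage: compare_binaries FILE1 FILE2" >&2
        return 2
    fi
    # md5sum prints "<hash>  <file>"; keep only the hash
    sum_a=$(md5sum "$1" | cut -d' ' -f1)
    sum_b=$(md5sum "$2" | cut -d' ' -f1)
    if [ "$sum_a" = "$sum_b" ]; then
        echo "match"
    else
        echo "differ"
        return 1
    fi
}
```

For example, compare_binaries custom-binaries/containerd _output/containerd, run before either file is moved to /bin/, would compare the precompiled binary with the fresh build from the steps earlier in this thread.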