Data lost after reboot
daresheep opened this issue · 16 comments
Hello,
I am using csi-driver-host-path v1.5.0.
After rebooting the system, both pods crashed.
Describing the pods shows the following events:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 20s default-scheduler Successfully assigned default/virt-launcher-firewall-wlh69 to ceph1
Normal SuccessfulAttachVolume 20s attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-df463bd3-488b-4e03-b828-5923290f6cdb"
Warning FailedMount 0s (x4 over 4s) kubelet MountVolume.SetUp failed for volume "pvc-df463bd3-488b-4e03-b828-5923290f6cdb" : rpc error: code = NotFound desc = volume id d2ec3050-7782-11eb-b03e-46ba88f41811 does not exist in the volumes list
After the reboot, the mount information is gone, but discoveryExistingVolumes() rebuilds its view of the volumes from the output of "findmnt".
As a result, all of the volume information is lost.
Does anyone have another idea?
Thank you.
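[Editor's note] To make the failure mode concrete, below is a minimal Go sketch of findmnt-based discovery. This is not the driver's actual discoveryExistingVolumes code, and the /csi-data-dir path is only the driver's usual default, assumed here for illustration. Because this approach only sees mounts that currently exist, it returns an empty list after a node reboot even though the volume data directories are still on disk.

```go
// Minimal sketch (not the driver's actual code) of discovering volumes by
// parsing `findmnt --json`. After a node reboot the bind mounts no longer
// exist, so this approach finds nothing and the volume list is lost.
package main

import (
	"encoding/json"
	"fmt"
	"os/exec"
	"strings"
)

// findmntEntry mirrors the fields emitted by `findmnt --json` that matter here.
type findmntEntry struct {
	Target   string         `json:"target"`
	Source   string         `json:"source"`
	Children []findmntEntry `json:"children,omitempty"`
}

type findmntOutput struct {
	Filesystems []findmntEntry `json:"filesystems"`
}

// listMounts returns all mount targets whose source refers to dataRoot,
// e.g. /csi-data-dir, where the hostpath driver keeps its volume directories.
func listMounts(dataRoot string) ([]string, error) {
	out, err := exec.Command("findmnt", "--json").Output()
	if err != nil {
		return nil, fmt.Errorf("findmnt failed: %w", err)
	}
	var parsed findmntOutput
	if err := json.Unmarshal(out, &parsed); err != nil {
		return nil, fmt.Errorf("parsing findmnt output: %w", err)
	}
	var targets []string
	var walk func([]findmntEntry)
	walk = func(entries []findmntEntry) {
		for _, e := range entries {
			if strings.Contains(e.Source, dataRoot) {
				targets = append(targets, e.Target)
			}
			walk(e.Children)
		}
	}
	walk(parsed.Filesystems)
	return targets, nil
}

func main() {
	// After a reboot this prints an empty list: the bind mounts are gone,
	// even though the volume data directories still exist on disk.
	mounts, err := listMounts("/csi-data-dir")
	fmt.Println(mounts, err)
}
```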
I am using csi-driver-host-path v1.5.0.
After the reboot you are still using that version? There were some changes in the code in v1.6.0, but nothing that should have made things worse. Just want to be sure.
Looking at the code, I suspect it was never meant to survive a reboot. Remember, this is a demo driver. It doesn't support all use-cases of a real driver.
Having said that, a PR which enhances the tracking of local volumes and snapshots would be welcome. v1.6.0 introduced capacity simulation, and volume sizes are known to get lost when restarting the pod.
/help
@pohly:
This request has been marked as needing help from a contributor.
Please ensure the request meets the requirements listed here.
If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Thank you for your help!
After the reboot you are still using that version?
Yes, I was already using v1.5.0.
I just upgraded to v1.6.0 and the issue still exists.
I think I need to set up another CSI driver to handle this.
Thanks again.
I encountered this issue too with the latest release (v1.6.2). I looked at the code and I think I've found the reason: the func discoveryExistingVolumes
cannot be used to discover existing volumes after a reboot. It can only survive a pod restart, not a node reboot. I managed to get it to work by reading the existing volumes from the PersistentVolumes.
@pohly Could you please take a look at my code and give me any suggestions? If you agree, I can open a PR (of course I will refine my code and add some unit tests). Thanks very much!
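[Editor's note] For reference, here is a rough client-go sketch of the approach described above. It is not the actual patch; the driver name hostpath.csi.k8s.io and the use of the CSI volume handle as the volume ID are assumptions made for illustration.

```go
// Sketch (not the actual patch) of recovering the volume list from the API
// server instead of from findmnt: list all PersistentVolumes and keep those
// provisioned by this CSI driver, using the volume handle as the volume ID.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

const driverName = "hostpath.csi.k8s.io" // assumed driver name

func recoverVolumeIDs(ctx context.Context, client kubernetes.Interface) ([]string, error) {
	pvs, err := client.CoreV1().PersistentVolumes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, fmt.Errorf("listing PersistentVolumes: %w", err)
	}
	var ids []string
	for _, pv := range pvs.Items {
		csi := pv.Spec.PersistentVolumeSource.CSI
		if csi != nil && csi.Driver == driverName {
			ids = append(ids, csi.VolumeHandle)
		}
	}
	return ids, nil
}

func main() {
	// The driver pod runs in-cluster, so the in-cluster config is assumed here.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ids, err := recoverVolumeIDs(context.Background(), client)
	fmt.Println(ids, err)
}
```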
That function is also broken in other ways. I ran into that when trying to update the driver in Kubernetes E2E testing:
#210 (comment)
Let's use this issue to track that rewrite of the state saving code.
/reopen
/cc @fengzixu
@pohly: Reopened this issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@fengzixu you said that you wanted to work on this. Can you give an estimate when you might be done? This is relatively urgent because it blocks using the 1.5 and 1.6 driver releases for testing.
@fengzixu you said that you wanted to work on this. Can you give an estimate when you might be done? This is relatively urgent because it blocks using the 1.5 and 1.6 driver releases for testing.
@pohly I have been working on it. Is it okay for me to submit the fix PR next Monday? If there is any change to that timeline, I will sync up with you in this issue.
Sounds good.
Update: I am working on it today, but my workload is a little heavy. I will sync up here on whether I can submit this PR by tonight.
Recovering state after a driver restart was fixed in #277.
However, the original ask in this issue was to also support host reboots. That's a bit different because mounted volumes become unmounted and need to be mounted again.
I don't think the hostpath driver needs to support that. It is clearly marked as "don't use in production" and I prefer to not add code that isn't needed for its original purpose (demos, E2E testing).
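[Editor's note] As a rough illustration of what the state saving amounts to (a sketch with hypothetical types, not the code from #277): keep the volume metadata in a JSON file on a host path so a restarted driver pod can reload it. Surviving a full node reboot would additionally require re-creating the lost mounts, which is the part that is considered out of scope above.

```go
// Minimal sketch (hypothetical types, not the code from #277) of persisting
// volume metadata to a file on the host so it survives a driver pod restart.
package main

import (
	"encoding/json"
	"errors"
	"os"
)

// volumeState is a hypothetical record of what the driver needs to remember.
type volumeState struct {
	ID        string `json:"id"`
	Name      string `json:"name"`
	SizeBytes int64  `json:"sizeBytes"`
	Path      string `json:"path"`
}

// saveState writes all known volumes to a JSON file (e.g. on the same host
// path that already holds the volume data directories).
func saveState(file string, volumes map[string]volumeState) error {
	data, err := json.Marshal(volumes)
	if err != nil {
		return err
	}
	return os.WriteFile(file, data, 0600)
}

// loadState restores the volume map on startup; a missing file simply means
// there is no previous state yet.
func loadState(file string) (map[string]volumeState, error) {
	data, err := os.ReadFile(file)
	if errors.Is(err, os.ErrNotExist) {
		return map[string]volumeState{}, nil
	}
	if err != nil {
		return nil, err
	}
	volumes := map[string]volumeState{}
	if err := json.Unmarshal(data, &volumes); err != nil {
		return nil, err
	}
	return volumes, nil
}

func main() {
	// Usage sketch: load on startup, mutate in memory, save after each change.
	vols, _ := loadState("/csi-data-dir/state.json")
	vols["demo"] = volumeState{ID: "demo", Name: "pvc-demo", SizeBytes: 1 << 20, Path: "/csi-data-dir/demo"}
	_ = saveState("/csi-data-dir/state.json", vols)
}
```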
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue or PR with /reopen
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closing this issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.