kubernetes-csi/csi-driver-host-path

Data lost after reboot

daresheep opened this issue · 16 comments

Hello,

I am using csi-driver-host-path v1.5.0.

After rebooting the system, both pods crashed.

Describing the pods shows the following events:

Events:
  Type     Reason                  Age              From                     Message
  ----     ------                  ----             ----                     -------
  Normal   Scheduled               20s              default-scheduler        Successfully assigned default/virt-launcher-firewall-wlh69 to ceph1
  Normal   SuccessfulAttachVolume  20s              attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-df463bd3-488b-4e03-b828-5923290f6cdb"
  Warning  FailedMount             0s (x4 over 4s)  kubelet                  MountVolume.SetUp failed for volume "pvc-df463bd3-488b-4e03-b828-5923290f6cdb" : rpc error: code = NotFound desc = volume id d2ec3050-7782-11eb-b03e-46ba88f41811 does not exist in the volumes list

After the reboot, the mount information was lost, but discoveryExistingVolumes() reads its data from "findmnt".

As a result, all volume information is lost.
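To illustrate the failure mode, here is a minimal sketch of mount-table-based discovery. It is not the driver's actual code: the function name `discoverFromMounts`, the simplified "source target" line format standing in for findmnt output, and the data directory `/csi-data-dir` are all assumptions made for this example.

```go
package main

import (
	"bufio"
	"fmt"
	"path/filepath"
	"strings"
)

// discoverFromMounts rebuilds the volume list purely from mount-table text.
// Each line is "<source> <target>", a simplified stand-in for findmnt output.
// dataRoot is a hypothetical directory that holds one subdirectory per volume.
func discoverFromMounts(mountTable, dataRoot string) map[string]string {
	volumes := make(map[string]string) // volume ID -> mount target
	scanner := bufio.NewScanner(strings.NewReader(mountTable))
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) < 2 {
			continue // skip blank or malformed lines
		}
		source, target := fields[0], fields[1]
		if filepath.Dir(source) != dataRoot {
			continue // not one of our volumes
		}
		volumes[filepath.Base(source)] = target
	}
	return volumes
}

func main() {
	const dataRoot = "/csi-data-dir" // assumed location, for illustration only
	const volumeID = "d2ec3050-7782-11eb-b03e-46ba88f41811"

	// Before the reboot the bind mount exists, so the volume is discovered.
	before := dataRoot + "/" + volumeID + " /var/lib/kubelet/pods/example/mount\n"
	_, found := discoverFromMounts(before, dataRoot)[volumeID]
	fmt.Println("known before reboot:", found)

	// After the reboot no mounts remain, so the rebuilt list is empty and
	// any lookup for the volume ID fails, even if the data is still on disk.
	_, found = discoverFromMounts("", dataRoot)[volumeID]
	fmt.Println("known after reboot:", found)
}
```

With an empty mount table the rebuilt list is empty, which would be consistent with the "does not exist in the volumes list" error above.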

Does anyone have another idea?

Thank you.

@pohly

Could you give me some help? Thanks.

pohly commented

I am using csi-driver-host-path v1.5.0.

After the reboot you are still using that version? There were some changes in the code in v1.6.0, but nothing that should have made things worse. Just want to be sure.

Looking at the code, I suspect it was never meant to survive a reboot. Remember, this is a demo driver. It doesn't support all use-cases of a real driver.

Having said that, a PR which enhances the tracking of local volumes and snapshots would be welcome. v1.6.0 introduced capacity simulation, and the sizes of volumes are known to get lost when restarting the pod.

/help

@pohly:
This request has been marked as needing help from a contributor.

Thank you for your help!

After the reboot you are still using that version? 

Yes, I was still using v1.5.0.

I just upgraded to v1.6.0, and the issue still exists.

I think I need to set up a different CSI driver to handle this.

Thanks again.

I ran into this issue too with the latest release (v1.6.2). I looked at the code and I think I have found the reason: the function discoveryExistingVolumes cannot discover existing volumes after a reboot. It can only survive a pod restart, not a node reboot. I managed to get it working by retrieving the existing volumes from the PersistentVolumes.

@pohly Could you please take a look at my code and give me any suggestions? If you agree, I can open a PR (of course, I will refine my code and add some unit tests). Thanks very much!
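For readers following along, the PersistentVolume-based idea could look roughly like the sketch below. It is illustrative only and is not the code offered for review. The types `persistentVolume` and `volumeRecord`, the function `volumesFromPVs`, the driver name `hostpath.csi.k8s.io`, and the 1 GiB capacity are assumptions. In a real driver the PV list would come from the Kubernetes API rather than a literal slice.

```go
package main

import "fmt"

// persistentVolume is a trimmed-down, hypothetical stand-in for the fields of
// a Kubernetes PersistentVolume that matter here: the CSI driver name, the
// CSI volume handle, and the capacity.
type persistentVolume struct {
	Name          string
	CSIDriver     string
	VolumeHandle  string
	CapacityBytes int64
}

// volumeRecord is the hypothetical in-memory entry the driver keeps per volume.
type volumeRecord struct {
	ID        string
	PVName    string
	SizeBytes int64
}

// volumesFromPVs rebuilds the volume list from PersistentVolumes that belong
// to driverName. Unlike the mount table, these objects survive a node reboot.
func volumesFromPVs(pvs []persistentVolume, driverName string) map[string]volumeRecord {
	volumes := make(map[string]volumeRecord)
	for _, pv := range pvs {
		if pv.CSIDriver != driverName || pv.VolumeHandle == "" {
			continue // provisioned by another driver, or not a CSI volume
		}
		volumes[pv.VolumeHandle] = volumeRecord{
			ID:        pv.VolumeHandle,
			PVName:    pv.Name,
			SizeBytes: pv.CapacityBytes,
		}
	}
	return volumes
}

func main() {
	const driverName = "hostpath.csi.k8s.io" // assumed driver name

	pvs := []persistentVolume{
		{
			Name:          "pvc-df463bd3-488b-4e03-b828-5923290f6cdb",
			CSIDriver:     driverName,
			VolumeHandle:  "d2ec3050-7782-11eb-b03e-46ba88f41811",
			CapacityBytes: 1 << 30, // assumed size, for illustration only
		},
		{Name: "pv-from-another-driver", CSIDriver: "other.example.com", VolumeHandle: "x"},
	}

	for id, vol := range volumesFromPVs(pvs, driverName) {
		fmt.Printf("restored volume %s (PV %s, %d bytes)\n", id, vol.PVName, vol.SizeBytes)
	}
}
```

A design caveat with this approach: it makes the node plugin depend on access to the API server, and it only restores the bookkeeping. Volumes that were mounted before the reboot still have to be mounted again.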

pohly commented

That function is also broken in other ways. I ran into that when trying to update the driver in Kubernetes E2E testing:
#210 (comment)

Let's use this issue to track that rewrite of the state saving code.

/reopen
/cc @fengzixu

@pohly: Reopened this issue.

pohly commented

@fengzixu you said that you wanted to work on this. Can you give an estimate when you might be done? This is relatively urgent because it blocks using the 1.5 and 1.6 driver releases for testing.

@fengzixu you said that you wanted to work on this. Can you give an estimate when you might be done? This is relatively urgent because it blocks using the 1.5 and 1.6 driver releases for testing.

@pohly I have been working on it. Is it OK if I submit the PR with the fix next Monday? If the timing changes, I will update you in this issue.

pohly commented

Sounds good.

Update: I am working on it today, but my workload is a bit heavy. I will let you know whether I can submit the PR by tonight.

pohly commented

Recovering state after a driver restart was fixed in #277.

However, the original ask in this issue was to also support host reboots. That's a bit different because mounted volumes become unmounted and need to be mounted again.

I don't think the hostpath driver needs to support that. It is clearly marked as "don't use in production" and I prefer to not add code that isn't needed for its original purpose (demos, E2E testing).

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

/close

@k8s-triage-robot: Closing this issue.
