geesefs 0.38.3 crash
enp opened this issue · 11 comments
Installed with helm install --namespace s3 --set secret.accessKey=<...> --set secret.secretKey=<..> csi-s3 yandex-s3/csi-s3 in Yandex Managed Kubernetes + Yandex Object Storage with default options.
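For context, a minimal PVC of the kind typically used with this driver; the csi-s3 storage class name and ReadWriteMany access mode follow the chart's documented defaults, and the claim name, namespace and size here are placeholders:

kubectl apply -n s3 -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: geesefs-test-pvc        # placeholder name
spec:
  accessModes:
    - ReadWriteMany             # S3-backed volumes are shared by design
  resources:
    requests:
      storage: 5Gi              # nominal size for object storage
  storageClassName: csi-s3      # storage class installed by the chart by default
EOF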
It works in simple cases but sometimes crashes on parallel write processes (I can't reproduce it every time) with this error:
2023/12/22 10:23:10.105051 s3.WARNING Conflict detected (inode 437): failed to copy xyz/release to xyz/release.old: NotFound: Not Found
status code: 404, request id: cd903bcf4d54ff5c, host id: . File is removed remotely, dropping cache
panic: xyz.removeName(release.old) but child not found: 5
goroutine 79859 [running]:
github.com/yandex-cloud/geesefs/internal.(*Inode).removeChildUnlocked(0xc000928000, 0xc0030e9000)
/home/runner/work/geesefs/geesefs/internal/dir.go:983 +0x45f
github.com/yandex-cloud/geesefs/internal.(*Inode).SendUpload.func1()
/home/runner/work/geesefs/geesefs/internal/file.go:1364 +0x4ef
created by github.com/yandex-cloud/geesefs/internal.(*Inode).SendUpload
/home/runner/work/geesefs/geesefs/internal/file.go:1330 +0x38d
geesefs-enp_2dstorage.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
I found the error string in the code - https://github.com/yandex-cloud/geesefs/blob/v0.38.3/internal/dir.go#L983 - so what does this case mean, and why does geesefs crash on it?
A more general question: how should the end user handle other cases where a panic occurs?
This seems like a nontrivial rename conflict, i.e. you rename the same file to the same destination in parallel multiple times from independent mountpoints, or maybe even multiple times in one mountpoint.
I tried to reproduce it with for i in {1..100}; do mv renames/$i renames2/$i; done
in two mountpoints in parallel, but it didn't reproduce. So I added a symptomatic fix in the master branch to prevent the panic, but it would be better if you could create a reproduction script for this bug...
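Something along these lines is the kind of race I mean; the mountpoint paths are placeholders and the file names just mirror the log above, so this is only a sketch, not a confirmed reproducer:

# mnt1 and mnt2 are two independent geesefs mountpoints of the same bucket (placeholder paths).
# One of the two renames is expected to fail; the point is only to race them against each other.
for i in {1..100}; do
  cp /etc/hostname mnt1/xyz/release       # recreate the source file
  mv mnt1/xyz/release mnt1/xyz/release.old &   # rename from the first mountpoint
  mv mnt2/xyz/release mnt2/xyz/release.old &   # same rename from the second mountpoint
  wait
done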
A more general question: how should the end user handle other cases where a panic occurs?
Unmount and remount the FS manually, I think... There's no good way to restore dead FUSE mounts in the CSI driver (I tried some options); the kernel leaves them in a broken "transport endpoint is not connected" state and Kubernetes can't repair them - it would at least need to unmount them first, but it fails to even check the mountpoint while it's broken.
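As a rough sketch of the manual recovery (the mountpoint path, bucket name and geesefs options are placeholders; adapt them to your setup):

# A dead FUSE mount typically fails stat with "Transport endpoint is not connected"
MNT=/mnt/s3                  # placeholder mountpoint
if ! stat "$MNT" >/dev/null 2>&1; then
  fusermount -u "$MNT" 2>/dev/null || umount -l "$MNT"   # force-detach the broken mount
  geesefs my-bucket "$MNT"                               # remount; bucket name is a placeholder
fi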