kubernetes/git-sync

Add a flag to control worktree locking

bakome opened this issue · 13 comments

bakome commented

Proposal

In some cases GitSync is used with portable or NFS volumes, which by nature can be mounted and unmounted multiple times during the lifetime of the application.

The Git documentation says that in this case, when using NFS or a portable volume, the worktree should be locked:

If the working tree for a linked worktree is stored on a portable device or network share which is not always mounted, you can prevent its administrative files from being pruned by issuing the git worktree lock command, optionally specifying --reason to explain why the worktree is locked.

This feature would prevent the cases explained in #827
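For reference, the proposed lock could be wrapped the way git-sync wraps other git invocations. A minimal Go sketch; the function names and the idea of locking right after worktree creation are assumptions for illustration, not git-sync's actual code:

```go
package main

import (
	"fmt"
	"os/exec"
)

// lockArgs builds the argv for `git worktree lock`, optionally attaching
// a --reason so `git worktree prune` explains why the tree is kept.
// (Hypothetical helper, not part of git-sync.)
func lockArgs(worktreePath, reason string) []string {
	return []string{"worktree", "lock", "--reason", reason, worktreePath}
}

// lockWorktree runs the command from the repo root. (Hypothetical helper.)
func lockWorktree(repoRoot, worktreePath, reason string) error {
	cmd := exec.Command("git", lockArgs(worktreePath, reason)...)
	cmd.Dir = repoRoot
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("git worktree lock failed: %v: %s", err, out)
	}
	return nil
}

func main() {
	// The args that would be passed to the git binary from the repo root.
	fmt.Println(lockArgs("/git/.worktrees/e83c5163", "volume is NFS and may unmount"))
}
```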

bakome commented

I will prepare PR for this implementation.

bakome commented

I understand that they are on the same volume, but that does not appear to work. Somehow Git does not handle it well, and the worktrees are deleted, which leads to the bad loop. Just to mention, I am not talking about locally synced repositories; I was referring to removal of the complete volume containing the ./worktree directory.

To be more specific, this is an example of the case where this happens:

  • The NFS server for the mount is started
  • The GitSync container is started and running
  • Sync is OK, everything is good
  • The NFS server is restarted
  • The GitSync container gets stuck executing a command
  • The NFS server comes back online
  • GitSync continues execution (this is where the problem starts)

As you can see, nothing was changed manually, but I assume that Git somehow messes up the worktrees.
Just to mention, I was able to reproduce this multiple times on Kubernetes and with Docker Compose.

I am not talking about locally synced repositories; I was referring to removal of the complete volume containing the ./worktree directory

One of four things happened:

  1. the underlying volume is blinking, which means the root/.git directory is also gone (so locking the worktree won't matter)

  2. someone actively deleted the worktrees (so locking won't matter); this was my above comment - we can try to recover but I don't think super-heroic measures are justified

  3. a bug in git-sync caused (2), which is why "try to recover" is acceptable to me :)

  4. you have mounted a different volume at $root/.worktrees, which seems like a terrible idea and abuses an implementation detail.

somehow I assume that Git mess up the worktrees

This is what I'd like to understand - I would HOPE that git would hang until NFS comes back, and then proceed. And if that is not true, I would hope that there would be an error. And if that is not true, I would hope that the next sync-period's sanity-check phase would fail, and we would start over. And if THAT is not true, I don't know. It means the worktree exists and passes fsck but is somehow corrupted? How can we detect that?

I'm not saying I don't believe you, just that I don't want to add random git operations unless I know what they do and why they matter.

bakome commented

Yes, I agree; don't worry, I respect a good debate. With more debugging, I believe you are probably right about the worktree, but I will do more research to be sure. What I am 100% sure of so far is that losing the volume causes the git command to hit "context deadline exceeded", which is probably the cause of this. However, as I mentioned, I will spend a little more time finding the root of the issue.

What I am also 100% sure of is that #828 fixes this issue, but it does not address the root cause.

Ahh, "context deadline exceeded" makes sense - there is a timeout, and if it runs out, we just abort the sync. That could leave things inconsistent, but the sanity check at the beginning of each sync loop is SUPPOSED to catch this. I can't unwind every change, but I can nuke it and start over. :)

Your other patch was a great find, and I have added another test case to cover the same failure in a different way. Now that this is merged, I wonder how much more we can/should handle. If there's a corruption case you can find that leaves an incorrect repo but still passes the sanity check, THAT is something I want to fix.

I have a couple of ideas of things to test (like ensuring that git rev-parse HEAD in the worktree gives the right result, and maybe even checking for a "dirty" repo with uncommitted changes), but I bet that breaks someone, so I doubt we can do it.

bakome commented

Yes, the patch fixes the issue, but it does not prevent it. I will try to track down the details and make a new patch. We can probably close this issue, or keep it just for reference.

bakome commented

@thockin I think I figured out how this happens. I believe the main cause is sanityCheckWorktree. The same could happen with the repo sanity check too, but that cannot lead to this error because a fresh git init should be fine. The thing is that in the worktree sanity check, git fsck takes too long to finish and we end up with "context deadline exceeded", which makes that function return false. Even if we removed git fsck, the empty-directory check would probably return false, which leads to the main cause of this problem:

git.log.V(0).Info("worktree failed checks or was empty", "path", currentWorktree)
if err := git.removeWorktree(ctx, currentWorktree); err != nil {
    return false, "", err
}
currentHash = ""

The thing is that in removeWorktree the first check is:

case os.IsNotExist(err):
    return nil

So the caller only checks whether err is non-nil; since removeWorktree returns nil when the path does not exist, the sync does not fail, and currentHash ends up set to the empty string. After that the loop starts.

After that, on every iteration the loop fails the repo sanity check because the directory is missing. Again, #828 fixes the issue, but I believe that if I change the removeWorktree function to return an error when the worktree path is not found, it would prevent this from happening even without the #828 fix.

@thockin One question to consider: do you think it would be good for sanityCheckWorktree and sanityCheckRepo to have a mechanism to fail the current SyncRepo run when "context deadline exceeded" happens?

I think #828 is a correct change, anyway.

In case of a deadline we should exit as cleanly as possible, so if there are paths where that can be better, that sounds ok.

We're definitely hitting these issues, what's your feel for the timing of a 4.x release with this included?

https://github.com/kubernetes/git-sync/releases/tag/v4.1.0 is currently in the promotion queue - needs an approval and then it will be published.

Appreciate you sir!