GitoxideLabs/gitoxide

`gix clean -xde` deletes the repo's own hidden nested worktrees, but they are not really hidden

Closed this issue ยท 5 comments

Current behavior ๐Ÿ˜ฏ

#1465 fixed #1464 by checking if a git repository encountered during traversal is a worktree of the current repository and declining to delete it even if -p is passed even if -r is passed, and also even if -x and -d are passed to delete ignored directories. This protects the repository's own worktrees, even if nested, provided that they are encountered during traversal rather than being present below a directory that is found during traversal and determined to be ignored.

As described in #1464 (comment), gix clean continues to require --skip-hidden-repositories to avoid deleting nested repositories that require full traversal to discover, and this is important for performance. This behavior holds even when those nested repositories are worktrees of the current repository.

But worktrees of the current repository, regardless of their location, should never require much work to discover. This is because the repository knows where its worktrees are. For example, a non-bare repository with git worktree managed worktrees, if it is not a submodule, has a .git directory with a worktrees subdirectory that contains information about its worktrees.

Whether a directory contains any descendants that are worktrees of the current repository should likewise be efficient to determine and likewise does not require full traversal.

Broad implementation idea

I suspect that either there is a simpler and more elegant approach, or maybe I am just expressing this idea poorly and it can be implemented simply? I am not sure. I present this more to clarify why I believe this can be achieved efficiently than to argue for a particular algorithm.

At the beginning of traversal, before actually beginning the walk:

  1. Examine the repository .git/worktrees directory or equivalent and make a list of its worktrees.
  2. Remove any from the list that physically do not exist on disk. This requires checking if they exist.
  3. Normalize their paths relative to the root of the incipient traversal, i.e., so they are the same as they will be seen to be in traversal.
  4. Remove any from the list that are not inside the repository. Preferably, remove any that are not nested inside the working tree that gix clean is operating on. If feasible, remove any that are outside the part of it that gix clean is operating on, if that is not the whole thing.
  5. Optionally, as a performance optimization (if such an optimization is justified), fall back to the old implementation if the list is empty. At this point, the list will usually be empty, because usually there are no nested worktrees.
  6. Store those paths, or maybe explicitly store each intermediate subdirectory path, in such a way that allows recursive deletion of a complete subdirectory that contains them to be avoided.

For elaboration on that last pre-step:

  • This is conceptually a set, but I don't know what data structure should be used. Usually a repository has a small number of worktrees; making weird cases with numerous worktrees pretty fast would be good, but avoiding any more than a tiny slowdown with only a few worktrees is a must. With no worktrees, the slowdown of the whole algorithm compared to the current behavior should be negligible or zero.
  • The situation with intermediate directories is conceptually similar, and should perhaps be handled similarly, to that of intermediate ignored directories under which there is a tracked file (such as by git add -f).

Then perform the walk and use the knowledge of where the worktrees are and which paths are ancestors of them to refrain from deleting an entire directory tree if any of the repository's own worktrees are anywhere inside it, while still deleting all other files that are eligible for deletion.

Complexity and connection to reusable library features

If this cannot be done in a reasonably simple way, then it might be judged not to be justified on the grounds that it is rarely needed.

However, whether or not that is the case, I wonder if this would become simple if expressed in terms of new directory walk features that would themselves have other applications.

Or maybe it is already simple and my unfamiliarity with the code involved--this is very much a part of the code that I am less familiar with--is what makes it seem complicated to me.

Expected behavior ๐Ÿค”

Deleting nested worktrees of the same repository, no matter how deep under ignored directories, is something I do not expect to happen. I think the expected behavior is to preserve them. I have two reasons to think so.

1. On hiddenness - analogy to tracked files under ignored directories

As briefly alluded to above, even in the absence of multiple worktrees, one can have tracked files nested arbitrarily deep inside ignored directories by adding their .gitignore entries later or by using git add -f. These files are preserved by gix clean, which has efficient access to knowledge of their existence because their locations, and thus knowledge of their ancestor directories, is present in the repository without requiring a full traversal.

Of course, this analogy breaks down in some ways, since there are Git tree objects representing those intermediate directories, while the repository's knowledge of its own git worktree managed worktrees is, as far as I know, only ever through its knowledge of their paths. But tree objects don't exist for intermediate directories under which all tracked files are staged but not committed--and gix clean seems to always do fine with this--so perhaps this analogy is not so bad even at a technical level?

Either way, I think the analogy holds in terms of what is intuitive to users, as well as what is being described informationally by the characterization of a directory as hidden.

Since gix clean handles that, and gix clean also seeks to avoid deleting any worktrees of the current repository, I think the intuitively expected behavior is to treat them the same way.

2. On performance - inferring expectation from design intent

Because the repository's worktrees are never hidden in the sense of requiring a dirwalk or other inherently slow operations to find them, users who are aware of the performance considerations related to --skip-hidden-repositories may intuitively assume that the repository's own worktrees are protected even without it.

This assumption would hold even if the implementation details required to protect them efficiently turn out to be unworthwhile. Of course, that does not imply that implementing it must be worthwhile. Further documentation changes could instead be made to avoid giving the impression that they would be protected.

Maybe there are simpler alternatives?

All the foregoing text considers simplicity in terms of whether the behavior I am advocating for could be done in an acceptably simple way. But maybe there is a different improvement that could be made that is inherently simpler.

One idea: The message shown when -xdn is passed without --skip-hidden-repositories is itself clear with respect to separate repositories in ignored directories. Suppose the repository has other worktrees of its own (besides the one being operated in) whose parent directories appear to match any .gitignore entry. Then that message could be augmented to also cover that. This could be by the addition of a separate trailing sentence or paragraph, or even just by adjusting its wording to say "repositories or worktrees" or something.

Another idea is to always have it contain that augmentation, but I think that might be more confusing, and less useful as a reminder that the situation might really apply.

Git behavior

I recently learned that git clean accepts -f twice in which case it does remove nested repositories. When passed with -d and -x, this removes ignored nested worktrees as well, both those that are directly visible inside a non-ignored parent directory, and those that are deeper down.

But when using only documented features, git clean avoids deleting any nested repositories, even those that are unrelated to the current repository. In contrast, gix clean offers this as a feature explicitly, and provides options to control it.

I think both of the above points are of low relevance to determining how gix should behave in the specific case this issue covers, and that the git behavior, to the extent that it is related, argues neither for nor against the proposal here.

Steps to reproduce ๐Ÿ•น

Instructions

I did the following on an Ubuntu 22.04 LTS system.

  1. Build and installing gitoxide from the tip of the main branch, making sure the build is from code that included the changes from #1465.

  2. Create a repository specifying an ignored path that will be used for a subdirectory:

    git init has-deeply-nested-worktree
    cd has-deeply-nested-worktree
    echo subdir >.gitignore
    git add .
    git commit -m 'Initial commit'
  3. Create a worktree in a directory inside that ignored directory, so that the worktree's root is a subdirectory of it:

    mkdir subdir
    git worktree add subdir/mybranch
  4. See what git clean would do with -x and -d, then observe that it does it:

    gix clean -xdn
    gix clean -xde
    git worktree list
    ls subdir

This shows that the worktree is gone, even though it is a worktree of the current repository rather than being an unrelated nested repository.

With output

Here's a transcript of the above, showing the output:

ek@Glub:~/src$ git init has-deeply-nested-worktree
Initialized empty Git repository in /home/ek/src/has-deeply-nested-worktree/.git/
ek@Glub:~/src$ cd has-deeply-nested-worktree
ek@Glub:~/src/has-deeply-nested-worktree (main #)$ echo subdir >.gitignore
ek@Glub:~/src/has-deeply-nested-worktree (main #%)$ git add .
ek@Glub:~/src/has-deeply-nested-worktree (main +)$ git commit -m 'Initial commit'
[main (root-commit) e53b380] Initial commit
 1 file changed, 1 insertion(+)
 create mode 100644 .gitignore
ek@Glub:~/src/has-deeply-nested-worktree (main)$ mkdir subdir
ek@Glub:~/src/has-deeply-nested-worktree (main)$ git worktree add subdir/mybranch
Preparing worktree (new branch 'mybranch')
HEAD is now at e53b380 Initial commit
ek@Glub:~/src/has-deeply-nested-worktree (main)$ gix clean -xdn
WOULD remove subdir/ (๐Ÿ—‘๏ธ)

WARNING: would remove repositories hidden inside ignored directories - use --skip-hidden-repositories to skip
ek@Glub:~/src/has-deeply-nested-worktree (main)$ gix clean -xde
removing subdir/ (๐Ÿ—‘๏ธ)
ek@Glub:~/src/has-deeply-nested-worktree (main)$ git worktree list
/home/ek/src/has-deeply-nested-worktree                  e53b380 [main]
/home/ek/src/has-deeply-nested-worktree/subdir/mybranch  e53b380 [mybranch] prunable
ek@Glub:~/src/has-deeply-nested-worktree (main)$ ls subdir
ls: cannot access 'subdir': No such file or directory

Thanks for bringing this up! It's true that worktrees in hidden directories would themselves be hidden, and thus could be removed as part of the warning WARNING: would remove repositories hidden inside ignored directories - use --skip-hidden-repositories to skip.

Or maybe it is already simple and my unfamiliarity with the code involved--this is very much a part of the code that I am less familiar with--is what makes it seem complicated to me.

The algorithm has all the capabilities needed to handle this correctly. Instead of not descending into an ignored directory, it could decide to descend if it detects an included (i.e. worktree_dir.has_prefix(ignored_dir)) worktree. From there it would near-automatically list what it needs to to delete all ignored directories and files 'around' the worktree which is not to be touched.
Similar behaviour can be found if there is a mix of ignored and untracked entries, I think. However, it could also be that there are details that I am missing which would make this more difficult, but in principle it's all there.

Since gix clean handles that, and gix clean also seeks to avoid deleting any worktrees of the current repository, I think the intuitively expected behavior is to treat them the same way.

I agree, and this analogy might be a reason that implementing this is easier as one might think after all, if the right classification is found. Out of the back of my head, I really don't know how it works exactly ๐Ÿ˜….

But when using only documented features, git clean avoids deleting any nested repositories, even those that are unrelated to the current repository. In contrast, gix clean offers this as a feature explicitly, and provides options to control it.

Indeed, git has the benfit of not having to deal with worktrees specifically at all, and now we know that it definitely doesn't do that either. Maybe that's also the reason -ff isn't documented, and I agree with your conclusion on the comparison.


How to proceed from here

If possible, would you contribute the documentation change that bluntly alters the warning to also mention worktrees? That way people who have setups like this, worktrees in ignored directories, would be more specifically warned.
Edit: You could also share a commit and I apply it to #1470.

Then I'd leave this issue open and see if there is demand by users for fixing this properly, because like I said, technically it's all there and it's just a matter of adding a new test - from there the change could 'write itself' especially once it's understood how tracked files in ignored directories cause the algorithm to descend. This might be key for a simple fix, and maybe it's something you would like to try - I'd encourage it only if you think there is something else to learn though.

For today, I definitely don't feel like tackling this, but I might return here even before there is user demand as the fix might be easy. For good measure, I will open a PR that at least tests the status quo and maybe play around with it real quick for a chance to a simple fix.

Actually, I will post a fix for that in a moment, no need to adjust the warning message in that regard.

Here is how this looks now:

โฏ git worktree add target/foobar
Preparing worktree (checking out 'foobar')
HEAD is now at 0e8508a62 Release gix v0.64.0, gix-fsck v0.4.1, gitoxide-core v0.39.0, gitoxide v0.37.0
โฏ git worktree list
/Users/byron/dev/github.com/Byron/gitoxide                2e636beeb [fix-clean]
/Users/byron/dev/github.com/Byron/gitoxide-feature        802dfee01 [shared-parallel]
/Users/byron/dev/github.com/Byron/gitoxide/target/foobar  0e8508a62 [foobar]
โฏ gix clean -xd
WOULD remove .DS_Store (๐Ÿ—‘๏ธ)
WOULD remove etc/.DS_Store (๐Ÿ—‘๏ธ)
WOULD remove etc/monthlies/.DS_Store (๐Ÿ—‘๏ธ)
WOULD remove empty gix-actor/tests/fixtures/
WOULD remove gix-config/fuzz/target/ (๐Ÿ—‘๏ธ)
WOULD remove gix-dir/tests/fixtures/generated-do-not-edit/ (๐Ÿ—‘๏ธ)
WOULD remove target/ (๐Ÿ—‘๏ธ)

WARNING: would remove repositories hidden inside ignored directories - use --skip-hidden-repositories to skip (Skipped 26 precious entries - show with -p)
WARNING: would remove repositories hidden inside untracked directories - use --find-untracked-repositories to find
โฏ cargo run --bin gix -- clean -xd
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.40s
     Running `target/debug/gix clean -xd`
WOULD remove .DS_Store (๐Ÿ—‘๏ธ)
WOULD remove etc/.DS_Store (๐Ÿ—‘๏ธ)
WOULD remove etc/monthlies/.DS_Store (๐Ÿ—‘๏ธ)
WOULD remove empty gix-actor/tests/fixtures/
WOULD remove gix-config/fuzz/target/ (๐Ÿ—‘๏ธ)
WOULD remove gix-dir/tests/fixtures/generated-do-not-edit/ (๐Ÿ—‘๏ธ)
WOULD remove target/.rustc_info.json (๐Ÿ—‘๏ธ)
WOULD remove target/CACHEDIR.TAG (๐Ÿ—‘๏ธ)
WOULD remove target/debug/ (๐Ÿ—‘๏ธ)
WOULD remove target/package/ (๐Ÿ—‘๏ธ)
WOULD remove target/release/ (๐Ÿ—‘๏ธ)
WOULD remove target/tmp/ (๐Ÿ—‘๏ธ)

WARNING: would remove repositories hidden inside ignored directories - use --skip-hidden-repositories to skip (Skipped 26 precious entries - show with -p)
WARNING: would remove repositories hidden inside untracked directories - use --find-untracked-repositories to find

Without the fix, it would delete target/ which now contains a protected directory, but with the fix it lists everything around the now protected worktree.

Actually, I will post a fix for that in a moment, no need to adjust the warning message in that regard.

This does not need to get in the way of that, but I'll still open a draft PR with the already-done wording changes, because I did them together with changes to how the relevant options are documented in help text, some of which are relevant even in the absence of this, and this requires no extra effort since I was already just about to click the button to create it. I had been saying I'd update the help text for -r and this includes a version of that.

It may be that no part of those changes will be worthwhile and I will have no objection to the PR being closed without merging, which I may even manage to do shortly myself if the situation is clear enough. If some part of them are good then they can either be included elsewhere (and the PR still closed) or the PR can be modified appropriately.

Actually, this was fixed by #1470.