alerque/git-warp-time

long runtime (likely due to some factor with our repo)

Closed this issue · 17 comments

We weren't smart enough to time the initial run, but we did time another run and it took a little under 70 minutes. Since that's so much higher than what you mentioned in the parallelization thread, I wanted to go ahead and report it.

The repo we tried this on might just be a pathological case, but perhaps knowing what our repo looks like will help identify optimizations.

We have a network share that contains a bunch of videos and related assets:

  • I think it's a bit less than 3300 videos. These are hefty. I think each video can range from around 1-100 GB. Total folder size is a little over 13 TB.
  • For each video we also have 6 or 7 related files (audio extracted from the videos, a few stages of captions with auto and human edits applied). I think we can squint a bit and call this roughly 25k files. Aside from the audio files, these are all pretty small.
  • We do most of the work on these from a newish apple silicon mac studio which mounts this volume with smbfs. The mac studio and storage server are in the same office with a 10 Gbps network link.
  • The storage is backed by 12 hard drives with SSDs for caching. We haven't really benchmarked it, but think it's capable of at least 500MB/s read.
  • The root of this network share is a git repo that only the caption files have been committed to. This is 4 files per video, or around 13k files. None of the large (video/audio) files are committed.
  • Total size of the .git directory is around 44 MB, and the current checked-out size of all files that are committed is around 157 MB.
  • Repo history is pretty shallow at 54 commits.
  • If it matters, some of these are big commits with many new files or auto-edits at once, and then there are smaller fixup commits changing fairly few paths.

70 MINUTES‽‽ Hang on while I retrieve my monocle from wherever it popped out to before reading on...

Something bad is going on here. Nothing outside of the Git-tracked files matters, so the 13 TB is moot. In fact the size of the tracked files shouldn't matter either, since only metadata is being examined. The super shallow history means there should be relatively little tree walking needed for each file. 10-20k files with at most about a 50-commit depth isn't even close to the size/depth of some other repos I've run this on.

The obvious difference here is networking, and specifically network file systems. There is likely some slow API call in use that has to make multiple network round trips, perhaps for each step in the history walk. This is probably something we could find and work around, but the ultimate culprit will probably be a network file system operation that somebody didn't plan on being used in bulk.

Out of curiosity, the first thing I would try to debug this is making a full copy of the .git directory somewhere on a local disk with a local file system, then using export GIT_DIR to toggle to using that copy as the source of history data / object storage. That way you could eliminate about half the variables and home in on which half to keep looking into.
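Something along these lines, assuming git-warp-time honors GIT_DIR the same way the git CLI does (the paths here are just placeholders):

$ cp -R /Volumes/share/repo/.git /tmp/repo-git-copy   # history/objects on a local disk
$ cd /Volumes/share/repo                              # working tree stays on the SMB share
$ export GIT_DIR=/tmp/repo-git-copy
$ time git-warp-time
$ unset GIT_DIR                                       # then re-time against the on-share .git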

Another thought is it could be the touch operations themselves. It might be that metadata writes are clogging up some COW buffer and the remote file system is taking a long time to confirm that each operation happened. You might be able to simulate this by doing a bunch of touches with random timestamp values against a local file vs. a remote file and seeing if there is an order-of-magnitude difference.
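A quick way to compare, for example (paths are placeholders; the -d "@<epoch>" form assumes GNU coreutils touch, so on macOS you may need gtouch or the BSD -t format instead):

$ bench_touch() { for i in $(seq 1 500); do touch -d "@$(( 1500000000 + RANDOM ))" "$1"; done; }
$ time bench_touch /tmp/touch-test              # local SSD
$ time bench_touch /Volumes/share/touch-test    # SMB mount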

P.S. You guys know about git-annex?

Something bad is going on here. Nothing outside of the Git-tracked files matters, so the 13 TB is moot. In fact the size of the tracked files shouldn't matter either, since only metadata is being examined. The super shallow history means there should be relatively little tree walking needed for each file. 10-20k files with at most about a 50-commit depth isn't even close to the size/depth of some other repos I've run this on.

Thanks for ruling these out. I hoped they wouldn't matter, but I wasn't sure if you'd have to run a ~negative search for present files that were previously tracked.

I tried running gwt against a clone of the repo on a 2020 intel macbook air (just the repo, none of the other files) and it took about 2 hours. (I probably could've helped this along by keeping it cooler. The SSD on this is peppy, but I watched this run in activity monitor for a while and the reads were generally between like 8 and 20K per tick. I think it had run about 40 minutes when I stepped away, and gwt's process had only read about 9 MB total at that point.)

Once it became obvious that the above was going to take quite a while, we started to suspect that this was either the repo or macOS. We cloned it onto a macbook pro (M2), nuked the .git directory, ran git init again, and recommitted all of the files in a single commit. gwt took 1m22s.
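(For clarity, that re-init was roughly the following; the commit message is made up:)

$ rm -rf .git
$ git init
$ git add -A
$ git commit -m "recommit everything in one go"
$ time git-warp-time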

I guess there's something screwy in the repo itself. (I didn't think to make a point of remembering it, but IIRC the fresh .git directory in this case was about 39 MB, as opposed to the ~44 MB we were seeing in other copies.)

We also left a gwt running against a clone of the repo on WSL. It wasn't done yet, but it had already run for long enough that we assume it'll be similarly long.

It sounds like we need some actual profiling. I don't know where that time is going. I expect some pathological cases (like the Chromium source repo or Nix's package monorepo) to take forever, because when you combine millions of commits of depth with hundreds of thousands of files the blunt-force single-threaded history-walking approach is going to suck. But even for pathological cases, anything over a few minutes sounds like too much to me.
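(One cheap way to get a first look, assuming the cargo-flamegraph tool from the flamegraph crate is installed; on macOS it drives dtrace and usually wants sudo:)

$ cargo install flamegraph
$ cd /path/to/the/slow/repo
$ flamegraph -- git-warp-time     # writes flamegraph.svg in the current directory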

So currently we are:

  1. Listing all currently tracked files.
  2. Iterating each one.
  3. Walking through the history to find the most recent commit touching each file.
  4. Touching the file.

Pretty obviously this has an N×N problem. What we probably need to do is:

  1. List all currently tracked files.
  2. Walk back through commit history listing files touched by the commit.
  3. Touch any files we haven't already touched.
  4. Remove anything we just touched from the list we're still looking for.
  5. Stop walking history when all current files have been touched.

That should reduce the number of Git metadata calls by a factor of roughly the number of tracked files. A rough sketch of the idea follows.
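Illustrative only (git-warp-time goes through libgit2 rather than shelling out, and this assumes GNU touch and ignores quoting/rename edge cases):

#!/usr/bin/env bash
# Rough single-pass version of the list above.
want=$(git ls-files | wc -l)          # 1. count of currently tracked files
declare -A seen
# 2. walk back through history, newest commit first, listing the paths each commit touched
git log --pretty=format:'commit %ct' --name-only |
while read -r line; do
  if [[ $line == commit\ * ]]; then
    ts=${line#commit }                         # commit timestamp for the paths that follow
  elif [[ -n $line && -z ${seen[$line]} && -e $line ]]; then
    touch -d "@$ts" "$line"                    # 3. touch paths we haven't already handled
    seen[$line]=1                              # 4. drop them from the set we're still looking for
    (( ${#seen[@]} >= want )) && break         # 5. stop once every tracked file has been set
  fi
done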

I'm less sure how that algorithm could be parallelized (as in #9), but the number of files tracked in most repos is going to be a much bigger multiplier than the number of CPUs available on most systems, so it's probably a better place to start.

(This still doesn't resolve in my mind why you're having so much trouble with such a shallow depth repository, but it's something concrete that needs doing...)

Can you time something on a manually input list of one file vs. 10 files vs the whole repo (using rough times is fine such as the ones you already have) and see if those numbers make sense? If you have 10k files I would expect a manually input list of 10 files to take about 1/1000 the time, so 70m vs 4s or something in that ballpark. Does that guess hold true?

As of v0.6.0, step 2 above triggers parallel processing for steps 3 and 4, which might bring your 70m case down to 10m or so depending on your system. I released it because it is a lot better than before and I wanted to start using it in production myself, but obviously that is still kind of a big number for some repos. I expect there is another couple of orders of magnitude we can shave off here, so I'm keeping this issue open for sure.

If you have a chance, an answer as to whether processing a small list of files on your repo takes a reasonable fraction of the time that the whole set does would still be useful info.

(This still doesn't resolve in my mind why you're having so much trouble with such a shallow depth repository, but it's something concrete that needs doing...)

I'm not sure if it leapt out from the post before, but when we nuked the .git directory and re-inited we did get ~1m22s for a full run of gwt. I was imagining it just meant there's something ~weird/unexpected (but not totally broken) in our .git dir that is inducing a lot of overhead.

But I guess another way to read that is that it's just a difference between a commit depth of 1 and 54. Most of our files were committed in the first few commits, and then broadly the later commits are edits or fixes that touch much smaller fractions. It seems reasonable that the bulk of the history searches go relatively deep.

Can you time something on a manually input list of one file vs. 10 files vs the whole repo (using rough times is fine such as the ones you already have) and see if those numbers make sense? If you have 10k files I would expect a manually input list of 10 files to take about 1/1000 the time, so 70m vs 4s or something in that ballpark. Does that guess hold true?

I'll hesitate to characterize these... :)

1 file

original smb share:

$ time git-warp-time apples/bananas/branded/c02_e01.srt

real	0m48.251s
user	0m0.538s
sys	0m1.148s

fresh local clone:

$ time git-warp-time apples/bananas/branded/c02_e01.srt

real	0m0.420s
user	0m0.360s
sys	0m0.058s

10 files

original smb share:

$ time git-warp-time "apples/bananas/branded/c00_e01.srt" "apples/bananas/branded/c01_e01.srt" "apples/bananas/branded/c01_e02.srt" "apples/bananas/branded/c01_e03.srt" "apples/bananas/branded/c01_e04.srt" "apples/bananas/branded/c02_e01.srt" "apples/bananas/branded/c02_e02.srt" "apples/bananas/branded/c02_e03.srt" "apples/bananas/branded/c02_e04.srt" "apples/bananas/branded/c03_e01.srt"

real	0m50.157s
user	0m3.475s
sys	0m1.157s

fresh local clone:

$ time git-warp-time "apples/bananas/branded/c00_e01.srt" "apples/bananas/branded/c01_e01.srt" "apples/bananas/branded/c01_e02.srt" "apples/bananas/branded/c01_e03.srt" "apples/bananas/branded/c01_e04.srt" "apples/bananas/branded/c02_e01.srt" "apples/bananas/branded/c02_e02.srt" "apples/bananas/branded/c02_e03.srt" "apples/bananas/branded/c02_e04.srt" "apples/bananas/branded/c03_e01.srt"

real  0m3.357s
user  0m3.285s
sys  0m0.065s

~1000 files (936, the largest single folder I think):

original smb share:

$ time git-warp-time "zebras/yellow/unbranded/"*
Error: Custom { kind: InvalidInput, error: "Paths [... errors for 704 untracked files the wildcard matched ...] are not tracked in the repository" }

real	0m50.925s
user	0m0.169s
sys	0m1.328s

fresh local clone:

$ time git-warp-time "zebras/yellow/unbranded/"*

real	4m36.432s
user	4m35.719s
sys	0m0.465s

Unsure why this last one's so much slower than the smb share, but we did try it twice. Would the invocation against the smb share be bailing out ~early due to the errors?


This may just be pattern-matching run wild enough to border on numerology, but this fairly consistent ~48-second overhead on the smb share made me a little curious...

There are 13 top-level directories in the repo, and 88 total directories (with relevant files 2-4 folders deep). If there were some 47 seconds of overhead per directory for some reason, it would come out to a smidge under 69 minutes (88 × 47 s ≈ 4,136 s).

There are 13 top-level directories in the repo, and 88 total directories (with relevant files 2-4 folders deep). If there were some 47 seconds of overhead per directory for some reason, it would come out to a smidge under 69 minutes (88 × 47 s ≈ 4,136 s).

I won't quite say I've ruled this out, but I poked at the idea from 2 more angles that seem to rule out a general problem related to directory count.

reducing file count while keeping directory count steady

I mentioned previously that this took ~2h on a 2020 macbook air. I took a copy of that and knocked out all but 1 set of 4 files in each leaf directory and committed. That reduced the total number of files from about 13k to about 250, and gwt then yielded:

$ time git-warp-time

real	1m16.489s
user	1m16.000s
sys	0m0.390s

I can squint and say the time fraction is close to the file fraction. If there is big per-directory overhead, I imagine it'd either have to be triggered by something purged along with those deletions or related to the smb share.

explicitly specifying 2 files in the same and separate directories

On the same 2020 macbook air, I also confirmed that 2 files in the same directory and 2 files in separate directories take about the same time:

$ time git-warp-time "apples/bananas/branded/c02_e01.srt" "apples/bananas/branded/c02_e02.srt"

real	0m1.153s
user	0m1.097s
sys	0m0.053s

$ time git-warp-time "apples/bananas/branded/c02_e01.srt" "crepe/dumbbell/branded/c00_e00.srt"

real	0m1.149s
user	0m1.094s
sys	0m0.053s

Tried out the new release this morning.

Here's what I see on the pared-down (~250-file) copy of the repo on the 2020 macbook air:

0.5.2 $ time git-warp-time

real	1m16.489s
user	1m16.000s
sys	0m0.390s

0.6.1 $ time git-warp-time

real	0m36.848s
user	3m44.465s
sys	0m3.127s

Here's what we see on the original repo from the mac studio (your ~10-minute guess was good :)

$ time git-warp-time

real  9m50.232s
user  77m26.145s
sys   31m15.182s

Last night I cooked up another iteration that does a depth-first search and handles all the files in one pass instead of walking history for each one. This is at least one more order of magnitude faster in early testing, probably more once I work out some kinks. Unfortunately the current version basically fails at its job and ends up resetting all timestamps to the root commit time, which is not useful, but in doing so it does more work than a correctly functioning loop that breaks after finding all the files would, so I'm hopeful we can chop off another 10× time chunk pretty soon.

Okay, I have it working again. There is now no parallelism again (mostly because libgit2 is really grumpy about sharing), but the single-threaded current iteration is clocking in at consistently 45× faster than the multi-threaded v0.6.1. There are definitely some pieces that could be done in parallel, but it will require some fiddling to figure out what bits of libgit2 we should side step or run multiple instances for. It will be a much more delicate thing to add back now, and perhaps with Gitoxide it would go smoother.

In any case, another one and a half orders of magnitude while only using one CPU is a step forward, not backward. My fan doesn't spin up and it's done in 1/45th the time of the version that spins up all the cores.

I closed this because I think we're in the ballpark now, but I'm open to looking at it again if your use case isn't manageable. I think there is probably one more order of magnitude to be gained here, but it is in lots of little pieces. Since libgit2 isn't going to make this easy, it will require either Gitoxide or very carefully collecting values into our own structures through whatever iterators are fastest and then going through them in memory with our own segmentation.

I'm certainly open to this (especially if somebody contributes to it) but my motivation is waning now that my biggest regular use cases are down from at most 40 seconds to tenths of a second.

My guess is your case that came down from the 70-minute range to the 10-minute range with v0.6.x should now be in the 15-second range with v0.7.x. Let me know if that doesn't hold true.

This looks great.

I won't have access to the 70m system until Monday, but I did give this a try on the 2020 air and agree that this is almost certainly good enough to merit closing.

Here's the clone I thinned down to ~250 files:

$ time git-warp-time

real	0m1.128s
user	0m0.753s
sys	0m0.060s

Here's the full clone that was formerly taking ~2h:

$ time git-warp-time

real	0m6.608s
user	0m6.377s
sys	0m0.167s

fiddling to figure out what bits of libgit2 we should side step or run multiple instances for

Sounds familiar :)

In the process of making https://github.com/abathur/lilgit I used just a little of both the Python and Rust bindings and found it to be more or less a coin flip whether they'll handle something faster or way slower than the git CLI. I've been intending to try gitoxide out (I subscribed and get the voluminous release notifications...) as it matures to see if it helps much on that point.

Report from the mac studio / smb system is quite a bit faster than on 0.6.x, but still a few minutes (though my personal system obviously was significantly faster than 15s):

$ time git-warp-time

real  3m9.857s
user  0m4.874s
sys  0m7.180s

Thanks for reporting. I'm not super happy with the 3-minute mark, but given that you shouldn't have to run this except when cloning or switching branches, I'm not going to sweat it too much for a mega repository. It will take me more time to fix than it will ever save you in production ;-) Also, for a network file system with lots of metadata access going on, I'm not sure what the theoretical limits would be, i.e. what kind of time we're leaving on the table. I messed with git CLI tooling trying to get answers to simple questions like "the last commit message to touch this one file" and was getting slower responses than it is taking GWT to reset the whole repo.
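(For reference, the kind of single-file question being described looks something like this with the plain CLI; the path is just one borrowed from earlier in the thread:)

$ git log -1 --format='%s' -- apples/bananas/branded/c02_e01.srt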

That being said I am aware there is much more that could be done and I am interested in playing around with parallelization and/or Gitoxide. I opened #15 that you can subscribe to if you want notifications when the performance situation changes again. PRs also welcome.

I hope you get some mileage out of GWT as-is now, even if you have to wait a few minutes. For my part I'm quite enjoying that the few minutes I've been semi-regularly waiting on for a few projects is now down to the sub-second range! I even added it to some checkout hooks that run all the time instead of running it only as a pre-build script. Maybe someday we can get your somewhat harder case down another big chunk too.