neithernut/git-dit

git-dit gc is slow

matthiasbeyer opened this issue · 11 comments

Hi,

as I already told you in private, I am playing with importing all issues from imag into git-dit.

I created git-dit-github-import-rb and imported the issues. Now I can git dit gc --dry-run. It took way too long.

I have about 8500 commits created by the import, git dit gc --dry-run | wc -l prints almost 1900 refs which will be garbage collected, but it takes 51 seconds (in debug build) to complete the dry-run, which is too long IMO.

Also it blocks until all refs are found and prints them after the aggregation. I don't know how much that might be improvable, but we should consider printing as early as possible here, IMO.
Also, some status output would be really nice, for example using indicatif, which is a really nice crate for progress reporting IMO.

Maybe we can add parallelization in the garbage collector implementation to speed things up a good bit.

IIRC, the garbage refs are collected because of a mixture of reducing code duplication and another Result<Iter<Result>>. The collection code itself should return an iterator which may be used directly.

However, this does not change a thing about the GC being slow. I'll try to have a look soonish.

Note: while making the GC code capable of doing things in parallel might be of value in the future, I'd rather figure out why it's slow in the first place, because it shouldn't be.

Maybe we are not to blame, after all: libgit2/libgit2#4428
Note: this is just a speculation, I'll make proper measurements some time in the future.

@matthiasbeyer do you have a repo somewhere for reproduction/measurement?

I have a repository but I would only share it privately to you.

I obtained a repo for testing from @matthiasbeyer. I can somewhat reproduce the (low) performance observed by him:

$ time git-dit gc --dry-run
real	0m29.135s
user	0m10.965s
sys	0m17.705s

According to perf, nearly 80% of the time is spent in git2::repo::Repository::references_glob or git_reference_iterator_glob_new (the latter is the libgit2 function invoked by the git2 abstraction).

This is clearly a problem for us, both because we inherently have lots of refs and there are not so many reasonable alternatives to retrieving the refs via globbing. I also suspect that libgit2 doesn't have much room for improvement itself or that it is reasonable for the devs to improve the performance at all cost. Git is not designed for many references.

I'm not yet sure how to solve this problem. Ideas are welcome!

Maybe we could do some measurement how fast git gc is, first. If it's in the same space, I'd rather mark this as non-issue.

Note that this case is extreme: We should run the GC more often than after 8k messages or whatever it was. Calling the GC on a small repository (tens or hundrets of refs to collect) this should be way faster.

All in the context "I have not tested how fast it is if it has nothing to collect", of course.


What I just remembered I should note as well: I have slow discs. I have 5400rpm discs in my machine... parallelization on our end could help with FS access, I guess.

It may help to restrict collecting to some issues as the space of refnames to consider is much smaller. That's one of the reason why I added support for this to git dit gc and (iirc) proposed crafting a hook using this functionality.

However, if the number of refs is huge, you will always run into this problem. Sure, we may be able to parallelize globbing: collect references for multiple issues in parallel, but the underlying problem remains: globbing is slow.

Maybe we can take advantage of the fact that we don't really need globbing but only directory-listing. This would, however, lead to a completely separate implementation for ref-discovery if not provided by libgit2 itself.

Turns out libgit2 iterators over all references and matches each one against the glob provided. Hence, we end up with a runtime proportional to #issues x #all_references. I'll try to introduce an optimization to libgit2 specific to the FS-backend.

Regardless whether the optimization will be accepted or not, it would be reasonable to modify the garbage-collection so it will also work with current versions of libgit2 and other backends: instead of querying the references for each issue, we would iterate over all references and filter them based on the selected issues and GC-parameters. We should also put a big sticker on it saying why we do it this way.

As I already explained, the GC is slow with lots of issues because it scales a little worse than O(number_of_issues^2), because globbing leads to processing of each reference in the repository. For FS-based repo, there is an optimization (libgit2/libgit2#4629) which will, probably, soon be merged and be available via git2-rs at some point in the future. We'll "just" have to wait for that and then update the git2-dependency. However, we can also do a bit more than waiting:

The latest refactor of the GC in itself did not make collecting references faster. We could spare us some globbing, but only at the expense of effectively building our own in-memory refdb.

However, it did enable printing/processing the references collected for each issue directly, without an intermediate collection step. In other words, we can now tweak git dit gc so that it doesn't sit there silently for eternity and suddenly spouts all the refs it accumulated but spits out what's availible in smaller chunks. I'll do that soonish, together with a few other changes.

Another thing we can to is building a pipeline where CollectableRefs::for_issue() is run in another thread. This way we can collect references (and maybe find the next issue) while for_issue() is busy globbing for refs and figuring out whether the local head can be collected. Implementing such a functionality is relatively easy using stuff from std::sync::mpsc. It appears that Revwalk implements Send, so we could easily implement Send for RefsReferringTo.

There is also a neat little crate just for pipelining (ppipe). However, we cannot run multiple worker threads if we use that. And we may want to do so, eventually.

Bah. I misread, Revwalk does not implement Send. This, of course, complicates things a bit. I bet there are synchronization wrappers somewhere in the stdlib, but I did not see them, yet. We can just wrap the RefsReferringTo in a Mutex.