Write changed-path Bloom filters during commit-graph write
derrickstolee opened this issue · 16 comments
The upcoming version of Git includes changed-path Bloom filters, which speed up file-history queries.
We need to respond by updating the commit-graph maintenance step to write these filters.
Some research is required to determine how long it takes to update the commit-graph in a deep history. Should we update Git to have a maximum batch size?
Instead of making this change inside Scalar, I would prefer something like my current PR in gitgitgadget/git#633, where we introduce a `commitgraph.writeChangedPaths=<boolean>` config in Git. That way, the only change needed in Scalar is to set that config to true.
I prefer this approach because it would also benefit non-Scalar users, who often rely only on `fetch.writeCommitGraph` and `gc.writeCommitGraph` for their performance boost.
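For illustration, the end state would look something like this (a sketch only: `commitgraph.writeChangedPaths` is the name proposed in that PR and may not exist in any released Git; the other two configs do exist today):

```sh
# Proposed config from gitgitgadget/git#633 (not necessarily in released Git):
git config commitgraph.writeChangedPaths true

# Existing configs that non-Scalar users already rely on to get
# commit-graph writes during fetch and gc:
git config fetch.writeCommitGraph true
git config gc.writeCommitGraph true
```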
I support a `commitgraph.writeChangedPaths` config option, but that's not enough for what I need to do here.
I need to do serious performance timings for how long it takes to compute changed-path Bloom filters on a repo with 4 million commits (the Windows OS repo) and see if that's appropriate for an all-in-one write in the background, or if we need to somehow split that computation across multiple steps.
Also, our current incremental commit-graph writes need to change. Currently, we have no idea when the layers will collapse, which is when `--changed-paths` would actually apply to the whole commit-graph chain. So we need a strategy to collapse the layers on purpose, to make sure every user has filters on every layer.
Finally, Scalar is supposed to work against all Git versions v2.25.1.vfs.1.1 and later. All of the above investigation needs to be gated behind a version compatibility check. (Although your config idea does solve that, assuming nothing else needs to change.)
Labeling this issue as stale. There has been no activity for 30 days. Remove stale label or comment or this issue will be closed in 7 days.
Removing the `no-activity` label because @ttaylorr has been doing some critical work in Git to make this possible.
> because @ttaylorr has been doing some critical work in Git to make this possible.
I am? :) It is the case that `--split=no-merge` will help avoid blow-ups in computation time when the commit-graph machinery decides to collapse some layers. There are some bug-fixes in my `tb/bloom-miscellaneous` branch that should help make the rest of this possible.
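For anyone following along, the write mode being referenced looks like this (all flags here are documented in git-commit-graph(1)):

```sh
# Write a new incremental commit-graph layer without merging existing
# layers, so a background run never unexpectedly recomputes filters
# for old commits when layers would otherwise collapse:
git commit-graph write --reachable --split=no-merge --changed-paths
```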
> > because @ttaylorr has been doing some critical work in Git to make this possible.
>
> I am?
I'm specifically talking about `--max-new-filters` in your `tb/bloom-miscellaneous`, which unblocks this work.
Labeling this issue as stale. There has been no activity for 30 days. Remove stale label or comment or this issue will be closed in 7 days.
https://github.com/git/git/blob/master/Documentation/git-commit-graph.txt#L72-L77
https://git-scm.com/docs/git-config#Documentation/git-config.txt-commitGraphmaxNewFilters
The flag made it to master a while ago... but other than searching the mailing list, I don't see what the recommended value of this setting is or how we can use it.
Since Scalar is moving onto the new `git maintenance run` command (and `git maintenance start` eventually, too), we will want to engage with the `commit-graph` task to do this properly.
My current thoughts:
1. We already have the detection of existing filters to determine if we should continue writing filters. This means that if a user runs `git commit-graph write --changed-paths`, then future maintenance runs should continue writing those changed paths.
2. If a user sets `commitGraph.maxNewFilters`, then that will be respected during maintenance. Users will likely want to set it to a small value so the (default) hourly `commit-graph` task doesn't take too long.
3. There are some questions as to how to make a very large repo "catch up". If we rely only on the `commit-graph` task, then we are waiting for the default split commit-graph merge strategy before computing filters for old commits. Pair that with the `commitGraph.maxNewFilters` setting, and we might be waiting a very long time before large repos actually benefit from this setting.
We haven't had time to do the necessary testing to know what a good strategy is for computing these filters on clients in the background without disrupting their foreground activities. A fleeting idea I've had is to create a `commit-graph-full` task that flattens all layers of the commit-graph, perhaps at a daily or weekly frequency. That task could also write changed-path filters by default, along with a recommended default for `--max-new-filters` (being aware of `commitGraph.maxNewFilters`, if set).
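For concreteness, here are the pieces that exist today (real commands and config; the cap value is illustrative, and the `commit-graph-full` task above is only an idea, not something Git ships):

```sh
# Cap how many new Bloom filters any one maintenance write may compute:
git config commitGraph.maxNewFilters 100   # illustrative value; tune per repo

# Opt in to background maintenance, which includes the commit-graph task:
git maintenance start

# Or run the commit-graph task once, in the foreground:
git maintenance run --task=commit-graph
```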
cc: @ttaylorr for his thoughts here.
I wonder if this question should go to the mailing list instead:
It's not apparent to me how to determine the value of `commitGraph.maxNewFilters` by reading Git's official documentation, or what it even does.
It took a bit of time digging through the git mailing list to find a magic max value of 512 (and it's not clear whether this is inclusive, `<= 512`, or exclusive, `< 512`).
Given a repo, how would one calculate what would be a sensible default value for this config?
> It's not apparent to me how to determine the value of `commitGraph.maxNewFilters` by reading Git's official documentation, or what it even does.
The value of this configuration is how many Bloom filters a `git commit-graph write` will compute from scratch before stopping. To write changed-path Bloom filters, we first have to compute them, which is really just doing a first-parent tree diff on all of the commits that are going into the commit-graph layer that we are writing.
Since computing lots of tree diffs can be expensive, it may sometimes make sense to limit the number of them that can be computed. If such a limit is in place and exceeded, Git will write the "uncomputed" filter, which indicates to a future run that Git hasn't tried to compute the CPBF belonging to that commit.
Note that this only applies to the current layer being written, i.e., a previous layer's "uncomputed" filter does not get computed in a subsequent write unless we are combining with that layer (e.g., with `--split=replace` or via an auto-merge).
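In command form (a sketch; the flag and config are the documented equivalents, and the value 100 is just an example):

```sh
# Compute at most 100 new filters in this write; commits past the limit
# get an "uncomputed" placeholder that a later write can fill in:
git commit-graph write --reachable --changed-paths --max-new-filters=100

# Or persistently, via config:
git config commitGraph.maxNewFilters 100
```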
> It took a bit of time digging through the git mailing list to find a magic max value of 512

I'm not sure what you're referring to. Could you link what you are looking at?
> Given a repo, how would one calculate what would be a sensible default value for this config?
There is no single magic recommended value. If you are interested in setting this value, you should run a `git commit-graph write` and see how long it takes to generate changed-path Bloom filters (you can do this easily with `GIT_TRACE2_PERF=1` in your environment). This is all highly dependent on the trees in the layer that you are writing.
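For example (a sketch; with the variable set to `1`, trace2 perf output goes to stderr, and the exact region names in the log may vary between Git versions):

```sh
# Capture a perf trace while writing filters for the whole history:
GIT_TRACE2_PERF=1 git commit-graph write --reachable --changed-paths 2>perf.log

# Look for the Bloom-filter-related timing regions:
grep -i bloom perf.log
```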
> 3. There are some questions as to how to make a very large repo "catch up". If we rely only on the `commit-graph` task, then we are waiting for the default split commit-graph merge strategy before computing filters for old commits. Pair that with the `commitGraph.maxNewFilters` setting, and we might be waiting a very long time before large repos actually benefit from this setting.
Right; but I think that this is probably fine. I would assume/hope that the normal maintenance task accumulates no more commits than it can comfortably compute Bloom filters for and write into an incremental layer. Setting a high-but-not-unreachable limit would be a good way to ensure that not too many commits occur between maintenance runs.

Is there an "expensive" maintenance run that rolls up all of the accumulated layers into one? If so, maybe setting a higher limit (or none at all) makes sense for that run. Otherwise, if you don't run with `--split=no-merge`, then it probably makes sense to stick with the same limit and use it even when we are catching up and computing Bloom filters that were skipped during earlier runs.
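Such an expensive "catch-up" run could look like this (a sketch; `--split=replace` is the documented strategy for collapsing the chain into a single layer):

```sh
# Roll all incremental layers into one and compute every missing filter
# (no --max-new-filters cap, so this can take a while on large repos):
git commit-graph write --reachable --changed-paths --split=replace
```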
Ah, the 512 number is the maximum SIZE of one filter, which is indeed a magic number. There is no way to modify this value outside of `GIT_TEST_*` environment variables (intended only for testing).
As @ttaylorr mentions, `commitGraph.maxNewFilters` is a different number that defaults to "compute all filters for every commit I'm writing to the commit-graph".
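(For completeness, a test-only sketch; the exact variable name below is taken from Git's test suite, is an assumption on my part, and should never be relied on outside of tests.)

```sh
# Test-only override of the per-filter changed-path cap; this knob is
# internal to Git's test suite and may change without notice:
GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=1024 \
    git commit-graph write --reachable --changed-paths
```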
Labeling this issue as stale. There has been no activity for 30 days. Remove stale label or comment or this issue will be closed in 7 days.
I can't un-`no-activity` this issue, but this shouldn't be closed since the maintenance builtin doesn't yet set `commitGraph.maxNewFilters`.
In the case of `no-activity`, it's not "closed because done" but instead "closed because it fell off the back of the backlog." We have a different item tracking this internally, so I'll close this one.