Git history on trees and blobs is slow.

Question

Git history on trees and blobs is slow.

redrabbit opened this issue 6 years ago · 1 comments

When showing the history of a given tree or blob (see #31), things can get pretty slow. Commit 81a75c0 helped marginally but the computation of such an action remains very intensive.

Testing with the current HEAD of elixir-lang/elixir (~15.000 commits):

Rendering the paginated history of master take ~140ms.
Rendering the paginated history of master with a pathspec takes ~6.000ms.

Fetching the history for a given revision is done relatively fast. In contrast, having to filter for a given pathspec takes forever.

Both methods could be greatly improved if pagination does not require to compute the entire revwalk (we currently need to Enum.count/1 the entire revwalk to show the last available page).

Instead we should provide a pagination similar to one specified by Relay Cursor Connections. We loose the possibility to show the number of pages, in return we can start the revwalk from a given commit and Stream.take/2 a small number of commits.

Answer 1 · 2018-11-09T22:01:07.000Z

Commit a20d335 adds a significant boost. With a big repository we can go from >6s down to 150ms.
But this also depends a lot on which blob/tree is used.

Explanation

The history of a file being changed frequently will render much faster than a file changed more rarely.

For a Git commit history of a tree or blob the cost intensive part is that we have to filter each commit to ensure that the given tree/blob has changed. This means

Loading the tree of the commit and its parent.
Diffing both trees.
Matching the given pathspec.

Having to process the entire 15.000 commits takes relatively long. The new paginate_cursor/5 function helps by skipping as much commits as possible...

Implementation

Here's a basic usage example for paginating commit history:

page = paginate_cursor(@conn, @commits, &(oid_fmt(&1) == &2), &oid_fmt(&1.oid))

The performance improvement is due to the fact that

We don't go through the entire list of items. Instead we skip (Enum.drop_while/2) items until we match (third param) the cursor. Having the stream's :enum trimmed until cursor.
We Stream.take/2 until we aggregate limit+1 items (or reach the end of the stream).
We create a new cursor (fourth param) pointing at the previous/next page item.

They are also a few caveats:

We use the :enum field of Stream{} to drop items until fn_filter returns true.
- This is a bit of a hack because it relies on GitRekt.Git.enumerate/1 to resolve the first layer (NIF) of the stream. We use this to make sure that libgit2 iterators behave more Erlangish.
If we paginate the history of a file that is modified relatively often, we will aggregate the limit items quiet fast. But if we have a file that is rarely modified, we might have to process a very big chunk of the full commit list in order to create that page of limit items.