Author: Alberto Cortés
Date: 2017-03-21
Originally published here.
"Laurel forest, La Palma © Jasper Arends, Flickr Creative Commons"
Do you know how Git identifies what files have changed between two commits?
It is a fairly common operation you probably do every day when you review pull requests, when you check staged files... It involves prefix trees and Merkle trees and many comparisons.
In this blog post I will walk you through every step in the process in an intuitive way; I will skip some of the hairy details so I can focus on the main problems and solutions, and present them in a clear and understandable way.
This blog post was inspired by my work on go-git, which in turn was inspired by Git itself and libgit2.
I will use snippets of Go code here and there to illustrate some concepts, they all are simplified versions of the real Go code we use in go-git.
Whenever you make a Git commit, you save a snapshot of your project into your repository, that is, Git remembers the state of all the files you have staged and stores all that information in a tree structure inside the commit.
This tree structure is composed of nodes that represent your files and
directories. I call this kind of tree a MerkleTrie
, as it has properties of
both prefix trees (tries) and Merkle trees, let us get into the details!
In a Git tree every node has a name, which is the filename portion of the path to the file or directory it represents, as in the figure below:
An example of a Git tree with some names in their nodes. The names of the nodes are shown between double quotes.
The practical implications of this are:
-
The root has the empty string as its name.
-
The name of a node is not its full path, and does not provide enough information to know where the node is located in the tree.
-
The path of a node is the list of its ancestors, starting from the root and ending with the node itself. For example, the path to
lib/lib.go
will be the list of nodes containing the root node,lib
and finallylib.go
itself. -
The children of each node can be lexicographically sorted by their name:
LICENSE
comes beforelib
,lib
comes beforemain.go
. Similarly,lib.go
always comes beforelib_test.go
. -
The depth-first traversal of a tree which visit nodes in the same directory in lexicographic order enumerates them the same way as the
tree
command line tool:LICENSE lib lib/lib.go lib/lib_test.go main.go
Every node has a SHA-1 hash in a Git tree:
-
The hash of a file is calculated from its size and contents.
-
The hash of a directory is calculated from the mode, name and hash of its children, in lexicographic order.
If we add the hash information belonging to each node, in blue, to the previous figure, we get a more detailed view of the tree:
An example of a Git tree with some names and hashes in their nodes. The names of the nodes are shown between double quotes, the first bytes of their hashes are shown in blue.
This means Git trees are Merkle trees, the practical implications for us are:
-
Files with different contents have different hashes. This means it is pretty fast to find out if a file has changed between two commits, as you only need to compare their hashes, not their contents.
-
Two files with the same content have the same hash, no matter if they have different names and/or permissions. If you index contents by hash, you only need to store it once, no matter how many times it appear in the history of the repository.
-
If two directories have the same hash, then they have the same children: same files and directories with recursively the same names, modes and contents.
This is a huge time saver for comparing directories between commits, because if their hashes match, you know they have not been modified; there is no need to check their children individually.
Consequently, the worst case algorithmic complexity of checking two directories for equality is reduced from O(n) to O(1) (with n being the number of descendants of the directory with less descendants).
-
If some change has been introduced in a commit, no matter how small it is, all the ancestor directories of the modified file will have their hashes modified.
Let us assume, for the sake of this blog post, that hash collisions are impossible, even though they are quite popular nowadays.
In line with what we have explained so far, this is a sensible representation of Git trees in Go:
type Noder interface {
Name() string
Hash() []byte
Children() []Noder
}
The path to a node in a tree will look something like this:
type Path []Noder // beginning from the root
// and ending with the node itself
I recommend Path
to implement Noder
so that you can treat paths as
nodes when needed:
// assuming the receiver is not the empty slice…
func (p Path) last() Noder { return p[len(p)-1] }
func (p Path) Name() string { return p.last().Name() }
func (p Path) Hash() []byte { return p.last().Hash() }
func (p Path) Children() []Noder { return p.last().Children() }
Let A and B be two Git trees from consecutive commits. If we compare one with the other there are going to be quite a few changes we have to deal with, let us see a minimal set (changes depicted in red):
-
Modified file: Both trees have the same topology, but the hash of the file is different from A to B, along with the hash of all its ancestors, as illustrated in the figure below:
-
Renamed file (the name of the file has changed, but it has not been moved to a different directory): Both trees have the same topology and the hash of the file is the same in A and B, but the hash of its ancestors is different in each tree.
-
Inserted file: The topology of A and B is different, the parent directory of the file has a new entry in B and the hash of all its ancestors changes.
-
Deleted file: The topology of A and B is different, the parent directory of the file has a missing entry in B and the hash of all its ancestors changes.
In general, if we compare two Git trees we are likely to face several occurrences of the cases above, and maybe even some composite cases, like moving a file to another directory or replacing a file with an empty directory with the same name.
The output of our tree comparator will be a list of changes, that applied to the A tree will turn it into the B tree. A change in this context is defined as an action and the involved paths. We can represent them as follows in Go:
type Action int
const (
Insert Action = iota
Delete
Modify
)
type Change struct {
Action Action
From Path // to the modified node in A or nil if Action is Insert
To Path // to the modified node in B or nil if Action is Delete
}
Here are some examples of how to represent a few popular kinds of changes between two trees:
-
Inserting a
lib/foo.go
file will be represented as a single insertion action, fromnil
to the path of the new file:fooDotGo := ... // the node for the 'foo.go' file lib := ... // the node for B's 'lib' directory (contains the new fooDotGo) root := ... // the root node (contains lib) insertion := Change{ Action: Insert, To: Path{root, lib, fooDotGo}, } changes := []Change{insertion}
-
A modification of the contents of
lib/lib.go
is represented as the single change too, from A'slib/lib.go
to B'slib/lib.go
:libDotGoA := ... // A's 'lib.go' file libA := ... // A's 'lib' directory rootA := ... // A's root node libDotGoB := ... // B's 'lib.go' file (different hash than libDotGoA) libB := ... // B's 'lib' directory (different hash than libA) rootB := ... // B's root node (different hash than rootA) modification := Change{ Action: Modify, From: Path{rootA, libA, libDotGoA}, To: Path{rootB, libB, libDotGoB}, } changes := []Change{modification}
-
Renaming
lib/lib.go
tolib/foo.go
needs two changes: deleting A'slib/lib.go
and inserting B'slib/foo.go
.libDotGo := ... // A's 'lib.go' file libA := ... // A's 'lib' directory rootA := ... // A's root node fooDotGo := ... // B's 'foo.go' file libB := ... // B's 'lib' directory (different hash and children than libA) rootB := ... // B's root node (different hash than rootA) deletion := Change{ Action: Delete, From: Path{rootA, libA, libDotGo}, } insertion := Change{ Action: Insert, To: Path{rootB, libB, fooDotGo}, } changes := []Change{deletion, insertion} // in any order
Alternatively, you can also have your own
Rename Action
and report this as the single change from the old path to the new path, but that complicates the tree comparison algorithm (described in the next section). In any case, it must be pretty easy to detect renames and copies in a later step simply by processing the original output.
Now that we know how to represent changes between two trees, we can already write down the signature of our difftree function:
func DiffTree(a, b Noder) []Change
That is, the DiffTree
function takes two trees indicated by their roots, compares
them and returns the collection of changes needed to turn one into the other.
Detecting insertions and deletions can be implemented by simultaneously walking
both trees using the lexicographic depth-first traversal and comparing the paths
of the returned nodes. The algorithm can be nicely summarized with the following
animation that shows how to detect the insertion of lib/foo.go
.
Simultaneously walking both trees.
If both walkers are on the same path, the element is neither inserted nor deleted and we can advance both walkers to their respective next nodes. We still need to compare the hashes of both paths though, just in case the contents have changed; if the hashes differ, the modification change from the old path to the new one is issued.
If the walkers are pointing to different paths then either the old path was deleted or the new one inserted, depending on which one comes first in the lexicographic depth-first sorting.
I suggest adding the Compare
method to the Path
type
to find out which path comes first. That method returns whether the path comes
before or after another under the particular sorting policy.
This path comparison deserves some careful planning since the naive comparison
of string representations of both paths would return wrong results in some corner
cases. For instance a/z.go
should come before a.go
but the former is
greater than the latter when they are compared as simple strings. The proper way
to compare paths is to compare ancestor names between them, making one step at
a time:
// Compare returns -1, 0 or 1 if the path p comes before, at the same time or
// after the path o, in lexicographic depth-first order; for example:
//
// "a" < "b"
// "a/b/c/d/z" < "b"
// "a/b" < "a/b/a"
// "a/a.go" < "a.go"
func (p Path) Compare(o Path) int {
for i := 0; ; i++ {
switch {
// there are no more elements in any of the paths
case len(o) == len(p) && i == len(p):
return 0
// there are no more elements in o
case i == len(o):
return 1
// there are no more elements in p
case i == len(p):
return -1
// both paths still have elements, compare the ith one
default:
cmp := strings.Compare(p[i].Name(), o[i].Name())
if cmp != 0 {
return cmp
}
}
}
}
The algorithm described above can be summarized as:
-
Start traversing both trees at the same time.
-
While there are nodes left to visit in both trees:
Compare the names and/or hashes of the elements pointed to by both paths and decide:
-
What changes must be issued, if any.
-
How to advance the walkers: the first, the second or both.
-
-
Issue changes to delete the remaining nodes from A or to insert the remaining nodes from B.
Implementing this algorithm involves writing:
-
A tree iterator that traverses trees in lexicographic depth-first order, returning the path to the nodes.
-
A comparator that takes the name and hash of the current paths of two tree iterators and tells us what changes we should issue and how to advance the iterators further.
Let us see how to implement each of them.
This would be a regular depth-first tree iterator if it were not for some interesting details:
-
Siblings are to be iterated in lexicographic order by name.
-
It must be possible to skip whole directories instead of penetrating deeper into them. This is important for efficiency reasons, when the hashes of two directories are the same, we would like to skip all their contents.
Taking those into account, the iterator type will look something like this:
type Iterator interface {
Next() Path // skips directory contents
Step() Path // descends into directories
}
Both Next
and Step
methods return the path of the next node in the tree,
but each of them traverses the tree using a different policy:
-
Next
returns the next path without getting any deeper into the tree, that is, if the current node is a directory, its contents are skipped. -
Step
returns the path to the next node, getting into directories if needed.
I have chosen the names for Next
and Step
in honour of the
gdb commands.
Here is how I recommend to implement such an iterator:
-
The path to the current node should be represented by a stack of frames, the root frame being at the bottom and the frame with the current node being at the top.
-
Each frame is a stack of siblings sorted in the reverse lexicographic order by name so that the "smallest" node is at the top.
-
This means the current node is the top node of the top frame.
-
Next
drops the current node and sets its next sibling to the new current node by popping the top frame. If the top frame ends up empty, just repeat the same operation recursively to remove all empty frames as we climb the tree towards its root. -
Step
behaves just likeNext
for files, but for directories is quite different: it pushes a new frame with the children of the directory.
The following animation shows a series of Next
and Step
method calls over
a tree along with the states of the main stack and its frames:
Iteration demonstration.
An unmodified file. Both paths point to the same unmodified file.
The path in both trees have the same name and hash in the figure above, because they represent the same unchanged file. No change has to be issued and we can proceed with both iterators to the next comparison.
If they were directories instead of files, we would still want to advance both iterators but skip their contents to save some time.
An inserted file. A's path is "bigger" than B's path.
In the case of an inserted file like the one in the figure above, the name of
the paths will be different. The path of the A tree iterator points
to lib/lib.go
and B's iterator points to lib/foo.go
.
As A's path is "bigger" than B's path (comes later in the lexicographic
depth-first sort), we know that B's path was inserted, otherwise A's
iterator would have already reached a similar file. This means we need to issue
the insertion and advance the B's iterator while keeping A's iterator intact,
since the fate of lib/lib.go
has not been decided yet.
If we keep doing this for all the possible cases we end up having the following
table, with a
and b
being the current paths of the iterators and iterA
and
iterB
being the iterators themselves:
Description | Check | Actions | Movements |
---|---|---|---|
inserted file or dir |
a.Name() > b.Name() | insert b recursively | iterB.Next() |
deleted file or dir |
a.Name() < b.Name() | delete a recursively | iterA.Next() |
same file or dir |
same name && same hash |
nothing | iterA.Next() iterB.Next() |
modified file | same name && different hash && both are files |
modify a into b | iterA.Next() iterB.Next() |
deleted a file and created a dir with the same name or vice versa |
same name && different hash && one is file the other is dir |
delete a (recursively) insert b (recursively) |
iterA.Next() iterB.Next() |
add contents to an empty dir |
same name && different hash && both are dirs && a is empty |
insert b recursively | iterA.Next() iterB.Next() |
deleted all contents form a dir |
same name && different hash && both are dirs && b is empty |
delete a recursively | iterA.Next() iterB.Next() |
changed the children of a dir |
same name && different hash && both are dirs && none is empty |
nothing | iterA.Step() iterB.Step() |
Here I wrote down all the possible combinations of nodes and situations you may encounter in paths comparisons and carefully arranged the combinations with similar outcomes in groups so they can be straightforwardly detected with simple checks in your code.
Comparing Git trees becomes an easy task once you split it into the comfortably small pieces.
Even if you don't plan to implement your own version of Git, I hope that understanding the challenges and the proposed solutions at a conceptual level will help you solve similar problems in the future.
P.S. Thanks to my reviewers, Miguel Molina and Vadim Markovtsev for their suggestions, proper English and clean code; the broken English and the convoluted code is all mine.