git_filter

This is a utility for filtering a git branch.

What, how, why?

It outputs a new branch containing only the things we want included, in all revisions. It writes a new history that preserves only the part relevant to the new branch. The tool grew out of efforts to do the same thing first with git filter-branch and then, when that was found to be too slow, directly with git plumbing commands. On my test repository with 100k commits, the original git filter-branch based filtering took around 2 days to finish one filter set. The same job takes about a minute with git_filter on the same hardware.

The input is a positive list of files and directories to be included.

Advanced features

The tool can perform several filterings simultaneously. This is the normal case when you want to split a large git repository into smaller ones and need two or more disjoint sets of data which together contain the whole original repository.
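When preparing such a split, it can help to check beforehand that the filter sets really are disjoint and together cover every file. The sketch below is not part of git_filter. It assumes each filter entry is either an exact file path or a directory prefix, which may not match the tool's real matching rules, and the names matches and check_split are made up for illustration.

```python
def matches(path, entries):
    """Assumed rule: an entry is an exact file path or a directory prefix."""
    for entry in entries:
        entry = entry.rstrip("/")
        if entry and (path == entry or path.startswith(entry + "/")):
            return True
    return False


def check_split(paths, filter_sets):
    """Return (overlapping, uncovered) paths for the given filter sets."""
    overlapping, uncovered = [], []
    for path in paths:
        owners = [name for name, entries in filter_sets.items()
                  if matches(path, entries)]
        if len(owners) > 1:
            overlapping.append(path)   # claimed by more than one filter set
        elif not owners:
            uncovered.append(path)     # claimed by no filter set
    return overlapping, uncovered


# Hypothetical repository contents and filter sets.
paths = ["lib/a.c", "lib/b.c", "app/main.c", "README"]
filter_sets = {"lib": ["lib"], "app": ["app", "README"]}
overlapping, uncovered = check_split(paths, filter_sets)
print(overlapping, uncovered)
```

Both lists being empty means the hypothetical sets are disjoint and cover everything in paths.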

git_filter leaves a lot of loose objects behind when it finishes, so it is a very good idea to repack the repository (e.g. git repack -ad) before continuing to work with the resulting repository.

In addition to the new branches, git_filter outputs a .revinfo text file per branch, with one line per new revision showing its correspondence to the original revision it is derived from. The purpose of this is to allow tag information to be recreated.
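To give an idea of how such a file could be consumed, here is a minimal sketch that reads revision pairs into a dictionary. The exact .revinfo line layout is not described here, so the sketch assumes two whitespace-separated ids per line, the new revision first and the original second. Check a real .revinfo file before relying on this. The ids in SAMPLE are placeholders.

```python
def parse_revinfo(text):
    """Return {original_id: new_id}.

    ASSUMPTION: each line is '<new_id> <original_id>'; the real
    .revinfo layout may differ.
    """
    mapping = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or unexpected lines
        new_id, original_id = fields[0], fields[1]
        mapping[original_id] = new_id
    return mapping


# Placeholder ids, not real commit hashes.
SAMPLE = """\
1111111 aaaaaaa
2222222 bbbbbbb
"""

orig_to_new = parse_revinfo(SAMPLE)
print(orig_to_new["aaaaaaa"])
```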

For me, the purpose of git_filter was to generate final repositories which contain none of the original commits. Doing this required some further work.

The push_clean_repos script creates a clean repository for each of the filtered branches generated by the git_filter run. Each new repository has the same name as the corresponding branch. It takes the same configuration file as git_filter as its argument. The newtags.py script uses the .revinfo files from git_filter and the tag information in the source repository to map the tags in the source to each of the destination repositories.
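The idea behind the tag mapping can be sketched as follows. This is not the code of newtags.py. It only illustrates the lookup, given a dictionary of source tags and a dictionary from original to new revision ids. The function name map_tags is hypothetical, and skipping tags whose original revision has no counterpart in the filtered branch is an assumption about the desired behaviour.

```python
def map_tags(source_tags, orig_to_new):
    """Translate {tag: original_id} into {tag: new_id}.

    ASSUMPTION: tags pointing at revisions that are absent from the
    filtered branch are skipped and reported separately.
    """
    mapped, skipped = {}, []
    for tag, original_id in sorted(source_tags.items()):
        new_id = orig_to_new.get(original_id)
        if new_id is None:
            skipped.append(tag)
        else:
            mapped[tag] = new_id
    return mapped, skipped


# Placeholder tags and ids for illustration only.
source_tags = {"v1.0": "aaaaaaa", "v1.1": "bbbbbbb", "v2.0": "ccccccc"}
orig_to_new = {"aaaaaaa": "1111111", "bbbbbbb": "2222222"}

mapped, skipped = map_tags(source_tags, orig_to_new)
print(mapped)
print(skipped)
```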

An example

I have a git repository, repo, that I want to split up. It is located in the current directory.

With the configuration in git_filter.cfg, run:

./git_filter git_filter.cfg && ./push_clean_repos git_filter.cfg

git_filter saves the necessary state (in the .git directory) so that processing of a full history can be resumed without generating all the initial commits again. We can run it once on the entire history and then run it incrementally on new commits, producing the same result as starting from scratch each time. This results in much shorter processing times. Tell git_filter to do this by adding the option continue on the command line after the configuration file, thus:

./git_filter git_filter.cfg continue

Building git_filter

Just a plain

make

should be enough to build git_filter. The build automatically downloads libgit2 and builds it as part of the process. It has been tested to compile on Mac (with Xcode installed) and on Ubuntu Linux. Neither of these systems had a preinstalled libgit2.

Config File Syntax

Look at the filter.cfg example; it is commented.

Config items and data

The config file parser is very simple, so a single space is the only allowed separator. Parameter names must be exactly 4 characters, followed by a colon and a space. Lines beginning with # are comment lines and are ignored.

REPO: <repo>

The configuration file should contain one REPO tag with the location of the repository to filter.

REVN: [range|ref] <refspec>

A revision specification: either a range, e.g. master~1000..master, or a (branch) reference, e.g. refs/heads/master.

BASE: <dir>

A base directory for the filter file lists.

FILT: <name> <file>

A space-separated name and filter file pair.

TPFX: <tag prefix>

The prefix for tags and output repo names, prepended to the filter set name.
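To make the syntax rules concrete, here is a small sketch of a parser that follows them: four-character names, a colon, a single space, and # comment lines. It is not the parser used by git_filter. The paths, filter names and prefix in SAMPLE are placeholders, and treating blank lines as ignorable is an assumption.

```python
def parse_config(text):
    """Parse config text into a list of (name, value) pairs."""
    entries = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        # ASSUMPTION: blank lines are skipped, like comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        # Exactly four characters, then a colon and one space.
        if len(line) < 6 or line[4:6] != ": ":
            raise ValueError(f"line {lineno}: expected 'NAME: value', got {line!r}")
        entries.append((line[:4], line[6:]))
    return entries


# Hypothetical configuration; every path and name is a placeholder.
SAMPLE = """\
# which repository and revisions to filter
REPO: /path/to/repo
REVN: ref refs/heads/master
BASE: /path/to/filters
FILT: core core.flt
FILT: docs docs.flt
TPFX: split-
"""

config = parse_config(SAMPLE)
# FILT values are a space-separated name and filter file pair.
filters = [value.split(" ", 1) for name, value in config if name == "FILT"]
print(filters)
```

Returning a list of pairs rather than a dictionary keeps repeated FILT lines, one per filter set.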

License

GNU GPL v2

tmannerm/git_filter