This is a utility for filtering a git branch.
It outputs a new branch that contains only the files and directories we want
to keep, in all revisions.
It writes a new history that preserves the part of the original history which
is relevant for the new branch.
It grew out of attempts to do the same thing with git filter-branch, and then
directly with git plumbing commands once git filter-branch was found to be
too slow.
On my test repository with 100k commits, the original git filter-branch
based filtering took around 2 days to finish one filter set. The same job with
git_filter takes about a minute on the same hardware.
The input is a positive list of the files and directories to include.
The tool can run several filterings simultaneously. This is the normal case
when you split a large git repository into smaller ones and want two or more
disjoint sets of data which together contain the whole original repository.
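As a rough illustration of what a positive list means, the sketch below decides
which paths each filter set would keep. The helper names is_included and
split_paths, and the matching rule (an entry matches the file itself or
everything beneath it when it names a directory), are assumptions made for
illustration; the matching done by git_filter itself may differ.

```python
from typing import Dict, Iterable, List


def is_included(path: str, includes: Iterable[str]) -> bool:
    """Return True if `path` is covered by an entry in the include list.

    Assumption: an entry matches either the exact file or, when it names
    a directory, every path below that directory.
    """
    for entry in includes:
        entry = entry.rstrip("/")
        if not entry:
            continue  # ignore empty entries
        if path == entry or path.startswith(entry + "/"):
            return True
    return False


def split_paths(paths: Iterable[str],
                filter_sets: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Collect, per filter set, the paths covered by its include list."""
    paths = list(paths)
    return {
        name: [p for p in paths if is_included(p, includes)]
        for name, includes in filter_sets.items()
    }


if __name__ == "__main__":
    # Hypothetical filter sets and paths.
    sets = {"lib": ["src/lib/", "README"], "app": ["src/app"]}
    tree = ["src/lib/a.c", "src/app/main.c", "README", "src/library/x.c"]
    print(split_paths(tree, sets))
```

Note that src/library/x.c is not matched by the src/lib entry, because the
comparison is done on whole path components rather than on raw string prefixes.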
git_filter leaves a lot of loose objects behind when it is finished,
so it is a very good idea to repack the repository (e.g. git repack -ad)
before continuing to work with the resulting repository.
In addition to the new branches, git_filter outputs a .revinfo text file
per branch, with one line per new revision showing its correspondence to the
original revision it is derived from. The purpose of this is to allow
recreation of tag information.
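A .revinfo file can be loaded into a lookup table from original revisions to
new ones. The sketch below assumes each line holds two whitespace-separated
revision ids, the new revision first and the original second. That column
order and the function name parse_revinfo are assumptions, so check a
generated file before relying on it.

```python
from typing import Dict, Iterable


def parse_revinfo(lines: Iterable[str]) -> Dict[str, str]:
    """Map original revision id -> new revision id.

    Assumed line layout (not confirmed by the README): "<new> <original>".
    """
    mapping: Dict[str, str] = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        new_rev, orig_rev = fields[0], fields[1]
        mapping[orig_rev] = new_rev
    return mapping


if __name__ == "__main__":
    # Made-up ids standing in for real commit hashes.
    sample = ["aaa111 bbb222\n", "\n", "ccc333 ddd444\n"]
    print(parse_revinfo(sample))
```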
For me, the purpose of the git_filter program was to generate final
repositories which contain none of the original commits.
To do this I needed to do some further work.
The push_clean_repos script creates a clean repository for each of the
filtered branches generated by the git_filter run.
Each new repository has the same name as the corresponding branch.
The script takes the same configuration file argument as git_filter.
The newtags.py script uses the .revinfo files from git_filter and the
tag information in the source repository to map the tags in the source
to each of the destination repositories.
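The core of such a tag mapping can be sketched as a dictionary lookup. This is
not the code of newtags.py; map_tags is a hypothetical helper, and it makes the
simplifying assumption that a tag whose original revision has no counterpart in
the filtered branch is dropped.

```python
from typing import Dict


def map_tags(source_tags: Dict[str, str],
             orig_to_new: Dict[str, str]) -> Dict[str, str]:
    """Translate tag targets from original revisions to filtered revisions.

    source_tags: tag name -> original revision id
    orig_to_new: original revision id -> new revision id
    Simplification: tags with no matching new revision are left out.
    """
    return {
        tag: orig_to_new[orig_rev]
        for tag, orig_rev in source_tags.items()
        if orig_rev in orig_to_new
    }


if __name__ == "__main__":
    # Made-up tags and revision ids.
    tags = {"v1.0": "bbb222", "v2.0": "fff999"}
    revmap = {"bbb222": "aaa111"}
    print(map_tags(tags, revmap))
```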
Suppose I have a git repository repo that I want to split up, located in the
current directory. With the configuration in git_filter.cfg, run:
./git_filter git_filter.cfg && ./push_clean_repos git_filter.cfg
git_filter saves the necessary state (in the .git directory) to allow
a full history processing to be resumed without generating all the initial
commits again.
We can run it once on the entire history and then run it incrementally on
new commits, and produce the same result as starting from scratch each time.
This results in much shorter processing times.
Tell git_filter to do this by adding the option continue on the command
line after the configuration file, thus:
./git_filter git_filter.cfg continue
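If the filtering is driven from a script, the choice between a full and an
incremental run comes down to whether continue is appended after the
configuration file. A minimal sketch, where build_command and run_filter are
hypothetical helper names and the binary is assumed to sit in the current
directory as in the commands above:

```python
import subprocess
from typing import List


def build_command(config: str, incremental: bool = False) -> List[str]:
    """Build the git_filter command line; 'continue' follows the config file."""
    cmd = ["./git_filter", config]
    if incremental:
        cmd.append("continue")
    return cmd


def run_filter(config: str, incremental: bool = False) -> None:
    """Run git_filter and raise CalledProcessError if it exits non-zero."""
    subprocess.run(build_command(config, incremental), check=True)


if __name__ == "__main__":
    # Only prints the command; call run_filter() to actually execute it.
    print(build_command("git_filter.cfg", incremental=True))
```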
Just a plain make should be enough to build git_filter.
It automatically downloads libgit2 and builds it as part of the process.
It has been tested to compile on Mac (with Xcode installed) and on Ubuntu
Linux. Neither of these systems had a pre-installed libgit2.
Look at the filter.cfg example; it is commented.
The config file parser is very simple, so a single space is the only allowed
separator. The parameter names should be exactly 4 characters followed by a
colon and a space. Lines beginning with a # are comment lines and are ignored.
- REPO: <repo>
  The configuration file should contain one REPO tag with the location of the
  repository to filter.
- REVN: [range|ref] <refspec>
  A revision specification. Either a range, e.g. master~1000..master, or a
  (branch) reference, e.g. refs/heads/master.
- BASE: <dir>
  A base directory for the filter file lists.
- FILT: <name> <file>
  Space separated name and filter file pair.
- TPFX: <tag prefix>
  The prefix for tags and output repo names, prepended to the filter set name.
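The line format described above (a 4-character name, a colon, a single space,
then the value, with # comment lines) is simple enough to sketch in a few
lines. The parser below is an illustration only, not the parser used by
git_filter. The example values are placeholders, and the REVN line follows one
reading of the syntax, namely the literal word ref followed by the reference.
Repeating FILT once per filter set is also an assumption.

```python
from typing import List, Tuple


def parse_config(text: str) -> List[Tuple[str, str]]:
    """Return (name, value) pairs in file order.

    A list is used rather than a dict so a name may appear more than once.
    """
    entries: List[Tuple[str, str]] = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue  # blank line or comment
        name, sep, value = line.partition(": ")
        if sep != ": " or len(name) != 4:
            raise ValueError(f"line {lineno}: expected 'NAME: value', got {line!r}")
        entries.append((name, value))
    return entries


# Hypothetical configuration; every path and name here is a placeholder.
EXAMPLE = """\
# split one repository into two filter sets
REPO: /path/to/repo
REVN: ref refs/heads/master
BASE: /path/to/filters
TPFX: split-
FILT: lib lib.flt
FILT: app app.flt
"""

if __name__ == "__main__":
    for name, value in parse_config(EXAMPLE):
        print(name, "->", value)
```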