git-of-theseus

Analyze how a Git repo grows over time

Primary language: Python. License: Apache License 2.0.


Some scripts to analyze Git repos. They produce cool-looking graphs like this (from running it on git itself):

[Image: example graph for the Git repository]

Installing

Run pip install git-of-theseus

Running

First, you need to run git-of-theseus-analyze <path to repo> (see git-of-theseus-analyze --help for a bunch of config). This will analyze a repository and might take quite some time.

After that, you can generate plots! Here are some ways you can do that:

  1. Run git-of-theseus-stack-plot cohorts.json which will write to stack_plot.png
  2. Run git-of-theseus-survival-plot survival.json which will write to survival_plot.png (run it with --help for some options)

If you want to plot multiple repositories, you have to run git-of-theseus-analyze separately for each project and store the data in separate directories using the --outdir flag. Then you can run git-of-theseus-survival-plot <foo/survival.json> <bar/survival.json> (optionally with the --exp-fit flag to fit an exponential decay).
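The multi-repository workflow above can be scripted. The following is a minimal sketch, not part of the project: the directory names and repository paths are hypothetical, and it assumes the console scripts are on your PATH and that --outdir takes the directory as its value.

```python
import subprocess


def build_commands(repos, exp_fit=True):
    """Return the command lines for analyzing each repo, then plotting them together.

    `repos` maps an output directory (hypothetical name) to a repository path.
    """
    commands = []
    for outdir, repo_path in repos.items():
        # One analysis run per repository, each writing into its own directory.
        commands.append(["git-of-theseus-analyze", repo_path, "--outdir", outdir])
    # A single survival plot that reads every survival.json produced above.
    plot = ["git-of-theseus-survival-plot"]
    plot += [f"{outdir}/survival.json" for outdir in repos]
    if exp_fit:
        plot.append("--exp-fit")
    commands.append(plot)
    return commands


def run_all(repos):
    """Run every command in order and stop at the first failure."""
    for cmd in build_commands(repos):
        subprocess.run(cmd, check=True)
```

Calling run_all({"foo": "/path/to/foo", "bar": "/path/to/bar"}) would analyze both repositories and then plot them together; the analysis step can take a long time on large repositories.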

Help

AttributeError: Unknown property labels – if you are seeing this error, upgrade matplotlib by running pip install matplotlib --upgrade

Some pics

Survival of a line of code in a set of interesting repos:

[Image: survival plot for several repositories]

This curve is produced by the git-of-theseus-survival-plot script and shows the percentage of lines in a commit that are still present after x years. It aggregates over all commits, regardless of when they were made. So for x=0 it includes all commits, whereas for x>0 only commits that are at least x years old can be counted (for newer ones we would have to look into the future). The survival curves are estimated using the Kaplan-Meier estimator.
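To make the estimator concrete, here is a generic Kaplan-Meier sketch. It is not the project's own implementation; the input format (one (duration, removed) pair per line of code) is an assumption made for illustration. A line that still exists at the end of the observed history is censored: we only know it survived at least that long.

```python
def kaplan_meier(observations):
    """Estimate a survival curve from (duration, removed) pairs.

    duration: how long the line was observed, e.g. in years.
    removed:  True if the line was deleted at that time, False if it was
              still present when the history ended (censored).
    Returns a list of (time, survival_probability) steps.
    """
    # The curve only steps down at times where a removal was observed.
    event_times = sorted({t for t, removed in observations if removed})
    survival = 1.0
    curve = []
    for t in event_times:
        # Lines observed for at least t are still "at risk" at time t.
        at_risk = sum(1 for d, _ in observations if d >= t)
        events = sum(1 for d, removed in observations if removed and d == t)
        survival *= 1.0 - events / at_risk
        curve.append((t, survival))
    return curve
```

Censored lines still count toward the at-risk total up to the time they were last seen, which is how the estimator uses young commits without having to guess their future.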

You can also add an exponential fit:

[Image: survival plot with exponential fit]
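For intuition about what an exponential fit means here, the sketch below fits S(t) = exp(-k t) to points on a survival curve and reports the half-life. This is one simple way to do such a fit (least squares on log S, with the curve forced through S(0) = 1); the fitting method behind --exp-fit may differ.

```python
import math


def fit_exponential(points):
    """Fit S(t) = exp(-k * t) to (t, S) points and return (k, half_life).

    Uses least squares on log(S) with no intercept. Points with S <= 0 are
    skipped because their logarithm is undefined; at least one point with
    t > 0 is required.
    """
    usable = [(t, s) for t, s in points if s > 0]
    num = sum(t * math.log(s) for t, s in usable)
    den = sum(t * t for t, _ in usable)
    k = -num / den
    # Half-life: the time after which half of the lines are expected to be gone.
    return k, math.log(2) / k
```

A half-life of, say, 5 years would mean that about half the lines from a typical commit are expected to have been changed or removed 5 years later.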

Linux – stack plot:

[Image: stack plot for Linux]

This curve is produced by the git-of-theseus-stack-plot script and shows the total number of lines in a repo broken down into cohorts by the year the code was added.
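The cohort breakdown can be illustrated with a small sketch. The input format is hypothetical (for each point in time, the year in which every surviving line was added) and is not the layout of cohorts.json.

```python
from collections import Counter


def cohort_series(snapshots):
    """Turn per-snapshot line origins into one series per cohort year.

    snapshots: list of lists; each inner list holds, for one point in time,
               the year in which every line alive at that time was added.
    Returns {year: [line_count_at_snapshot_0, line_count_at_snapshot_1, ...]}.
    """
    counts = [Counter(years) for years in snapshots]
    cohorts = sorted({year for c in counts for year in c})
    # A cohort that has no lines at some snapshot contributes 0 there.
    return {year: [c.get(year, 0) for c in counts] for year in cohorts}
```

Stacking these series on top of each other gives the total number of lines at each point in time, with each band showing how much of one year's code is still around.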

Node – stack plot:

[Image: stack plot for Node]

Rails – stack plot:

[Image: stack plot for Rails]

Plotting other stuff

git-of-theseus-analyze will write exts.json, cohorts.json and authors.json. You can run git-of-theseus-stack-plot authors.json to plot author statistics as well, or git-of-theseus-stack-plot exts.json to plot file extension statistics. For author statistics, you might want to create a .mailmap file to deduplicate authors. For instance, here's the author statistics for Kubernetes:

[Image: author statistics for Kubernetes]

You can also normalize it to 100%. Here's author statistics for Git:

[Image: author statistics for Git, normalized to 100%]
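Normalizing to 100% means dividing each band by the total at the same point in time. The sketch below shows the arithmetic on a hypothetical dictionary of per-author line counts; it does not read the tool's JSON output.

```python
def normalize_to_percent(series):
    """Convert stacked counts into percentages of the total at each time point.

    series: {label: [count_at_t0, count_at_t1, ...]}, all lists the same length.
    Returns the same structure with every value expressed as a percentage.
    """
    if not series:
        return {}
    n = len(next(iter(series.values())))
    totals = [sum(values[i] for values in series.values()) for i in range(n)]
    return {
        label: [100.0 * v / total if total else 0.0
                for v, total in zip(values, totals)]
        for label, values in series.items()
    }
```

For example, normalize_to_percent({"alice": [3, 1], "bob": [1, 3]}) gives alice 75% and then 25% of the lines, and bob the remainder.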

Other stuff

Markovtsev Vadim implemented a very similar analysis that claims to be between 20% and 6x faster than Git of Theseus. It's named Hercules, and there's a great blog post about all the complexity that goes into analyzing Git history.