Remove blobs from repository history
braun-steven opened this issue · 3 comments
braun-steven commented
As observed by @felixdivo, the repository history seems to be polluted with large blobs:
$ git clone git@github.com:SPFlow/SPFlow.git
Cloning into 'SPFlow'...
remote: Enumerating objects: 25226, done.
remote: Counting objects: 100% (5699/5699), done.
remote: Compressing objects: 100% (1544/1544), done.
remote: Total 25226 (delta 4186), reused 5580 (delta 4141), pack-reused 19527
Receiving objects: 100% (25226/25226), 128.74 MiB | 7.17 MiB/s, done.
Resolving deltas: 100% (17655/17655), done.
Updating files: 100% (424/424), done.
I.e., a fresh clone syncs 128 MiB.
Looking at the $N largest blobs, those are mostly datasets and Jupyter notebooks with data attached:
# List all blobs, sorted by size in descending order, display the top N
$ N=10; git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| sed -n 's/^blob //p' \
| sort --numeric-sort --reverse --key=2 \
| head -n $N \
| awk '{print $2"\t"$1}' \
| while read -r size sha; do \
line=$(git rev-list --objects --all | grep $sha); \
echo $size $line; \
done \
| sort --numeric-sort --reverse --key=1 \
| awk 'BEGIN { OFS="\t"; print "Size", "SHA", "File"} {print $1, $2, $3}'
Size SHA File
78400000 dcb8e04c9726c685c87d37ebb9b533171dc5a298 src/spn/data/binary/convex.test.data
73273832 a48113203c301a7ee4ff55865a66b57b96be9873 src/spn/data/categorical/bookmarks/bookmarks.arff
47040016 bbce27659e0fc2b7ed2a64c127849380a477099b src/spn/data/count/mnist/train-images-idx3-ubyte
23288778 375efd7d894f15b38173ec70a54186ec11b7a683 src/spn/experiments/hyperspectral/cerc15dai175.npz
23051776 09af7e69c90a2e3bcff23f794ecb141f189c4671 src/spn/data/binary/kdd.ts.data
20553260 b5d2937144d0d1340f14c53c807b83f79be21bbd src/spn/data/binary/c20ng.ts.data
17956533 46bc71904158832b2e940df4f8cae997cd45c1ea src/spn/tests/parametric_samples/exp_rate_2.csv
17500930 2bad7f5dd511f57fa5ea732c4f291f1fede50bb8 src/spn/tests/parametric_samples/gamma_shape2_scale0.5.csv
17311308 939c8391c77e4764c74553b011ba7ced924337de src/spn/data/binary/msweb.ts.data
16890649 fc4ce9bff58b54a47a3dfb9d05b5e8cd41dec0a9 src/spn/tests/parametric_samples/norm_mean10_sd3.csv
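Side note: since %(rest) already carries the path that git rev-list emits alongside each object id, the same table can be built in a single pass, without re-walking the whole history once per blob. A sketch:
# One-pass variant: rev-list prints "<sha> <path>" and cat-file's
# %(rest) passes the path through, so no per-blob grep is needed.
$ N=10; git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| sed -n 's/^blob //p' \
| sort --numeric-sort --reverse --key=2 \
| head -n $N \
| awk 'BEGIN { OFS="\t"; print "Size", "SHA", "File" } { print $2, $1, $3 }'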
We should use https://github.com/rtyley/bfg-repo-cleaner to remove these files before v1.0.0.
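In the meantime, anyone who just needs a checkout can dodge most of the 128 MiB with a partial clone. A sketch, assuming the remote supports partial-clone filters (github.com does); blobs over the limit are only fetched on demand when a checkout actually needs them:
# Skip history blobs above 1 MiB at clone time.
$ git clone --filter=blob:limit=1m git@github.com:SPFlow/SPFlow.git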
braun-steven commented
I've tested it locally with a 1M limit on files. We can shrink the .git dir from 140M to 14M.
❯ bfg --strip-blobs-bigger-than 1M SPFlow-copy/
Using repo : /home/steven/projects/SPFlow-copy/.git
Scanning packfile for large blobs: 26686
Scanning packfile for large blobs completed in 125 ms.
Found 72 blob ids for large blobs - biggest=78400000 smallest=1159000
Total size (unpacked)=599451016
Found 984 objects to protect
Found 54 commit-pointing refs : HEAD, refs/heads/dev_learnSPN, refs/heads/dev_modules, ...
Protected commits
-----------------
These are your protected commits, and so their contents will NOT be altered:
* commit d0559f82 (protected by 'HEAD')
Cleaning
--------
Found 1337 commits
Cleaning commits: 100% (1337/1337)
Cleaning commits completed in 765 ms.
Updating 32 Refs
----------------
Ref Before After
--------------------------------------------------------------------
refs/heads/dev_learnSPN | 046b900d | 12e8d32b
refs/heads/gh-pages | a748c667 | f631a248
refs/heads/learnspn_b_test | 32b66ee2 | 49dd5cde
refs/heads/master | d01f71d6 | 29101c0a
refs/remotes/felixdivo/master | ad656c16 | 948b5523
refs/remotes/origin/dev_learnSPN | 046b900d | 12e8d32b
refs/remotes/origin/gh-pages | a748c667 | f631a248
refs/remotes/origin/learnspn_b_test | 32b66ee2 | 49dd5cde
refs/remotes/origin/master | d01f71d6 | 29101c0a
refs/remotes/private/add-github-issue-template | c29ae78e | efc68608
refs/remotes/private/add-sklearn-classifier | 21c58970 | 2191089f
refs/remotes/private/add-sphinx-documentation | 5cac9101 | 19b9213d
refs/remotes/private/bernoulli-layer | 8db98c6c | f2270578
refs/remotes/private/dev_learnSPN | 046b900d | 12e8d32b
refs/remotes/private/develop | 0f7639e6 | ba08e40c
...
Updating references: 100% (32/32)
...Ref update completed in 71 ms.
Commit Tree-Dirt History
------------------------
Earliest Latest
| |
DDDDDDDDDDDDDDDDDDDDDDDDDDDDD....D.....D......Dm.......D.DDD
D = dirty commits (file tree fixed)
m = modified commits (commit message or parents changed)
. = clean commits (no changes to file tree)
Before After
-------------------------------------------
First modified commit | 19266a9c | 762535f7
Last dirty commit | eb843114 | d579598c
Deleted files
-------------
Filename Git id
------------------------------------------------------------------------
Corel5k-train.arff | 7581f88b (7.5 MB)
Corel5k.arff | 413f91ee (8.3 MB)
accidents.ts.data | cefacb18 (2.7 MB)
ad.test.data | 1a6ecad2 (1.5 MB)
ad.ts.data | b9dfaba7 (7.3 MB)
adults.csv | ac05b131 (2.9 MB)
ba_notebook.ipynb | 00d87385 (4.1 MB)
baudio.ts.data | 53b36fd2 (2.9 MB)
bbc.ts.data | 1109ad0c (3.4 MB)
bern_prob0.7.csv | 75a822ba (1.9 MB)
bibtex-test.arff | c6d43a85 (1.2 MB)
bibtex-train.arff | 70c68362 (2.2 MB)
bibtex.arff | f702c066 (3.3 MB)
bnetflix.ts.data | 606d88a4 (2.9 MB)
book.test.data | 90460f56 (1.7 MB)
...
In total, 2541 object ids were changed. Full details are logged here:
/home/steven/projects/SPFlow-copy.bfg-report/2023-12-11/14-31-23
BFG run is complete! When ready, run: git reflog expire --expire=now --all && git gc --prune=now --aggressive
~/projects/SPFlow-copy dev_tensorly* ≡
❯ git reflog expire --expire=now --all && git gc --prune=now --aggressive
Enumerating objects: 26525, done.
Counting objects: 100% (26525/26525), done.
Delta compression using up to 8 threads
Compressing objects: 100% (26101/26101), done.
Writing objects: 100% (26525/26525), done.
Total 26525 (delta 18930), reused 5835 (delta 0), pack-reused 0
~/projects
❯ du -sh SPFlow-copy/.git
14M SPFlow-copy/.git
~/projects
❯ du -sh SPFlow/.git
140M SPFlow/.git
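As a sanity check on the cleaned copy, the largest surviving blob should now be below the 1M cutoff. A sketch, reusing the listing approach from above:
# Print the single largest blob left in the cleaned history.
❯ git -C SPFlow-copy rev-list --objects --all \
| git -C SPFlow-copy cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' \
| sed -n 's/^blob //p' \
| sort --numeric-sort --reverse \
| head -n 1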
Let's do this right before the v1.0.0 release, like #146.
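For completeness: the rewrite only exists locally until the refs are force-pushed, after which everyone has to re-clone (or hard-reset) their copies. A sketch; if master is protected, the branch protection may need to be lifted temporarily:
# Publish the rewritten history (destructive for anyone tracking it).
❯ git push origin --force --all
❯ git push origin --force --tags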
felixdivo commented
Sounds reasonable. Though you could even consider going down to 500kB or below, as long as we only/mainly delete .arff, .data, ... files.
braun-steven commented
Yeah, I guess we should additionally just filter out any kind of dataset file.
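A sketch of how that could look with BFG's --delete-files glob; the extension list here is illustrative, not exhaustive, and BFG leaves the copies in the protected HEAD commit alone:
# Drop dataset files from history regardless of size. Quote the glob
# so the shell doesn't expand the braces; the extensions are examples.
❯ bfg --delete-files '*.{arff,data,npz}' SPFlow-copy/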