Remove blobs from repository history
braun-steven opened this issue · 3 comments
braun-steven commented
As observed by @felixdivo, the repository history seems to be polluted with large blobs:
$ git clone git@github.com:SPFlow/SPFlow.git
Cloning into 'SPFlow'...
remote: Enumerating objects: 25226, done.
remote: Counting objects: 100% (5699/5699), done.
remote: Compressing objects: 100% (1544/1544), done.
remote: Total 25226 (delta 4186), reused 5580 (delta 4141), pack-reused 19527
Receiving objects: 100% (25226/25226), 128.74 MiB | 7.17 MiB/s, done.
Resolving deltas: 100% (17655/17655), done.
Updating files: 100% (424/424), done.
I.e., a fresh clone syncs 128 MiB.
Looking at the $N largest blobs, those are mostly datasets and Jupyter notebooks with data attached:
# List all blobs, sorted by size in descending order, display the top N
$ N=10; git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| sed -n 's/^blob //p' \
| sort --numeric-sort --reverse --key=2 \
| head -n $N \
| awk '{print $2"\t"$1}' \
| while read -r size sha; do \
line=$(git rev-list --objects --all | grep $sha); \
echo $size $line; \
done \
| sort --numeric-sort --reverse --key=1 \
| awk 'BEGIN { OFS="\t"; print "Size", "SHA", "File"} {print $1, $2, $3}'
Size SHA File
78400000 dcb8e04c9726c685c87d37ebb9b533171dc5a298 src/spn/data/binary/convex.test.data
73273832 a48113203c301a7ee4ff55865a66b57b96be9873 src/spn/data/categorical/bookmarks/bookmarks.arff
47040016 bbce27659e0fc2b7ed2a64c127849380a477099b src/spn/data/count/mnist/train-images-idx3-ubyte
23288778 375efd7d894f15b38173ec70a54186ec11b7a683 src/spn/experiments/hyperspectral/cerc15dai175.npz
23051776 09af7e69c90a2e3bcff23f794ecb141f189c4671 src/spn/data/binary/kdd.ts.data
20553260 b5d2937144d0d1340f14c53c807b83f79be21bbd src/spn/data/binary/c20ng.ts.data
17956533 46bc71904158832b2e940df4f8cae997cd45c1ea src/spn/tests/parametric_samples/exp_rate_2.csv
17500930 2bad7f5dd511f57fa5ea732c4f291f1fede50bb8 src/spn/tests/parametric_samples/gamma_shape2_scale0.5.csv
17311308 939c8391c77e4764c74553b011ba7ced924337de src/spn/data/binary/msweb.ts.data
16890649 fc4ce9bff58b54a47a3dfb9d05b5e8cd41dec0a9 src/spn/tests/parametric_samples/norm_mean10_sd3.csv
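Side note: since %(rest) already carries the path that git rev-list emits alongside each object id, the same table can be built in a single pass, without re-walking the whole history once per blob. A sketch:
# One-pass variant: rev-list prints "<sha> <path>" and cat-file's
# %(rest) passes the path through, so no per-blob grep is needed.
$ N=10; git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| sed -n 's/^blob //p' \
| sort --numeric-sort --reverse --key=2 \
| head -n $N \
| awk 'BEGIN { OFS="\t"; print "Size", "SHA", "File" } { print $2, $1, $3 }'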
We should use https://github.com/rtyley/bfg-repo-cleaner to remove these files before v1.0.0.
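In the meantime, anyone who just needs a checkout can dodge most of the 128 MiB with a partial clone. A sketch, assuming the remote supports partial-clone filters (github.com does); blobs over the limit are only fetched on demand when a checkout actually needs them:
# Skip history blobs above 1 MiB at clone time.
$ git clone --filter=blob:limit=1m git@github.com:SPFlow/SPFlow.git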
braun-steven commented
I've tested it locally with a 1M limit on files. We can shrink the .git dir from 140M to 14M.
❯ bfg --strip-blobs-bigger-than 1M SPFlow-copy/
Using repo : /home/steven/projects/SPFlow-copy/.git
Scanning packfile for large blobs: 26686
Scanning packfile for large blobs completed in 125 ms.
Found 72 blob ids for large blobs - biggest=78400000 smallest=1159000
Total size (unpacked)=599451016
Found 984 objects to protect
Found 54 commit-pointing refs : HEAD, refs/heads/dev_learnSPN, refs/heads/dev_modules, ...
Protected commits
-----------------
These are your protected commits, and so their contents will NOT be altered:
* commit d0559f82 (protected by 'HEAD')
Cleaning
--------
Found 1337 commits
Cleaning commits: 100% (1337/1337)
Cleaning commits completed in 765 ms.
Updating 32 Refs
----------------
Ref Before After
--------------------------------------------------------------------
refs/heads/dev_learnSPN | 046b900d | 12e8d32b
refs/heads/gh-pages | a748c667 | f631a248
refs/heads/learnspn_b_test | 32b66ee2 | 49dd5cde
refs/heads/master | d01f71d6 | 29101c0a
refs/remotes/felixdivo/master | ad656c16 | 948b5523
refs/remotes/origin/dev_learnSPN | 046b900d | 12e8d32b
refs/remotes/origin/gh-pages | a748c667 | f631a248
refs/remotes/origin/learnspn_b_test | 32b66ee2 | 49dd5cde
refs/remotes/origin/master | d01f71d6 | 29101c0a
refs/remotes/private/add-github-issue-template | c29ae78e | efc68608
refs/remotes/private/add-sklearn-classifier | 21c58970 | 2191089f
refs/remotes/private/add-sphinx-documentation | 5cac9101 | 19b9213d
refs/remotes/private/bernoulli-layer | 8db98c6c | f2270578
refs/remotes/private/dev_learnSPN | 046b900d | 12e8d32b
refs/remotes/private/develop | 0f7639e6 | ba08e40c
...
Updating references: 100% (32/32)
...Ref update completed in 71 ms.
Commit Tree-Dirt History
------------------------
Earliest Latest
| |
DDDDDDDDDDDDDDDDDDDDDDDDDDDDD....D.....D......Dm.......D.DDD
D = dirty commits (file tree fixed)
m = modified commits (commit message or parents changed)
. = clean commits (no changes to file tree)
Before After
-------------------------------------------
First modified commit | 19266a9c | 762535f7
Last dirty commit | eb843114 | d579598c
Deleted files
-------------
Filename Git id
------------------------------------------------------------------------
Corel5k-train.arff | 7581f88b (7.5 MB)
Corel5k.arff | 413f91ee (8.3 MB)
accidents.ts.data | cefacb18 (2.7 MB)
ad.test.data | 1a6ecad2 (1.5 MB)
ad.ts.data | b9dfaba7 (7.3 MB)
adults.csv | ac05b131 (2.9 MB)
ba_notebook.ipynb | 00d87385 (4.1 MB)
baudio.ts.data | 53b36fd2 (2.9 MB)
bbc.ts.data | 1109ad0c (3.4 MB)
bern_prob0.7.csv | 75a822ba (1.9 MB)
bibtex-test.arff | c6d43a85 (1.2 MB)
bibtex-train.arff | 70c68362 (2.2 MB)
bibtex.arff | f702c066 (3.3 MB)
bnetflix.ts.data | 606d88a4 (2.9 MB)
book.test.data | 90460f56 (1.7 MB)
...
In total, 2541 object ids were changed. Full details are logged here:
/home/steven/projects/SPFlow-copy.bfg-report/2023-12-11/14-31-23
BFG run is complete! When ready, run: git reflog expire --expire=now --all && git gc --prune=now --aggressive
~/projects/SPFlow-copy dev_tensorly* ≡
❯ git reflog expire --expire=now --all && git gc --prune=now --aggressive
Enumerating objects: 26525, done.
Counting objects: 100% (26525/26525), done.
Delta compression using up to 8 threads
Compressing objects: 100% (26101/26101), done.
Writing objects: 100% (26525/26525), done.
Total 26525 (delta 18930), reused 5835 (delta 0), pack-reused 0
~/projects
❯ du -sh SPFlow-copy/.git
14M SPFlow-copy/.git
~/projects
❯ du -sh SPFlow/.git
140M SPFlow/.git
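As a sanity check on the cleaned copy, the largest surviving blob should now be below the 1M cutoff. A sketch, reusing the listing approach from above:
# Print the single largest blob left in the cleaned history.
❯ git -C SPFlow-copy rev-list --objects --all \
| git -C SPFlow-copy cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' \
| sed -n 's/^blob //p' \
| sort --numeric-sort --reverse \
| head -n 1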
Let's do this right before the v1.0.0 release, like #146.
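For completeness: the rewrite only exists locally until the refs are force-pushed, after which everyone has to re-clone (or hard-reset) their copies. A sketch; if master is protected, the branch protection may need to be lifted temporarily:
# Publish the rewritten history (destructive for anyone tracking it).
❯ git push origin --force --all
❯ git push origin --force --tags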
felixdivo commented
Sounds reasonable. Though you could even consider going down to 500kB or below, as long as we only/mainly delete .arff, .data, ... files.
braun-steven commented
Yeah, I guess we should additionally just filter out any kind of dataset file.
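A sketch of how that could look with BFG's --delete-files glob; the extension list here is illustrative, not exhaustive, and BFG leaves the copies in the protected HEAD commit alone:
# Drop dataset files from history regardless of size. Quote the glob
# so the shell doesn't expand the braces; the extensions are examples.
❯ bfg --delete-files '*.{arff,data,npz}' SPFlow-copy/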