bedapub/besca

Cleaning up a git repo for reducing the repository size

Closed this issue · 6 comments

A clone of the GitHub repository is currently larger than 500 MB. We should make a backup and consider cleaning up the repository to reduce its size: a smaller clone would help users with slower internet connections, and we don't need the full history. See e.g.
https://medium.com/@sangeethkumar.tvm.kpm/cleaning-up-a-git-repo-for-reducing-the-repository-size-d11fa496ba48

Thank you for raising the issue. I will take a look at how to reduce the repository size.

Biggest files in the repo:

(base) ➜  besca git:(master) ✗ ./size.sh
All sizes are in kB. The pack column is the size of the object, compressed, inside the pack file.
size   pack   SHA                                       location
91326  23503  2b37bcdd1a6306dd57b0bc9d986393441f416afd  besca/datasets/data/pbmc_storage_raw_downsampled.h5ad
56271  27156  a18f9f3740692204a0262530d2c6a0edbb9e8057  besca/datasets/data/pbmc3k_processed.h5ad
44202  33341  833f4407468139196b6f4f8d8ff163f6ea989b55  workbooks/02- Annotating Cellines.ipynb
30501  23494  b1a5a6014c9e33a1db06888c5e21da1f73a626d6  workbooks/celltype_annotation_besca.ipynb
29559  26717  02329d1f37acf94cfc9f3663e4ecc7c60d52311f  besca/datasets/data/pbmc_storage_raw_downsampled.h5ad
28679  21438  08094d2a98d6f3820ceb1fa3b52cd5f021c6181d  workbooks/celltype_annotation_besca.ipynb
22470  22314  1f6ace09527243790560c5155853fc1875e44af9  besca/datasets/data/pbmc3k_processed.h5ad
20617  15544  f8fb905ca97c9707b3a0b9e5fc8c4f01abe7593d  workbooks/celltype_annotation_besca.ipynb
15816  11483  53db68a5f50fedc09126c4f1862f18f90336615b  docs/source/tutorials/auto_annot_tutorial.ipynb
14298  10713  5692c254135a1993150fce2325338341f9a9959e  docs/source/tutorials/notebook2_celltype_annotation_pbmc3k.ipynb

Obviously, those files no longer exist in the repo.

The script that was run:

#!/bin/bash
#set -x

# Shows you the largest objects in your repo's pack file.
# Written for osx.
#
# @see http://stubbisms.wordpress.com/2009/07/10/git-script-to-show-largest-pack-objects-and-trim-your-waist-line/
# @author Antony Stubbs

# set the internal field separator to line break, so that we can iterate easily over the verify-pack output
IFS=$'\n';

# list all objects including their size, sort by size, take top 10
objects=`git verify-pack -v .git/objects/pack/pack-*.idx | grep -v chain | sort -k3nr | head`

echo "All sizes are in kB. The pack column is the size of the object, compressed, inside the pack file."

output="size,pack,SHA,location"
for y in $objects
do
    # extract the size in bytes
    size=$((`echo $y | cut -f 5 -d ' '`/1024))
    # extract the compressed size in bytes
    compressedSize=$((`echo $y | cut -f 6 -d ' '`/1024))
    # extract the SHA
    sha=`echo $y | cut -f 1 -d ' '`
    # find the objects location in the repository tree
    other=`git rev-list --all --objects | grep "$sha" | sed 's/ /,/'`
    #lineBreak=`echo -e "\n"`
    output="${output}\n${size},${compressedSize},${other}"
done

# split on commas only, so that paths containing spaces stay in one column
echo -e "$output" | column -t -s ','

Source:
https://stackoverflow.com/questions/5613345/how-to-shrink-the-git-folder

One way to delete old large files from the Git history and so reduce the repository size:

First, download the BFG jar file from here: https://mvnrepository.com/artifact/com.madgag/bfg/1.12.16

Then follow this tutorial: https://rtyley.github.io/bfg-repo-cleaner/

For this repository, the size threshold was set to 5 MB: all files larger than 5 MB were deleted from the Git history, which reduced the repository size to 133 MB.
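The BFG step can be sketched roughly as follows. This is a minimal sketch, not the exact commands that were run: the jar file name `bfg-1.12.16.jar`, the directory `besca.git` and the helper names `run` and `strip_large_blobs` are assumptions made for illustration. `--strip-blobs-bigger-than` is BFG's size-threshold option, and the `git reflog expire` / `git gc` pair is the cleanup the BFG tutorial suggests afterwards.

```shell
#!/bin/bash
# Sketch: strip blobs larger than 5 MB from the history with BFG.
# Assumed names (not from this thread): bfg-1.12.16.jar, besca.git.

# Print the command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

strip_large_blobs() {
    local jar="$1" repo="$2" limit="${3:-5M}"
    # rewrite the history, dropping every blob above the size limit
    run java -jar "$jar" --strip-blobs-bigger-than "$limit" "$repo"
    # expire old references and repack so the space is actually freed
    run git -C "$repo" reflog expire --expire=now --all
    run git -C "$repo" gc --prune=now --aggressive
    # report the resulting on-disk size
    run git -C "$repo" count-objects -vH
}

strip_large_blobs bfg-1.12.16.jar besca.git 5M
```

By default the script only prints the commands; setting `DRY_RUN=0` executes them, which should only be done on a backup copy.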

Because an LFS push is necessary to push these changes and my account has run out of LFS capacity for this month, I will put this issue on hold.

Outcome of the research on this issue:

Based on the research, the following options are possible:

Clean with BFG and new remote repository
The big files can be deleted from the history with BFG. Once this is done, the clean repo can be pushed to a new remote repository, and the old files would be gone.

Clean with BFG and force push
The big files can be deleted from the history with BFG. It is important that all branches are pulled locally first, so that the big files are deleted everywhere in the repository. Afterwards, the changes would need to be force-pushed for all branches to the remote repository. Since some files in the history are stored in LFS rather than regular Git storage, this could cause problems and might damage the besca repo.
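For reference, the force-push variant could look roughly like the sketch below. It was not run against besca. The URL, the directory name and the helper names `run` and `rewrite_and_force_push` are assumptions, and a mirror clone with `git push --mirror` is swapped in here for "pull all branches, then force-push all branches", since a mirror clone fetches every branch and tag and a mirror push overwrites them all on the remote. LFS-tracked files are not handled by this sketch.

```shell
#!/bin/bash
# Sketch: rewrite the history in a mirror clone and overwrite the remote.
# Assumed names (not from this thread): the URL and the besca.git directory.

# Print the command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

rewrite_and_force_push() {
    local url="$1" mirror="$2"
    # a mirror clone brings every branch and tag into the local copy
    run git clone --mirror "$url" "$mirror"
    # ... rewrite the history inside "$mirror" here, e.g. with BFG ...
    # overwrite all refs on the remote with the rewritten ones
    run git -C "$mirror" push --mirror
}

rewrite_and_force_push https://github.com/bedapub/besca.git besca.git
```

With the default `DRY_RUN=1` nothing is cloned or pushed; the commands are only printed.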

Clone git repo without all the history
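With the following command, users could clone besca without downloading the file contents of all historical revisions: git clone --filter=blob:none. The commit history is still fetched, but old file versions are only downloaded when they are needed, so the download would be much faster.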
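The clone option can be sketched as below. The URL is an assumption derived from the repository name bedapub/besca, and `run` and `clone_blobless` are hypothetical helper names used only for this illustration.

```shell
#!/bin/bash
# Sketch: blobless (partial) clone of besca.
# The URL is an assumption based on the repository name bedapub/besca.

# Print the command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

clone_blobless() {
    local url="$1" dir="$2"
    # fetch commits and trees only; file contents (blobs) are downloaded
    # on demand, e.g. for the revision that gets checked out
    run git clone --filter=blob:none "$url" "$dir"
}

clone_blobless https://github.com/bedapub/besca.git besca
```

Setting `DRY_RUN=0` performs the clone instead of printing the command.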

Outcome of discussion in PMDA team

This was discussed with @kohleman and @hatjek.

Clean with BFG and new remote repository
We would lose all issues, pull requests and so on, so that's not an option.

Clean with BFG and force push
With this approach we would risk permanently damaging the besca repository, so that's not an option.

Clone git repo without all the history
It is a temporary solution that would still allow users with a slow internet connection to download besca in a short time.

Decision

We will add the git clone flag to the README to show users that besca can be downloaded faster with that flag.

Other approaches (which didn't work out)

Squashing commits (not possible in this case):
There is no easy way to delete the early commits that include the large files. Normally, you could simply squash the relevant commits, but because the files were added in the very first commit, that's not possible here.

The issue was fixed in merge request #246.