Cleaning up a git repo for reducing the repository size
Closed this issue · 6 comments
A clone of the GitHub repository is currently larger than 500 MB. We should make a backup and consider cleaning it up to reduce the repository size, because a smaller repository helps users with slower internet connections and we don't need the full history; see e.g.
https://medium.com/@sangeethkumar.tvm.kpm/cleaning-up-a-git-repo-for-reducing-the-repository-size-d11fa496ba48
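The reported clone size can be checked with `git count-objects`, which prints the packed size of the local object store. A minimal sketch against a throwaway repo (in practice, run `git count-objects -vH` inside the besca clone itself):

```shell
#!/bin/sh
set -e
# throwaway repo purely to demonstrate the command;
# run `git count-objects -vH` inside the besca clone instead
tmp=$(mktemp -d)
cd "$tmp"
git init -q demo
cd demo
git count-objects -vH
```

The `size-pack` line is the number to watch: it is the on-disk size of the packfiles, which dominates a clone of this repository.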
Thank you for raising the issue, I will take a look at how to reduce the repository size.
Biggest files in the repo:
(base) ➜ besca git:(master) ✗ ./size.sh
All sizes are in kB. The pack column is the size of the object, compressed, inside the pack file.
size pack SHA location
91326 23503 2b37bcdd1a6306dd57b0bc9d986393441f416afd besca/datasets/data/pbmc_storage_raw_downsampled.h5ad
56271 27156 a18f9f3740692204a0262530d2c6a0edbb9e8057 besca/datasets/data/pbmc3k_processed.h5ad
44202 33341 833f4407468139196b6f4f8d8ff163f6ea989b55 workbooks/02- Annotating Cellines.ipynb
30501 23494 b1a5a6014c9e33a1db06888c5e21da1f73a626d6 workbooks/celltype_annotation_besca.ipynb
29559 26717 02329d1f37acf94cfc9f3663e4ecc7c60d52311f besca/datasets/data/pbmc_storage_raw_downsampled.h5ad
28679 21438 08094d2a98d6f3820ceb1fa3b52cd5f021c6181d workbooks/celltype_annotation_besca.ipynb
22470 22314 1f6ace09527243790560c5155853fc1875e44af9 besca/datasets/data/pbmc3k_processed.h5ad
20617 15544 f8fb905ca97c9707b3a0b9e5fc8c4f01abe7593d workbooks/celltype_annotation_besca.ipynb
15816 11483 53db68a5f50fedc09126c4f1862f18f90336615b docs/source/tutorials/auto_annot_tutorial.ipynb
14298 10713 5692c254135a1993150fce2325338341f9a9959e docs/source/tutorials/notebook2_celltype_annotation_pbmc3k.ipynb
Obviously some of those objects no longer exist in the repo.
The script that was run:
#!/bin/bash
#set -x
# Shows you the largest objects in your repo's pack file.
# Written for osx.
#
# @see http://stubbisms.wordpress.com/2009/07/10/git-script-to-show-largest-pack-objects-and-trim-your-waist-line/
# @author Antony Stubbs
# set the internal field separator to line break, so that we can iterate easily over the verify-pack output
IFS=$'\n';
# list all objects including their size, sort by size, take top 10
objects=`git verify-pack -v .git/objects/pack/pack-*.idx | grep -v chain | sort -k3nr | head`
echo "All sizes are in kB. The pack column is the size of the object, compressed, inside the pack file."
output="size,pack,SHA,location"
for y in $objects
do
    # extract the size in bytes and convert to kB
    size=$((`echo $y | cut -f 5 -d ' '`/1024))
    # extract the compressed size in bytes and convert to kB
    compressedSize=$((`echo $y | cut -f 6 -d ' '`/1024))
    # extract the SHA
    sha=`echo $y | cut -f 1 -d ' '`
    # find the object's location in the repository tree
    other=`git rev-list --all --objects | grep $sha`
    output="${output}\n${size},${compressedSize},${other}"
done
echo -e "$output" | column -t -s ', '
Source:
https://stackoverflow.com/questions/5613345/how-to-shrink-the-git-folder
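On newer Git versions, a similar "largest blobs" listing can be produced without parsing `verify-pack` output, using `git cat-file --batch-check`. This is a sketch, not the script the thread actually used, and sizes here are in bytes rather than kB:

```shell
#!/bin/sh
set -e
# build a tiny throwaway repo so the pipeline below has objects to rank;
# in practice run just the final pipeline inside the besca clone
tmp=$(mktemp -d)
cd "$tmp"
git init -q demo
cd demo
git config user.email demo@example.com
git config user.name demo
head -c 4096 /dev/zero > big.bin
echo small > small.txt
git add big.bin small.txt
git commit -qm "add files"

# list every blob with its size and path, largest first, top 10
git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  awk '$1 == "blob"' |
  sort -k3 -n -r |
  head -10
```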
One possibility is to delete old large files from the Git history to reduce the repository size:
First download the BFG jar file from here: https://mvnrepository.com/artifact/com.madgag/bfg/1.12.16
Then follow this tutorial (https://rtyley.github.io/bfg-repo-cleaner/)
For this repository, the minimum file size was set to 5 MB. All files above 5 MB were deleted from the Git history, which reduced the repository size to 133 MB.
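The BFG run described above was presumably something like the following. This is a hedged sketch: the jar filename, the 5 MB threshold and the mirror-clone URL are assumptions, and it is wrapped in a function so nothing destructive runs when the file is sourced:

```shell
#!/bin/sh
# Sketch of the BFG workflow; review carefully before running for real.
bfg_cleanup() {
    # work on a bare mirror so every branch and tag gets rewritten
    git clone --mirror https://github.com/bedapub/besca.git &&
    # delete every blob larger than 5 MB from the entire history
    java -jar bfg-1.12.16.jar --strip-blobs-bigger-than 5M besca.git &&
    cd besca.git &&
    # expire old reflog entries and repack to actually reclaim the space
    git reflog expire --expire=now --all &&
    git gc --prune=now --aggressive
    # a force push (or a push to a fresh remote) is still needed afterwards
}
```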
Because an LFS push is necessary to push these changes and my account has run out of LFS capacity for this month, I will put this issue on hold.
Outcome of research about this issue:
The research identified the following options:
Clean with BFG and new remote repository
The big files can be deleted from the history with BFG. After this has been done, you can push the clean repo to a new remote repository; the old files would then be gone.
Clean with BFG and force push
The big files can be deleted from the history with BFG. Importantly, all branches need to be pulled locally in order to delete the big files everywhere in the repository. After that, you would need to force-push those changes, covering all branches and tags, to the remote repository. Since some files in the history are stored in LFS rather than regular Git storage, that could cause problems and possibly damage the besca repo.
Clone git repo without all the history
With the following command, users could clone besca without downloading all the unnecessary history: git clone --filter=blob:none. The download would then be much faster.
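A minimal demonstration of a blobless partial clone against a local throwaway repository. For besca the command would be `git clone --filter=blob:none` followed by the repository URL (assumed here to be https://github.com/bedapub/besca.git); GitHub's servers already allow the filter, whereas a local source needs `uploadpack.allowFilter`:

```shell
#!/bin/sh
set -e
# build a small source repo to clone from
tmp=$(mktemp -d)
cd "$tmp"
git init -q src
cd src
git config user.email demo@example.com
git config user.name demo
echo hello > file.txt
git add file.txt
git commit -qm "initial commit"
cd ..
# allowfilter lets the local "server" side honour --filter=blob:none;
# --no-local forces the normal transport so the filter is actually used
git -c uploadpack.allowfilter=true clone -q --no-local --filter=blob:none src partial
cd partial
git log --oneline
```

Missing blobs are fetched on demand when a commit is checked out, so day-to-day work is unaffected.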
Outcome of discussion in PMDA team
Has been discussed with @kohleman and @hatjek
Clean with BFG and new remote repository
We would lose all issues, pull requests and so on, so that's not an option
Clean with BFG and force push
With this approach we would risk damaging the besca repository permanently, so that's not an option
Clone git repo without all the history
It is only a temporary solution, but it would allow users with a slow internet connection to still download besca quickly
Decision
We will add the git clone flag to the README, to show users that faster downloads of besca can be accomplished with that flag
Other approaches (which didn't work out)
Squashing commits (not possible in this case):
There is no easy way to delete the commits at the beginning of the history, which include the large files. Normally you could simply squash the relevant commits, but because the files were committed in the very first commit, that is not possible here.
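For completeness: newer Git can in fact edit history all the way back to the root commit with `git rebase -i --root`, although this is still a full history rewrite that needs a force push, so the same objections as for the BFG force-push option apply. A minimal sketch on a throwaway repo, folding everything into the root commit (the sequence-editor script stands in for the interactive step):

```shell
#!/bin/sh
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q demo
cd demo
git config user.email demo@example.com
git config user.name demo
echo one > a.txt && git add a.txt && git commit -qm "first (large files)"
echo two > b.txt && git add b.txt && git commit -qm "second"

# non-interactive stand-in for the editor: mark every commit after
# the first as "fixup" so it is folded into the root commit
cat > seq-edit.sh <<'EOF'
#!/bin/sh
sed -e '2,$s/^pick/fixup/' "$1" > "$1.tmp" && mv "$1.tmp" "$1"
EOF
chmod +x seq-edit.sh
GIT_SEQUENCE_EDITOR="$PWD/seq-edit.sh" git rebase -i --root

git rev-list --count HEAD   # history is now a single commit
```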