scratchfoundation/scratch-gui

File too large in the Git history

garrapato opened this issue · 7 comments

I have noticed that the repository contains a very large file in the Git history. Is it possible that someone committed a file or directory by mistake? It could be, for example, node_modules or something similar, unless that really is the correct size of the repository.

I just made a clone of the repository and it took a long time to download it.

The file I refer to in a fresh clone of the repository is:

scratch-gui/.git/objects/pack/pack-c81e535f2cf1cd650ef7a6e69553ee444473a465.pack

Expected Behavior

Less time to clone (download) the repo

Actual Behavior

I just made a clone of the repository and it took a very long time to download it.

Steps to Reproduce

$ git clone https://github.com/LLK/scratch-gui.git
Cloning into 'scratch-gui'...
remote: Enumerating objects: 48, done.
remote: Counting objects: 100% (48/48), done.
remote: Compressing objects: 100% (47/47), done.
Receiving objects:  100% (67966/67966), 11.68 MiB | 815.00 KiB

$ cd scratch-gui
$ du -ch . | grep "G\t"
1.0G	./.git/objects/pack
1.1G	./.git/objects
1.1G	./.git
1.1G	.
1.1G	total

$ cd ./.git/objects/pack
$ ll
total 2199344
-r--r--r--  1 garrapato  staff     1904120 Sep  1 06:09 pack-c81e535f2cf1cd650ef7a6e69553ee444473a465.idx
-r--r--r--  1 garrapato  staff  1111074667 Sep  1 06:24 pack-c81e535f2cf1cd650ef7a6e69553ee444473a465.pack

The file already measures more than 1 GB!
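
As a quicker check than du, Git itself can report how much space its packed objects take. This is only a minimal sketch; the helper name repo_pack_size is made up for illustration, while count-objects and its -v and -H flags are standard Git.

```shell
# Print the on-disk size of the packed objects of the repository in the
# given directory (default: current directory).
# `repo_pack_size` is a hypothetical helper name, not part of Git.
repo_pack_size() {
  # count-objects -v prints "key: value" lines; -H makes sizes human-readable.
  git -C "${1:-.}" count-objects -vH | sed -n 's/^size-pack: //p'
}

# Example (not run here):
#   repo_pack_size scratch-gui
```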

Possible solution

If a file or directory was committed by mistake, it must be removed from the history. The following article shows how to do it:

Removing sensitive data from a repository

Operating System and Browser

Mac OS 10.11.14
Chrome Version 76.0.3809.132 (Official Build) (64-bit)

I hope this information will be useful

Regards

Maybe this is because of tutorial animated GIFs. We recommend the use of --depth 10 when cloning this repo.

In general, this repo's lengthy commit history will make a full clone take an incredible amount of time. I recommend --depth 1 rather than 10 because it really can get to be too much.
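
For reference, a shallow clone and a later "un-shallow" could look roughly like the sketch below. The function names are made up for illustration; --depth and fetch --unshallow are standard Git options.

```shell
# Hypothetical helper names; the git commands and flags are standard.

# Clone only the most recent commit of the default branch.
shallow_clone() {  # usage: shallow_clone <url> <dir>
  git clone --depth 1 "$1" "$2"
}

# Later, fetch the rest of the history if it turns out to be needed.
unshallow() {  # usage: unshallow <dir>
  git -C "$1" fetch --unshallow
}

# Example (not run here):
#   shallow_clone https://github.com/LLK/scratch-gui.git scratch-gui
```

Note that --depth implies --single-branch, so only the default branch is fetched. Committing and branching still work in the resulting shallow clone.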

There are indeed a large number of static image files. There's not anything we can do about the size of the history without causing a lot of conflicts.

Do maintainers typically just take the time/space for a full clone? I was under the impression you can't branch/commit/pull on a shallow clone.

I just did that. Looks like it's 1.9GB on disk and took about 33min on my connection for a full clone.

I was curious so I tried this https://stackoverflow.com/a/42544963/69002
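
This is not a verbatim copy of the linked answer, just a sketch in the same spirit: list every object reachable from any ref, keep the blobs, and sort them by size. The helper name largest_blobs is made up; the git rev-list and git cat-file options are standard.

```shell
# List the N largest blobs anywhere in the history of the current repository.
# `largest_blobs` is a hypothetical helper name.
# Output columns: <object id> <size in bytes> <path>
largest_blobs() {  # usage: largest_blobs [N]   (default N = 10)
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
    sed -n 's/^blob //p' |   # keep blobs only, drop the type column
    sort -k2,2nr |           # sort by size in bytes, largest first
    head -n "${1:-10}"
}

# Example (not run here), from inside a full clone:
#   largest_blobs 20
```

Because of --all, blobs that are only reachable from other branches (including remote-tracking ones) are listed too.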

It seems like what's taking up a lot of the space is dependencies being committed to the gh-pages branch. Like d33ef36 for example.

Those lib.min.js files seem to be 15MB-20MB each and get committed a few times a day in a few different subdirectories.

With the insight that big files are only in the gh-pages branch (which is usually disconnected from main), the situation can be improved using --single-branch when cloning.

~/code/scratch 
❯ git clone https://github.com/LLK/scratch-gui --single-branch 
Cloning into 'scratch-gui'...
remote: Enumerating objects: 44888, done.
remote: Counting objects: 100% (56/56), done.
remote: Compressing objects: 100% (26/26), done.
remote: Total 44888 (delta 38), reused 44 (delta 30), pack-reused 44832
Receiving objects: 100% (44888/44888), 313.38 MiB | 4.37 MiB/s, done.
Resolving deltas: 100% (29577/29577), done.

~/code/scratch [⏱ 1m14s]
❯ du -hs scratch-gui
392M	scratch-gui

~1 minute and ~400MB seem acceptable.
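
If gh-pages is needed later after all, a single-branch clone can be extended one branch at a time. A rough sketch, with made-up function names and assuming the remote is called origin (the default after git clone):

```shell
# Hypothetical helper names; the git commands and flags are standard.

# Clone only the branch the remote HEAD points to (normally the default branch).
clone_single_branch() {  # usage: clone_single_branch <url> <dir>
  git clone --single-branch "$1" "$2"
}

# Start tracking one more remote branch in a single-branch clone and fetch it.
add_remote_branch() {  # usage: add_remote_branch <dir> <branch>
  git -C "$1" remote set-branches --add origin "$2" &&
    git -C "$1" fetch origin "$2"
}

# Example (not run here):
#   clone_single_branch https://github.com/LLK/scratch-gui scratch-gui
#   add_remote_branch scratch-gui gh-pages
```

Fetching gh-pages this way will of course bring back the large download, so it only makes sense when that branch is actually needed.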