bitcoin-dev-project/sim-ln

Repo is unnecessarily large

Opened this issue · 7 comments

I noticed that creating branches was taking a while, and after a quick look it seems that's because the repo is 247 MB.

I ran the following commands to list the largest blobs and it looks like some builds were accidentally committed early on:

git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  sed -n 's/^blob //p' |
  sort --numeric-sort --key=2
# for example this blob was added and deleted a minute later
git whatchanged --all --find-object=6507a7347f3b151262807d43af4114d287b0d446
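If only the worst offenders are of interest, the pipeline above can be wrapped in a small helper. This is a rough sketch rather than anything from the repo: the function name `largest_blobs` and the default of 10 are my own assumptions.

```shell
# Hypothetical helper (name and default are assumptions): print the N largest
# blobs reachable from any ref as "<object id> <size in bytes> <path>".
largest_blobs() {
  n="${1:-10}"
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
    sed -n 's/^blob //p' |
    sort -k2,2nr |   # field 2 is the size; numeric, largest first
    head -n "$n"
}
```

Run inside the clone, `largest_blobs 20` would show the top 20, and `git count-objects -vH` reports the packed size of the local object store for comparison.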

The following is a SO comment and post that discusses techniques for removing blobs from history: https://stackoverflow.com/questions/2100907/how-to-remove-delete-a-large-file-from-commit-history-in-the-git-repository/61602985#61602985

As these files appear to have been committed and pushed in error, I would support their removal from the history.

cc @okjodom @sr-gi, I think it's worthwhile doing a one-off cleanup?

I do agree. It's not worth having an unnecessarily big repo because of files that were pushed by accident.

+1 on cleanup

What do you think of an interactive rebase to drop PRs #9 and #62?

That goes over my head git-wise, but I'll be ok with doing so if possible

having a go at it

I just experimented with this on a fresh clone of the repo

My starting step was an interactive rebase to remove commits 72c4f11, then b87a0ae .. 1a75d06, followed by a further rewrite to remove the associated blobs.

`git rebase --interactive 4086f94` to drop `b87a0ae` .. `1a75d06` and `72c4f11`

For blob cleanup, git-filter-repo from the Stack Overflow thread worked effectively. According to the SO discussion, this tool provides the same capabilities as git filter-branch.

  • to remove activity-generator blobs
    python3 git-filter-repo --invert-paths --path-match activity-generator --force
  • to remove js blobs
    python3 git-filter-repo --invert-paths --path-match js --force
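To double-check that a path is really gone after the rewrite, something like the following could be used. It is only a sketch: `path_purged` is a made-up name, and it only checks commits reachable from refs in the local clone.

```shell
# Hypothetical helper (name is an assumption): succeed only if no commit
# reachable from any ref touches the given path.
path_purged() {
  [ -z "$(git log --all --oneline -- "$1")" ]
}
```

For example, `path_purged activity-generator && path_purged js && echo "clean"` should print `clean` once both filter runs have completed.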

This results in the blob set
2.blobs.after.txt

whereas before, the list of blobs was
2.blobs.before.txt
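The two lists can be diffed to see exactly which blobs the rewrite dropped. A minimal sketch, assuming both files hold one blob per line in the format produced by the rev-list command; the helper name `removed_blobs` is hypothetical.

```shell
# Hypothetical helper (name is an assumption): print lines that appear in the
# "before" list but not in the "after" list.
removed_blobs() {
  before_sorted=$(mktemp)
  after_sorted=$(mktemp)
  # comm requires both inputs to be sorted the same way
  sort "$1" > "$before_sorted"
  sort "$2" > "$after_sorted"
  # -23 suppresses lines unique to the second file and lines common to both
  comm -23 "$before_sorted" "$after_sorted"
  rm -f "$before_sorted" "$after_sorted"
}
```

Usage would be `removed_blobs 2.blobs.before.txt 2.blobs.after.txt`. Because blobs are content-addressed, files untouched by the rewrite keep the same object id and should not show up in the output.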

I used the original rev-list command to list blobs in repo

git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  sed -n 's/^blob //p' |
  sort --numeric-sort --key=2 > file.txt

From here, I'm not sure how we'd push this revised history upstream and get forks and clones to pick it up.
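One possible way forward, offered as a sketch and not as an agreed plan: a maintainer force-pushes the rewritten refs, and everyone with an existing clone resets onto the new history. The function names, the `origin` remote name and the `main` default branch are all assumptions here. Note that git-filter-repo removes the `origin` remote after a rewrite, so it has to be added back first, and a plain `--force` is used because `--force-with-lease` has no remote-tracking refs to compare against at that point.

```shell
# Maintainer side (hypothetical helper): publish the rewritten history.
# $1 is the upstream URL; any branch protection would need to allow force pushes.
publish_rewritten_history() {
  remote_url="$1"
  # Re-add the remote if it was removed, otherwise just point it at the URL
  git remote add origin "$remote_url" 2>/dev/null ||
    git remote set-url origin "$remote_url"
  git push --force --all origin
  git push --force --tags origin
}

# Contributor side (hypothetical helper): discard the old local history and
# match the rewritten upstream branch. Local-only commits on that branch are lost.
resync_clone() {
  branch="${1:-main}"   # assumption: the default branch is called main
  git fetch origin
  git checkout "$branch"
  git reset --hard "origin/$branch"
}
```

Open PRs and forks would still carry the old commits, so they would likely need to be rebased onto the new history, and a fresh clone is probably the simplest option for most people.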