github/git-sizer

Question: How is logical file size calculated (Total size of files)

reaandrew opened this issue · 1 comments

I am trying to understand how git-sizer calculates logical file size. I have included some observations below, using finagle as an example: the size reported by GitHub, the output of du -h, and the size of the pack file. These all come to roughly the same figure.

My question is about logical size. I found a script and compared its output with git-sizer's; the results differ, and I was wondering if someone could help me understand why. I assume it has something to do with deletions and with how the size is evaluated by walking the tree.

Example project: https://github.com/twitter/finagle
GitHub-reported size: 101.81MB
Git pack size (.pack file): 102MB (using ls -lh rounded up from 106701022 bytes)
Bash script using du -h: 132M (I understand this to be the contents of the .git directory along with the files at the version they are at in the working directory)
Git-Sizer: Total size of files [8] | 159 MiB (I understand this to be logical size and not considering any of the compression techniques)
*Bash script using git rev-list and git cat-file: 175M (deduplicating by filename)
**Bash script using git rev-list and git cat-file: 291M (summing all the file sizes as they appear, without deduplicating by filename)

* script: git rev-list --no-walk --all --objects --date-order | sort -u -t' ' -k2r | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | grep ^blob | cut -d ' ' -f3 | paste -s -d + - | bc | numfmt --field 1 --to=iec

** script: git rev-list --no-walk --all --objects --date-order | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | grep ^blob | cut -d ' ' -f3 | paste -s -d + - | bc | numfmt --field 1 --to=iec

All sizes are the sizes of the corresponding Git objects after being uncompressed and de-deltified, e.g., as output by git cat-file --batch-check. The "size of checkout", I think, is the sum of the sizes of the blobs that would be written to disk as files (i.e., ignoring the size of any directories needed to hold them), without considering any smudge filters or end-of-line conversions. Symlinks, I presume, count as the length of the path they reference.
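To make the "sizes as output by git cat-file --batch-check" point concrete, here is a minimal sketch that sums the uncompressed size of each blob once, keyed by object ID rather than by filename. The function name sum_unique_blob_sizes is hypothetical, and counting each unique blob exactly once is an assumption about what a logical total should mean; this is not taken from git-sizer's source, so the result may not match its figure exactly.

```shell
# Hypothetical helper: sum the uncompressed sizes of unique blobs.
# Reads lines on stdin in the form "<type> <object-id> <size> [<path>]",
# i.e. the --batch-check format used by the scripts in the question.
# Assumption: a blob reachable under several paths or from several
# commits is counted once, keyed by its object ID.
sum_unique_blob_sizes() {
  awk '
    # Keep only blobs, and only the first time each object ID is seen.
    $1 == "blob" && !seen[$2]++ { total += $3 }
    # %.0f avoids integer overflow in awk variants with 32-bit %d.
    END { printf "%.0f\n", total }
  '
}
```

To try it on a repository, pipe the object list through it: git rev-list --objects --all | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | sum_unique_blob_sizes. Unlike the scripts in the question, this drops --no-walk so that the whole history is traversed, and it deduplicates by object ID instead of by path, which is one plausible reason the 175M and 291M figures differ from git-sizer's.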

The tabular output rounds the sizes to human-readable numbers, but you can get the exact numbers from the JSON output (-j).