FoldingAtHome/coronavirus

Repo is getting large

jchodera opened this issue · 3 comments

This repo is getting pretty large (>1.2 GB), so I'd like to suggest we either break it up into smaller repos or host the files on osf.io and link to them from here, so that it's still possible for people to check it out without losing their connection.

What about Git's Large File Storage?
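Something like this, perhaps (just a sketch; the *.pdb pattern is only a guess at what's taking up the space):

$ git lfs install                 # set up the LFS hooks (assumes git-lfs is installed)
$ git lfs track "*.pdb"           # track large structure files; the pattern is only an example
$ git add .gitattributes          # the tracking patterns are recorded in .gitattributes
$ git commit -m "Track structure files with Git LFS"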

That could work, but we'd also need to clean up the existing history with the BFG Repo Cleaner, since LFS on its own only applies to new commits.
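For example, stripping the biggest blobs out of history might look roughly like this (the 100M threshold is only an illustration, and the rewritten history would have to be force-pushed afterwards):

$ git clone --mirror https://github.com/FoldingAtHome/coronavirus coronavirus.git
$ java -jar bfg.jar --strip-blobs-bigger-than 100M coronavirus.git
$ cd coronavirus.git
$ git reflog expire --expire=now --all && git gc --prune=now --aggressive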

Other alternatives:

  • gzipped tarballs of input structures instead of uncompressed structures (see the sketch after this list)
  • store input files on osf.io
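For the tarball option, a minimal sketch (the 6m17 directory is just one example from this repo's layout):

$ tar -czf system-preparation/6m17-input.tar.gz system-preparation/6m17/
$ git rm -r system-preparation/6m17/            # replace the loose files with the tarball
$ git add system-preparation/6m17-input.tar.gz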

Another option is microsoft/scalar, or Git with partial clone. Scalar sets you up with partial clone and Git's sparse-checkout feature automatically. Partial clone saves network time because you don't download every version of every file, and sparse-checkout means you can expand the working directory only as you need it.

I tested against this repo with Scalar 20.03.167.1 on Windows. It should work the same on Mac.

Start by cloning:

$ scalar clone https://github.com/FoldingAtHome/coronavirus
Clone parameters:
  Repo URL:     https://github.com/FoldingAtHome/coronavirus
  Branch:       Default
  Cache Server: Default
  Local Cache:  C:\.scalarCache
  Destination:  C:\_git\t\coronavirus
  FullClone:     False
Authenticating...Succeeded
Fetching objects from remote...Succeeded
Checking out 'master'...Succeeded

$ cd coronavirus/src/

$ ls
README.md

Notice that only README.md is on disk. This is because the sparse-checkout is initialized to include only the files at the repository root. If you want the files for a certain directory (or list of directories), you can use the git sparse-checkout command:

$ git sparse-checkout set system-preparation/6m17
remote: Enumerating objects: 10, done.
remote: Counting objects: 100% (10/10), done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 19 (delta 5), reused 3 (delta 3), pack-reused 9
Receiving objects: 100% (19/19), 39.51 MiB | 11.23 MiB/s, done.
Resolving deltas: 100% (5/5), done.
Updating files: 100% (19/19), done.

$ ls
README.md  system-preparation/

$ ls system-preparation/
6m17/  README.md

If you really want every file at HEAD, then git sparse-checkout disable will populate the entire working directory. However, this repo is large because of the number of files, not because of a deep history. When I disabled sparse-checkout, I downloaded about 1 GiB of data:

$ git sparse-checkout disable
remote: Enumerating objects: 82, done.
remote: Counting objects: 100% (82/82), done.
remote: Compressing objects: 100% (34/34), done.
remote: Total 576 (delta 68), reused 49 (delta 48), pack-reused 494
Receiving objects: 100% (576/576), 1.04 GiB | 11.14 MiB/s, done.
Resolving deltas: 100% (335/335), done.
Updating files: 100% (741/741), done.

$ ls
potential-targets/  publications/  README.md  system-preparation/

Note that you can do all of this with plain Git, but Scalar makes it a bit easier. As the repo continues to grow, Scalar can help in a few extra ways, too.
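For reference, a rough plain-Git equivalent of the Scalar workflow above (a sketch, assuming a recent Git version with --sparse and cone-mode sparse-checkout support):

$ git clone --filter=blob:none --sparse https://github.com/FoldingAtHome/coronavirus
$ cd coronavirus
$ git sparse-checkout set system-preparation/6m17    # expand the working directory as needed
$ git sparse-checkout disable                        # or populate everything at HEAD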

Good luck!