actions/checkout

Cache for LFS

gordinmitya opened this issue · 14 comments

Hello,

Can you provide some info on how to use the checkout action with a cache properly?
I ask because frequent builds use up the whole LFS download quota very fast (1 GB per month). (Because it downloads everything from scratch every time, right?)

Also, it would be great to have such an option out of the box!
Thank you in advance.

It looks like lfs info is under .git/lfs so you might be able to cache that directory

I was trying to do this just today, but unfortunately it doesn't seem to work. I have a cache action for .git/lfs/objects (where all the data files are) and run actions/checkout@v2 with lfs: true and clean: false, and I get the following output:

Syncing repository: owner/repo
Working directory is 'd:\a\repo\repo'
"C:\Program Files\Git\bin\git.exe" version
git version 2.25.1.windows.1
"C:\Program Files\Git\bin\git.exe" lfs version
git-lfs/2.10.0 (GitHub; windows amd64; go 1.12.7; git a526ba6b)
"C:\Program Files\Git\bin\git.exe" config --local --get remote.origin.url
##[error]fatal: --local can only be used inside a git repository
Deleting the contents of 'd:\a\repo\repo'
...

Looking at the code, it seems like this wipes everything, so the benefit of caching is lost.

Thanks @dabo248 for the solution.

However, it is quite verbose for something that should be (In my opinion) the default behavior.

Due to GitHub's policy of billing for Git LFS download bandwidth, Git LFS is not usable with GitHub Actions in practice unless it is cached.

So it would make sense that the option lfs: true of actions/checkout caches the LFS data by default. Wouldn't it?

@ericsciple, can you re-open the issue?

@dabo248 Thanks for the solution, but I was curious to know if that key is even valid? From the documentation, the key cannot be a directory so your key is being interpreted as a string; the cache would not get invalidated upon new additions to the lfs directory. Correct me if I'm wrong?

but I was curious to know if that key is even valid

You're right, @samesfahani-tuplehealth. Using .git/lfs for the key is not a good idea and will make the cache useless as soon as the large files change.

But before git lfs pull is called, the files are already there as tiny text files, each containing a hash. We can build a key based on those tiny text files.

Here's an example:

- name: Checkout repository
  uses: actions/checkout@v2

- name: Cache git lfs
  uses: actions/cache@v1.1.0
  with:
    path: .git/lfs
    key: ${{ hashFiles('**/*.zip') }} # Adapt to target the type of the files committed with git lfs

- name: Pull lfs data, if not cached
  run: git lfs pull
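To illustrate the "tiny text files" mentioned above, here is a minimal shell sketch. Before git lfs pull, an LFS-tracked file in the working tree is a pointer in the Git LFS pointer format (version, oid, size). The file name sample.zip and the oid value are made up for this example. Since the oid changes whenever the real content changes, hashing the pointer files gives a key that follows the LFS content.

```shell
# Write a hypothetical LFS pointer file (file name and oid are made-up examples).
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.zip" <<'EOF'
version https://git-lfs.github.com/spec/v1
oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
size 12345
EOF

# Extract the object ID that identifies the real (large) content.
oid=$(awk '$1 == "oid" { sub(/^sha256:/, "", $2); print $2 }' "$tmpdir/sample.zip")
echo "$oid"

# Clean up the temporary directory.
rm -rf "$tmpdir"
```

The pointer is only a few lines long, which is why hashing it is cheap compared with downloading the object it refers to.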

@jcornaz That's pretty much what I came to understand as well. However, even with that approach, if you have more than just zip files or you no longer want zip files to cache, whatever the use case, then you would need to update your CI file. How about this:

- name: Checkout code
  uses: actions/checkout@v2

- name: Create LFS file list
  run: git lfs ls-files -l | cut -d' ' -f1 | sort > .lfs-assets-id

- name: Restore LFS cache
  uses: actions/cache@v2
  id: lfs-cache
  with:
    path: .git/lfs
    key: ${{ runner.os }}-lfs-${{ hashFiles('.lfs-assets-id') }}-v1

- name: Git LFS Pull
  run: git lfs pull

Source: https://www.develer.com/en/avoiding-git-lfs-bandiwdth-waste-with-github-and-circleci/

The author's method was meant for CircleCI, but the same concept still stands: we create a file that lists the hashes of everything tracked in LFS and run hashFiles on it. Any time a file is added to or removed from LFS, the contents of this list change, so the cache key changes and the old cache entry no longer matches. I've also added a -v1 to the end in case you ever want to invalidate the cache manually, but you shouldn't need to.
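As a rough sketch of why this key tracks the LFS content, the same cut and sort pipeline can be run on a stand-in for the git lfs ls-files -l output. The lfs_listing function, the paths, and the shortened OIDs are hypothetical; the real command prints one line per tracked file with the full OID in the first column.

```shell
# Hypothetical stand-in for `git lfs ls-files -l` output: "<oid> <marker> <path>".
# The OIDs are shortened, made-up values.
lfs_listing() {
  printf '%s\n' \
    'bbbb2222 - assets/model.bin' \
    'aaaa1111 - assets/texture.png'
}

# Same pipeline as the workflow step: keep the OID column and sort it.
ids_before=$(lfs_listing | cut -d' ' -f1 | sort)

# Simulate committing a new version of model.bin (its OID changes).
ids_after=$(lfs_listing | sed 's/^bbbb2222/cccc3333/' | cut -d' ' -f1 | sort)

echo "before:"; echo "$ids_before"
echo "after:";  echo "$ids_after"
```

The two lists differ, so a hash over the list file would differ as well. Sorting keeps the list stable when only the order of the listing changes.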

@samesfahani-tuplehealth You're absolutely right, the static key is not useful. Setting a hash as the key is the way to go!

Is it possible for us to get this added into the checkout action itself as a flag?

@ericsciple can we re-open this issue?

We were just hit by this with git lfs burning through our bandwidth in one day.
It's unexpected and just leads to a poor user experience.

Has anyone molded @samesfahani-tuplehealth's solution into a gh-action yet?

This is now available as a separate gh-action at https://github.com/nschloe/action-checkout-with-lfs-cache. Instead of

- name: Checkout code
  uses: actions/checkout@v2
  with:
    lfs: true

just do

- name: Checkout code
  uses: nschloe/action-checkout-with-lfs-cache@v1

This is now available as a separate gh-action at https://github.com/nschloe/action-checkout-with-lfs-cache.

Thank you, but it would be nice to add a "fetch submodules" option to your action as well.

Any news or reconsiderations on reopening this issue @ericsciple? I just fell into the same trap and am forced to rely on the nschloe solution. Counting LFS usage within GitHub is unusual in itself, but it should at least be more difficult to do so.

I wanted to share my way of achieving this. It also handles changes to individual files. Thanks to @nschloe for providing a way to generate a key for LFS files. If you want to verify that it only downloads the new or updated files, add GIT_TRACE=1 as an environment variable on the lfs pull step.

lfs-cache:
  runs-on: ubuntu-latest
  steps:
    - name: Checkout
      uses: actions/checkout@v3

    - name: Create LFS file list
      run: git lfs ls-files --long | cut -d ' ' -f1 | sort > .lfs-assets-id

    - name: LFS Cache
      uses: actions/cache@v3
      with:
        path: .git/lfs/objects
        key: ${{ runner.os }}-lfs-${{ hashFiles('.lfs-assets-id') }}
        restore-keys: |
          ${{ runner.os }}-lfs-

    - name: Git LFS Pull
      run: git lfs pull

@ericsciple Would you be willing to reopen this issue?

It's still a problem for us, and I'd suggest the following:

  • Don't count LFS usage within the GitHub network. If people are using LFS for binaries instead of putting huge files into Git, that's a net win for your infrastructure costs, so you'd want to encourage it.
  • Add proper support for LFS caching to this action.

Friendly ping @cory-miller 😄

(see #165 (comment))