xorpaul/g10k

`git clone` instead of git archive | untar


Would be convenient to mimic the r10k behavior where the deployed modules are actual git repositories instead of being just a copy of all files.

E.g.

$ cat > Puppetfile <<EOF
mod 'stdlib',
  :git    => 'https://github.com/puppetlabs/puppetlabs-stdlib.git'
EOF

$ r10k puppetfile install
$ git -C modules/stdlib remote -v
cache   /home/chutzimir/.r10k/git/https---github.com-puppetlabs-puppetlabs-stdlib.git (fetch)
cache   /home/chutzimir/.r10k/git/https---github.com-puppetlabs-puppetlabs-stdlib.git (push)
origin  https://github.com/puppetlabs/puppetlabs-stdlib.git (fetch)
origin  https://github.com/puppetlabs/puppetlabs-stdlib.git (push)

The overhead of having a clone is quite minimal when cloning a repo on the same file system, but it actually allows you to work in the deployed modules and make commits there.

This could be made optional.
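
For illustration, one way to keep that clone overhead small is git's --reference option, which points the new clone at an existing local repository's objects instead of copying them (the cache path below is just a placeholder):

# Clone from the real remote, but borrow objects from a local cache repo via alternates
git clone --reference /path/to/central-cache.git https://github.com/puppetlabs/puppetlabs-stdlib.git modules/stdlib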

I don't see the benefit of this.

I'm copying all the files via git archive from a central git repository clone, so that there is a central cache in case the same git module is used anywhere else. With your suggestion you would have to clone the whole repository again instead of just pulling the updates and copying the files on the local file system, which is way faster than re-cloning the whole repository.

To summarize: the r10k behavior is much slower, even more so with large Puppet setups with many environments that use the same modules, which is the reason g10k behaves the way it does.
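
For context, a rough sketch of the archive-from-a-central-clone approach described above (variable names are placeholders, not g10k's actual code):

# Central cache, cloned only once; later runs just fetch the updates
git clone --mirror $url $cacheDir
git --git-dir $cacheDir fetch --prune
# Copy the requested tree into the deployment directory without any .git metadata
mkdir -p $targetDir
git --git-dir $cacheDir archive $tree | tar -x -f - -C $targetDir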

What I am suggesting is pretty much the same as what you do, but I guess I didn't explain it well.

You have the central clone which gets updated. This is what you are doing now.

Then you use that clone as a source of all the other deployments.

If you use the "git alternates" mechanism to share the git objects between the central repository and the deployments, the git clone overhead is very minimal.

Once you have a git repo pointing to the central cache, you can get your incremental updates with "git fetch" and "git checkout" from the cache. That is pretty fast.

Maybe it is easier to show the list of commands that can be used to create a repo that mimics the r10k behavior I am describing.

# Prepare your cache (you're already doing this)
git clone --mirror $url $cacheDir
# Prepare a git repo for the final destination
git init $targetDir

# Add the remotes for the git repo - both the real one and the cache for convenience
git --git-dir $targetDir/.git remote add origin $url
git --git-dir $targetDir/.git remote add cache $cacheDir

# First tell that repo it can find the objects in the cache directory so it only needs to fetch the refs
echo $cacheDir/objects > $targetDir/.git/objects/info/alternates

# Fetch the refs. Doesn't matter much where you get them from,
# the objects will not be fetched - they are already in the "alternate" directory.
# This operation would also be very fast for incremental updates
git --git-dir $targetDir/.git fetch cache
git --git-dir $targetDir/.git fetch origin

# And now update your destination tree
git -C $targetDir clean -fdx # Remove any unmanaged files (should be noop for most deployments)
git -C $targetDir reset --hard # Throw away any changes (should be noop as well)
git -C $targetDir checkout $(git --git-dir $cacheDir rev-parse --verify $tree)
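
For completeness, a sketch of what a subsequent incremental run could look like with the same placeholder variables, assuming $tree is the branch or commit to deploy:

# Update the central cache, then refresh the refs in the deployment repo
git --git-dir $cacheDir fetch --prune
git --git-dir $targetDir/.git fetch cache
# Throw away local noise and check out the requested tree again
git -C $targetDir clean -fdx
git -C $targetDir reset --hard
git -C $targetDir checkout $(git --git-dir $cacheDir rev-parse --verify $tree)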

Ah, I see.

Have you timed the differences between your method and the one g10k is currently doing? I'm doubting there is a noticeable (> 1s) difference even when using 20 Git repositories, but I'm willing to be convinced otherwise.

What's making a sync slow in general is issuing too many Puppetlabs Forge API calls and the initial git clone with the subsequent git pull commands. It's almost never the local file system operations.

This is in the eye of the beholder, but coming from r10k, anything is fast to me :)

Taking one of our deployments: 116 repos in our Puppetfile and 11 branches (environments).

The git archive method deploys this in 20 seconds, while the method described here takes 30 seconds.

It seems like a lot, but this is an initial deploy from a clean slate. It's not easy to time the incremental updates.

Watching this.
My environments are stored on EFS in AWS, which is very slow for small file operations. The archive/untar currently takes about 120s in our main repo if there is a change in one branch. I've artificially bumped our EFS volume to 100GB to allow higher throughput, but it doesn't help.

I'm wondering if there's other ways to update the environments similar to an rsync where less IOs are involved.

120 seconds with g10k? How long does your update take with r10k's update behaviour?

In that EFS configuration, a couple of hours.

This is just currently a test environment where we're running puppet in containers, having the EFS makes sharing data across a docker swarm in AWS pretty easy.

In this case I would be very surprised if even more parallel git clone or git pull processes would be faster for your EFS setup, and that is exactly what this issue is discussing: trading the git archive | untar approach for more git clone or git pull processes.

What you could do is switch to the original Puppetlabs approach of having g10k or r10k run on every Puppetserver instead.
Another solution would be to use a server in front of your Puppetservers without your EFS setup and simply use rsync or something like syncthing instead.
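
As a rough illustration of the rsync idea (paths and the host name are placeholders, nothing g10k provides out of the box):

# Run g10k/r10k against fast local disk on one deploy host, then push the result out
rsync -a --delete /var/cache/puppet-code/environments/ puppetserver01:/etc/puppetlabs/code/environments/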

The git archive method deploys this in 20 seconds, while using the method described here takes 30 seconds.

Which isn't a convincing argument to implement the method described here ;)

I'm wondering if there's other ways to update the environments similar to an rsync where less IOs are involved.

You can always try limiting the amount of workers that are doing the I/O:

  -maxextractworker int
        how many Goroutines are allowed to run in parallel for local Git and Forge module extracting processes (git clone, untar and gunzip) (default 20)
  -maxworker int
        how many Goroutines are allowed to run in parallel for Git and Forge module resolving (default 50)
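
For example, assuming a config file at /etc/g10k/g10k.yaml (the path is just a placeholder):

g10k -config /etc/g10k/g10k.yaml -maxworker 10 -maxextractworker 5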

If you still have a problem, suggestion or idea please open a new ticket for that.