/puppet-grokmirror

Grokmirror module for puppet

Primary LanguagePuppet

grokmirror

Table of Contents

  1. Overview
  2. Usage - Basic usage
  3. Reference - Configuration options
  4. Limitations - OS compatibility, etc.

Overview

Build Status

A Puppet module for managing deployments of Grokmirror - the smart way to mirror large git repository collections

Usage

To use this module you can either directly include it in your module tree, or add the following to your Puppetfile:

  mod 'mricon-grokmirror'

A node should then be assigned the relevant grokmirror classes. You must pass a sites hash with at least one site configuration. E.g. for mirroring kernel.org git repositories:

  class { 'grokmirror':
    sites => {
      'kernelorg': {
        pull_remote_manifest => 'https://git.kernel.org/manifest.js.gz',
        pull_site_url => 'git://git.kernel.org',
        pull_include => [
          '/pub/scm/linux/kernel/git/torvalds/*',
          '/pub/scm/linux/kernel/git/stable/*',
        ],
      }
    }
  }

If you're using Hiera, the same configuration might look like:

  grokmirror::sites:
    'kernelorg':
      pull_remote_manifest: 'https://git.kernel.org/manifest.js.gz'
      pull_site_url: 'git://git.kernel.org'
      pull_include:
        - '/pub/scm/linux/kernel/git/torvalds/*'
        - '/pub/scm/linux/kernel/git/stable/*'

Reference

grokmirror

manage_package

Whether to manage the grokmirror package or not.

Default: true

package_name

Name of the package to install.

Default: python-grokmirror

package_ensure

In case you are not running grokmirror from a package, you can set this to absent.

Default: installed

git_manage_package

Whether to manage the git package or not.

Default: true

git_package_name

Name of the git package to install (e.g. handy if you want git2u from IUS).

Default: git

git_package_ensure

Should we ensure that package is installed or removed?

Default: installed

global_configdir

Where the configuration files for sites should be created. You can override site-specific config file locations in the site config.

Default: /etc/grokmirror

global_toplevel

Where the repositories for each site are going to be placed. E.g. for a site named kernelorg, the location will be global_toplevel/kernelorg. You can override site-specific toplevel locations in the site config.

Default: /var/lib/git

global_logdir

Where to keep the log files. E.g. for a site named kernelorg, the logfiles will be global_logdir/kernelorg-pull.log and global_logdir/kernelorg-fsck.log. You can override site-specific logfile locations in the site config, but you will need to provide your own logrotate handlers.

Default: /var/log/grokmirror

global_loglevel

Loglevel that is inherited by all sites; can be debug or info. You can override site-specific loglevel setting in the site config.

Default: info

user

User that owns the repositories and runs the pull/fsck scripts.

Default: grokmirror

manage_user

Whether to manage the user (set to false if the user is created by another module or is pre-existing).

Default: true

group

Group that owns the repositories.

Default: grokmirror

manage_group

Whether to manage the group (set to false if the user is created by another module or is pre-existing).

Default: true

pull_command

The command that executes grok-pull. If you installed from a package, this will be /usr/bin/grok-pull, but if you are running from a git repository, you can override it here.

Default: /usr/bin/grok-pull

fsck_command

The command that executes grok-fsck. If you installed from a package, this will be /usr/bin/grok-fsck, but if you are running from a git repository, you can override it here.

Default: /usr/bin/grok-fsck

cron_environment

The environment to pass to the cron scripts.

Default: PATH=/bin:/usr/bin

grokmirror::sites

ensure

If present, configures a site, and if absent, will remove the site configuration and cronjobs, but not the mirrored repositories or logfiles.

Default: present

toplevel

Where the repositories for this site will be mirrored.

Default: global_toplevel/sitename

local_manifest

Where to save the local copy of the manifest.

Default: toplevel/manifest.js.gz

pull_enable

Whether to enable the grok-pull configuration. Sometimes you just want to enable frok-fsck runs (e.g. on a git master).

Default: true

pull_configfile

Where to create the config file.

Default: global_configdir/sitename-repos.conf

pull_logfile

Where to store the grok-pull log.

Default: global_logdir/sitename-pull.log

pull_loglevel

Can be used to override global_loglevel. Must be info or debug.

Default: info

pull_remote_manifest

Where the remote manifest for the repositories we are mirroring is located. One of the two required settings that must be provided. E.g. for kernel.org, it is https://git.kernel.org/manifest.js.gz.

pull_site_url

The location of the git server where we are going to be pulling from. E.g. for kernel.org it is git://git.kernel.org.

pull_default_owner

If the remote repository does not specify the owner (to display in gitweb/cgit views), set it to this.

Default: Grokmirror

pull_ignore_repo_references

Never clone with --reference and always create independent clones with no alternates. Safer, but requires dramatically more disk space. Good for backups.

Default: false

pull_projectslist

Where to create the projects.list for cgit needs.

Default: toplevel/projects.list

pull_projectslist_trimtop

See grokmirror documentation for explanation.

Default: undef

pull_projectslist_symlinks

See grokmirror documentation for explanation.

Default: false

pull_post_update_hook

After a repository is updated, run this script. See grokmirror documentation for full details.

Default: undef

pull_purgeprotect

If -p is passed, grokmirror will refuse to purge repositories if more than this percentage of them is to be deleted. A good protection in case the master provided an empty manifest or manifest with greatly reduced list of repositories.

Default: 5

pull_threads

How many git remote update processes to create in parallel. Shouldn't be larger than how many processor threads you have, and requires good random access disk IO speeds.

Default: 5

pull_include

An Array of strings containing shell-globbed list of repos to include in the slave mirror. See grokmirror documentation for full details.

Default: ['*']

pull_exclude

An Array of strings containing shell-globbed list of repos to exclude from the mirror. See grokmirror documentation for full details.

Default: undef

pull_cron_enable

Whether to enable the cronjob running grok-pull on a regular basis. You probably want this, unless you want to only update the mirror on an ad-hoc manual basis. Default is to run grok-pull every 5 minutes.

Default: true

pull_cron_minute

Minutes parameter to pass to cron (must be a String). Can be anything cron understands, e.g. */5 for "every 5 minutes", */20 for every 20 minutes, etc.

Default: */5

pull_cron_hour

The "hour" parameter to pass to cron (must be a String).

Default: *

pull_cron_month

The "month" parameter to pass to cron (must be a String).

Default: *

pull_cron_monthday

The "day of the month" parameter to pass to cron (must be a String).

Default: *

pull_cron_weekday

The "weekday" parameter to pass to cron (must be a String).

Default: *

pull_cron_extra_flags

You probably want to include the -p flag by default, unless you specifically do NOT want to purge as part of the regular cron run (e.g. if you have thousands of repositories and this is too much of a IO hit). If so, set to undef or empty string.

Default: -p

fsck_enable

Whether to enable the grok-fsck configuration. You probably always want to do that if you're doing grok-pull, otherwise your repos will never get repacked and pruned.

Default: true

fsck_configfile

Where to create the grok-fsck config file.

Default: global_configdir/sitename-fsck.conf

fsck_logfile

Where to store the grok-fsck log.

Default: global_logdir/sitename-fsck.log

fsck_loglevel

Can be used to override global_loglevel. Must be info or debug.

Default: info

fsck_lockfile

Where to store the lockfile to ensure that only one grok-fsck instance is running.

Default: toplevel/.fsck.lock

fsck_statusfile

Where to keep the status file for state-tracking between runs.

Default: toplevel/.fsck-status.js

fsck_frequency

How often (roughly) each repository should be fsck'd and repacked/pruned -- in days. See grokmirror documentation for more details.

Default: 30

fsck_repack

Whether to repack the repositories after doing git fsck. You almost always want this on.

Default: true

fsck_repack_flags

VERSIONS OF GROKMIRROR BEFORE 1.2

The repack flags to use when repacking the repository. If you have newer git than 2.1, you should also pass -b --pack-kept-objects to pre-create bitmaps for faster "objects counting" stage. See git-repack and grokmirror documentation for more info.

Default: -Adlq

fsck_full_repack_every

VERSIONS OF GROKMIRROR BEFORE 1.2

Repos should be repacked more thoroughly every now and again, in order to create better deltas. This setting tells grokmirror how frequently this should happen (e.g. 10 means that every 10th repack should be a full repack).

Default: 10

fsck_extra_repack_flags

VERSIONS OF GROKMIRROR STARTING WITH 1.2

Grokmirror-1.2 will figure out the necessary flags to pass to the repack job based on a lot of parameters, but you can add extra ones here if you like, such as --window-memory or --threads.

Default: (nothing)

fsck_extra_repack_flags_full

VERSIONS OF GROKMIRROR STARTING WITH 1.2

You can pass additional flags to a full repack when Grokmirror decides the repository can benefit from it. They are added to the value of extra_repack_flags, so no need to replicate those here.

Default: --window=200 --depth=50

fsck_prune

Whether to prune the repos after repacking (you almost always want this).

Default: true

fsck_precious

VERSIONS OF GROKMIRROR STARTING WITH 1.2

Setting to true will add extensions.preciousObjects=true git configuration to all repositories that are parents to others (via git alternates). Turning this on will help eliminate the possibility of repository corruption, but at a price of keeping all redundant objects on disk forever. Repositories with preciousObjects will still be repacked periodically, but redundant packs and loose objects will never be cleaned up and will be kept around forever.

You probably want to leave this as "false" unless you're running grok-fsck on your git master server.

Default: false

fsck_cron_enable

Whether to enable the cronjob running grok-fsck on a regular basis. You probably want this, unless you want to only run it manually on an ad-hoc basis. Default is to run it every Sunday at 4AM system time.

Default: true

fsck_cron_minute

Minutes parameter to pass to cron (must be a String).

Default: 0

fsck_cron_hour

The "hour" parameter to pass to cron (must be a String).

Default: 4

fsck_cron_month

The "month" parameter to pass to cron (must be a String).

Default: *

fsck_cron_monthday

The "day of the month" parameter to pass to cron (must be a String).

Default: *

fsck_cron_weekday

The "weekday" parameter to pass to cron (must be a String).

Default: 7

fsck_cron_repack_weekday

VERSIONS OF GROKMIRROR STARTING WITH 1.2

Grokmirror-1.2 has an option to run a repack-only cronjob that will identify repositories that can benefit from a repack, but will not fsck them. If you have a large collection that takes a long time to fsck, you can split your regular grok-fsck runs to happen only occasionally, but run --repack-only jobs on a much more frequent basis, such as nightly. Example setting (must be an array due to Puppet's weird treatment of range values): ['1-6'].

Default: undef

fsck_cron_extra_flags

Any additional flags to pass to grok-fsck (none at this time).

Default: undef

fsck_ignore_errors

If git fsck reports benign errors, you can list the match substrings in this array to ignore things you don't really care about (like dangling commits).

Default:

  [
    'dangling commit',
    'dangling blob',
    'notice: HEAD points to an unborn branch',
    'notice: No default references',
    'contains zero-padded file modes'
  ]

fsck_reclone_on_errors

If the fsck process finds errors that match any of these strings during its run, it will ask grok-pull to reclone this repository when it runs next. Only useful for minion mirrors, not for mirror masters.

Default:

  [
    'fatal: bad tree object',
    'fatal: Failed to traverse parents',
    'missing commit',
    'missing blob',
    'missing tree',
    'broken link',
  ]

Limitations

Tested on RHEL 6/7 and CentOS 6/7. Not tested anywhere else. :)