janstey/repour

Repour - Archival Code Service

Repour archives source code from many untrusted origins into trusted git repositories. It supports SCMs like git, hg, and svn, as well as archive formats like tar and zip. Tools like PME can optionally be applied as adjustments, and the results archived alongside the unmodified sources.

Why the name "Repour"? First, it contains the common short word for "repository" ("repo"); second, the action of "repouring" could be used to describe what the program does to the repositories.

Ethos

Repour is not a source repository mirroring service, it archives source code for builds. This may sound the same but there is a big difference. Repour solves an entire class of problems surrounding the unreliability of "SCM coordinates" (a SCM url and commit/reference/revision), that a mirror is not able to solve.

Mirrors are flawed

A mirror can only be an exact copy of the upstream repository. Critically, any modifications to existing objects must be propagated, or the mirror ceases to be a mirror. If the upstream repository is not history protected, then neither the upstream repository or the mirror are suitable to build from. If the upstream repository is history protected, then there is no need for a mirror (except to ensure server availability). However, since virtually all upstreams are not history protected, or only weakly protected, a different solution is required.

The single requirement

How can the mutability conflict be handled if a mirror can’t solve it? Strip everything back to the minimum required and go from there.

The only firm requirement of storing source code for builds is that the same coordinates, point to the same tree of files, forever. Everything else is a nice-to-have, including the upstream history.

Repour as part of a build system

This section describes how Repour is intended to be employed within the collection of services that form a build system.

What Repour will do and not do

Repour does not mirror or in any way store more from the origins than the files that are required to build from. In exchange Repour provides these guarantees:

Coordinate Immutability

The git URL and tag Repour returns will always be valid, and will always point to the same file tree. This is an important part of ensuring builds can be reproduced.

File Tree Deduplication

For the same name and file tree, the same tag will be returned, as was returned the first time the tree was seen. This can be used to avoid rebuilding the same sources.

No Conflict Intervention

The overall design means there are no history conflicts, so manual intervention is never required to archive sources. This is an important part of a highly automated build system.

What the repository provider must do and not do

Repour needs a git repository provider that grants it the following permissions:

Create repositories
Clone repositories
Push new tags
Push commits to new branches
Push fast-forward commits to existing branches

Also, the repository provider must not grant the following permissions to Repour and all other users:

Delete repositories
Delete tags or push existing tags (re-tagging)
Push non-fast-forward commits (force push)

In this configuration, commits and tags are immutable in the repository, i.e. history is protected.

What the rest of the build system must do and not do

Since Repour is probably not exposed at the edge of the build system, other parts will have access to the raw SCM coordinates before Repour processes them. For the vast majority of cases, the rest of the build system should not attempt to interpret the raw SCM coordinates. The coordinates should be merely passed along, so there is no duplication of effort, or bad assumptions made about mutability.

Once Repour is called, it returns a set of git coordinates that can be used by the rest of the build system. See What Repour will do and not do.

Direct clients

Repour /pull has a name parameter, which impacts which repository the sources end up in. As the logical grouping of sources is beyond the scope of Repour, it is left to the caller to determine. However, unique names result in unique deduplication namespaces, so if the caller wants to use Repour’s deduplication guarantee optimally, a minimal set of names should be used.

Repour would be perfectly happy if it were given the same name for every request (regardless of how related the sources are), so lean towards too few unique names rather than too many.

Builder clients

When a builder host gets the sources, it won’t communicate with Repour, but with the git repository provider where Repour has stored the sources.

The builder should not assume there is a default branch (master for example) in the repository, there won’t be one there, as the repository is a collection of orphan branches.

Repour specifically returns a tag instead of a commit ID so the builder can perform a fast shallow clone. Cloning the full repository is of no benefit to the build, as it uses only the single file tree. Also, the full clone time will grow linearly with the number of unique file trees stored in the repository. This is the recommended git command to use:

git clone --depth 1 --branch $tag $readonlyurl

Note that --branch is only usable with references (tags or branches), not commit IDs. The builder typically should not clone from a branch name as this is inherently mutable.

Interface

Pull

URL

/pull

Request (SCM)

Method	POST
Content-Type	application/json
Body (Schema)	{ "name": Match('^[a-zA-Z0-9_.][a-zA-Z0-9_.-]*(?<!\.git)$'), "type": Any("git", "hg", "svn"), "ref": nonempty_str, "url": Url(), Optional("adjust"): bool, Optional("callback"): { "url": Url(), Optional("method"): Any("PUT", "POST"), }, }
Body (Example)	{ "name": "teiid", "type": "git", "ref": "teiid-parent-8.11.0.Final", "url": "https://github.com/teiid/teiid.git" }

Request (Archive)

Method	POST
Content-Type	application/json
Body (Schema)	{ "name": Match('^[a-zA-Z0-9_.][a-zA-Z0-9_.-]*(?<!\.git)$'), "type": "archive", "url": Url(), Optional("adjust"): bool, Optional("callback"): { "url": Url(), Optional("method"): Any("PUT", "POST"), }, }
Body (Example)	{ "name": "teiid", "type": "archive", "url": "https://github.com/teiid/teiid/archive/teiid-parent-8.11.0.Final.tar.gz" }

Response (Success)

Status	200
Content-Type	application/json
Body (Schema)	{ "branch": str, "tag": str, "url": { "readwrite": Url(), "readonly": Url(), } Optional("pull"): { "branch": str, "tag": str, "url": { "readwrite": Url(), "readonly": Url(), } } }
Body (Example)	{ "branch": "pull-1439285353", "tag": "pull-1439285353-root", "url": { "readwrite": "file:///tmp/repour-test-repos/example", "readonly": "file:///tmp/repour-test-repos/example" } }
Body (Example adjust)	{ "branch": "adjust-1439285354", "tag": "adjust-1439285354-root", "url": { "readwrite": "file:///tmp/repour-test-repos/example", "readonly": "file:///tmp/repour-test-repos/example" }, "pull": { "branch": "pull-1439285353", "tag": "pull-1439285353-root", "url": { "readwrite": "file:///tmp/repour-test-repos/example", "readonly": "file:///tmp/repour-test-repos/example" }, } }

Response (Invalid request body)

Status	400
Content-Type	application/json
Body (Schema)	[ { "error_message": str, "error_type": str, "path": [str], } ]
Body (Example)	[ { "error_message": "expected a URL", "error_type": "dictionary value", "path": ["url"] }, { "error_message": "expected str", "error_type": "dictionary value", "path": ["name"] } ]

Response (Processing error)

Status	400
Content-Type	application/json
Body (Schema)	{ "desc": str, "error_type": str, "error_traceback": str, str: object, }
Body (Example)	{ "desc": "Could not clone with git", "error_type": "PullCommandError", "error_traceback": "d41d8cd98f00b204e9800998ecf8427e", "cmd": [ "git", "clone", "--branch", "teiid-parent-8.11.0.Final", "--depth", "1", "--", "git@github.com:teiid/teiid.gitasd", "/tmp/tmppizdwfsigit" ], "exit_code": 128 }

Adjust

URL

/adjust

Request

Method	POST
Content-Type	application/json
Body (Schema)	{ "name": Match('^[a-zA-Z0-9_.][a-zA-Z0-9_.-]*(?<!\.git)$'), "ref": nonempty_str, Optional("callback"): { "url": Url(), Optional("method"): Any("PUT", "POST"), }, }
Body (Example)	{ "name": "example", "ref": "pull-1436349331-root" }

Response (Success)

Status	200
Content-Type	application/json
Body (Schema)	{ "branch": str, "tag": str, "url": { "readwrite": Url(), "readonly": Url(), } }
Body (Example)	{ "branch": "adjust-1439285354", "tag": "adjust-1439285354-root", "url": { "readwrite": "file:///tmp/repour-test-repos/example", "readonly": "file:///tmp/repour-test-repos/example" } }

Response (Invalid request body)

Status	400
Content-Type	application/json
Body (Schema)	[ { "error_message": str, "error_type": str, "path": [str], } ]
Body (Example)	[ { "error_message": "expected a URL", "error_type": "dictionary value", "path": ["url"] }, { "error_message": "expected str", "error_type": "dictionary value", "path": ["name"] } ]

Response (Processing error)

Status	400
Content-Type	application/json
Body (Schema)	{ "desc": str, "error_type": str, str: object, }
Body (Example)	{ "desc": "Could not clone with git", "error_type": "PullCommandError", "cmd": [ "git", "clone", "--branch", "teiid-parent-8.11.0.Final", "--depth", "1", "--", "git@github.com:teiid/teiid.gitasd", "/tmp/tmppizdwfsigit" ], "exit_code": 128 }

Clone

Checkout a git ref from the origin repo and force push it to the target repo. If ref is not a branch name, new branch named branch-{ref} pointing to the ref will be pushed instead. The response will contain the resulting ref name.

URL

/clone

Request

Method	POST
Content-Type	application/json
Body (Schema)	{ "type": "git", # only git supported for now "ref": nonempty_str, "originRepoUrl": Url(), "targetRepoUrl": Url(), Optional("callback"): { "url": Url(), Optional("method"): Any("PUT", "POST"), } }

Response (Success)

Status	200
Content-Type	application/json
Body (Schema)	{ "type": "git", # only git supported for now "ref": nonempty_str, "originRepoUrl": Url(), "targetRepoUrl": Url(), }

Clone Adjust

Checkout a git ref from the origin repo and push it to the target repo. If the ref is a branch, the branch will be pushed to the target repo. If the ref is a tag, the tag will be pushed to the target repo. If the ref is a SHA, then a branch will be created with name 'branch-<ref>' and pushed as a branch to the target repo. The latter is required to get a particular commit to the target repo.

You can specify an 'adjust' option to set if you want to run PME or not on it. You can also specify the 'adjustParameters'. The result of the PME run, together with the tag information is found in the response, under key 'adjust_result'. The value of 'adjust_result' is the same as for '/adjust'.

The response will contain the exact ref name as what was sent

Callback mode

All endpoints can operate in callback mode, which is activated by defining the optional callback parameter. In this mode an immediate response is given instead of waiting for the required processing to complete.

A request that does not pass the initial validation check will return the documented "Invalid request body" response. Otherwise, the following response will be sent:

Status	202
Content-Type	application/json
Body (Schema)	{ "callback": { "id": str, }, }
Body (Example)	{ "callback": { "id": "YQSQOIGKB3TPJPB7Q6UARPULTASTXW7WOZF2JZCXLGQCBYSE" } }

The body of the usual "Success" or "Processing error" response will then be sent at a later time, as an HTTP request to the URL specified in the callback request parameter. A "callback" object will be added, containing the status code and the ID string previously returned.

Method	POST (by default, or PUT if so specified)
Content-Type	application/json
Body (Schema)	{ object: object, "callback": { "status": int, "id": str, }, }
Body (Example)	{ "branch": "pull-1439285353", "tag": "pull-1439285353-root", "url": { "readwrite": "file:///tmp/repour-test-repos/example", "readonly": "file:///tmp/repour-test-repos/example" }, "callback": { "status": 200, "id": "YQSQOIGKB3TPJPB7Q6UARPULTASTXW7WOZF2JZCXLGQCBYSE" } }

Callback websocket for live logs

All endpoints that operate in callback mode can be eligible for live logs via websockets. Once the callback id is obtained, you can establish a websocket connection to /callback/{callback_id}. The server will then push any logs back to the client. The logs are in string format.

Docker Images and Open Shift

There are two docker images defined in this repository:

Dockerfile, the main image containing the Repour server.
Dockerfile.gitolite, a default repository provider image. Mostly intended for testing or non-critical production use.

The docker images can be run in plain Docker or OpenShift. Some less-than-ideal design choices were made to fit the applications into the OSE-compatible containers:

pid1.py is the entrypoint of both images, and remains running for the life of the container. It works around the "Docker PID1 Zombie Problem" by reaping adopted children in addition to the primary child defined by its arguments.
au.py runs second in both images, but finishes with an exec call, so it doesn’t remain running. It detects if the container UID has been forced to a non-existing user (as OpenShift does). If so, it activates nss_wrapper so git and ssh can continue to operate.
gitolite_et_al.py runs third in Dockerfile.gitolite only, it configures and starts the three processes required for gitolite.
- The HTTP and SSH servers can’t be split into seperate images because OSE does not allow containers to share persistent volumes
- The lack of shared persistent volumes in OSE also means the container is not scalable
- The third process in the container is tail, it reads the gitolite log so OSE can see it on the container stdout.
- The configuration can’t be included in the image because the working directory is intended to be the persistent volume mount, which will start empty in OSE.

Locally Simulated OSE

The integration tests use the Docker images in an OSE-like environment. To do something similar yourself, you first need a volume mount that is structured the same as an OpenShift Secret volume would be:

mkdir -p /tmp/secrets/repour /tmp/secrets/admin
ssh-keygen -f /tmp/secrets/repour/repour -N ""
ssh-keygen -f /tmp/secrets/admin/admin -N ""

Note the real Secret volume could actually be multiple volumes (mounted at /mnt/secrets/repour and /mnt/secrets/admin), so least privilege can apply. Dockerfile.gitolite only needs to know the public keys of both users, and Dockerfile only needs to know the private key of the repour user. Neither needs to know the admin user’s private key.

Then start both images, mounting the volume as shared. Dockerfile needs some environment variables: where to find the repository provider, and a REST endpoint required by PME (provide a dummy value if not using adjust)

docker run --volume "/tmp/secrets:/mnt/secrets:z" --user="$UID" -d --name repour_git repour_integration_test_git:latest
docker run --volume "/tmp/secrets:/mnt/secrets:z" --user="$UID" -d --link repour_git:git --name repour -e "REPOUR_GITOLITE_HOST=$(docker inspect --format '{{ .NetworkSettings.IPAddress }}' repour_git)" -e "REPOUR_PME_DA_URL=foo" repour_integration_test:latest

Operating Mode B

The above sections describe Repour’s recommended, default, operating mode ("A"). Repour can also run in operating mode "B", which offers some of the same functionality without the same guarantees that mode A offers.

Mode B changes the API such that creating and maintaining the internal repositories is the responsibility of the calling user, instead of Repour itself.

Cautions

Repour may not have write access to the given repositories, so there is an extra failure mode added to all operations.
Repour cannot assert anything about the repositories, so there is no guarantee of coordinate immutability, build reproducibility, or any sort of repository consistency.
Repour’s test suite will be unable to confirm repository creation works as expected, so repository creation will be untested.

Enabling Mode B

Add --mode-b to the command line.

Interface changes

For both /pull and /adjust, replace the "name" parameter with "internal_url".

Partial Body (Schema)	{ "internal_url": { "readwrite": Url(), "readonly": Url(), } }
Partial Body (Example)	{ "internal_url": { "readwrite": "ssh://git@example.com/somerepo.git", "readonly": "http://example.com/somerepo.git" } }

Development

Local Server Setup

Prerequisites

Python 3.4.1+
pip
Git 2.4.3+
Mercural (optional, for hg support)
Subversion (optional, for svn support)
Docker 1.7.1+ (optional, for integration tests)

Setup the virtual environment with vex

Install vex for the current user with pip3 install --user vex
Ensure $PATH includes $HOME/.local/bin
Install the required C libraries with system package manager. On Fedora: dnf install python3-devel python3-Cython libyaml-devel
vex -m --python python3.4 rpo pip install -r venv/runtime.txt
Optionally: vex rpo pip install -r venv/integration-test.txt

Recreating the virtual environment

Delete the old environment with vex -r rpo true
Rerun the vex pip install commands

Configure

Copy the example configuration in config-example.yaml to config.yaml, then edit.

Start the server

vex rpo python -m repour run

For more information, add the -h switch to the command.

Tests

Unit Tests

Unit tests are self-contained and work without an internet connection. To run them:

vex rpo python -m unittest

Integration Tests

GitLab integration tests will be executed using the local Docker server. To run them:

ensure your vex environment includes venv/integration-test.txt
prefix REPOUR_RUN_IT=1 before the unittest command, to set the triggering environment variable. For example: REPOUR_RUN_IT=1 vex rpo python -m unittest

License

The content of this repository is released under the ASL 2.0, as provided in the LICENSE file. See the NOTICE file for the copyright statement and a list of contributors. By submitting a "pull request" or otherwise contributing to this repository, you agree to license your contribution under the license identified above.