super3/gitbackup

Parallelisation Proposal


While our blocking bash script has taken us past 5,000 cloned users and over 5 TB of data, cloning more repositories more quickly will require a fundamentally different system.

I propose a system of stateless workers coordinated by a single Redis instance, relying on its atomic operations.

How?

  • A lock key is kept per username, together with a sorted set of users whose scores record each user's last successful update time (a seeding sketch in Node follows the tables below).

Locks

| key | expire |
| --- | ------ |

Membership Set

| member        | score |
| ------------- | ----- |
| montyanderson | 0     |
| super3        | 0     |
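
As a rough sketch of how that state could be seeded from Node (the ioredis client and the key names `users` and `lock:<username>` are assumptions here, not settled choices):

```js
// Sketch only: seed the membership set; score 0 = never successfully synced.
const Redis = require("ioredis");
const redis = new Redis();

async function seed() {
  await redis.zadd("users", 0, "montyanderson");
  await redis.zadd("users", 0, "super3");
  // lock:<username> keys are only created while a worker holds that user.
}

seed().then(() => redis.quit());
```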
  • A worker launches, finds the user with the lowest score whose lock has expired (i.e. has not been set within the last 10 seconds), and sets that user's lock key with a 10-second expiry; a Node sketch of this claim step follows the commands below.
> zrangebyscore users -inf +inf LIMIT 0 1
1) "montyanderson"
> set lock:montyanderson 'example value' NX PX 10000
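
In Node with ioredis, the claim step could look roughly like this (the worker id and key names are assumptions):

```js
// Sketch: claim the least-recently-synced user whose lock has expired.
async function claimUser(redis, workerId) {
  // Lowest scores first = least recently synced users.
  const candidates = await redis.zrangebyscore(
    "users", "-inf", "+inf", "LIMIT", 0, 10
  );
  for (const username of candidates) {
    // NX + PX 10000: only succeeds if no unexpired lock exists for this user.
    const ok = await redis.set(`lock:${username}`, workerId, "PX", 10000, "NX");
    if (ok === "OK") return username;
  }
  return null; // everyone near the front of the queue is currently locked
}
```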
  • The worker calls the GitHub API for the user, downloading all of their repositories and associated data, and refreshes the lock every 5 seconds so it never expires mid-sync (refresh sketch below the table).

Locks

| key                | expire            |
| ------------------ | ----------------- |
| lock:montyanderson | now() ± 5 seconds |
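
A heartbeat along these lines could keep the lock alive while the clone is in flight (again a sketch; the 5-second interval and lock value are assumptions):

```js
// Sketch: refresh the lock every 5 seconds so it never hits the 10 s expiry.
function startHeartbeat(redis, username, workerId) {
  return setInterval(async () => {
    // XX: only refresh if the lock still exists (i.e. we still own the user).
    await redis.set(`lock:${username}`, workerId, "PX", 10000, "XX");
  }, 5000);
}

// const timer = startHeartbeat(redis, "montyanderson", "worker-1");
// ... clone the user's repositories ...
// clearInterval(timer);
```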
  • On completion, the worker copies the repositories to a data store (central server/Storj) and sets the user's score in the membership set to now() (completion sketch below the table).

Membership Set

| member        | score |
| ------------- | ----- |
| montyanderson | now() |
| super3        | 0     |
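
The completion step might then be as small as this (a sketch under the same assumed key layout):

```js
// Sketch: mark a user as synced and release their lock.
async function complete(redis, username) {
  // Record the last successful update time as the sorted-set score.
  await redis.zadd("users", Date.now(), username);
  // Drop the lock so the user can be picked up again on a later cycle.
  await redis.del(`lock:${username}`);
}
```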

Why?

  • If a worker drops out while cloning a user's repositories, its lock expires and a new worker can begin again on that user after the timeout period (10 seconds).

  • Each worker could have its own API keys to increase rate limits.

  • Because workers are stateless, scaling can happen horizontally in a near-limitless fashion.

API Calls

POST /lock

Returns a JSON-encoded username string representing an unsynced user.
The user is locked for 10 seconds on request.

POST /lock/:username

Updates the lock timeout to 10 seconds from now; returns true if the lock had not already expired.

POST /lock/:username/complete

Closes the lock upon a successful sync.
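
If the coordinator were exposed as a small HTTP service, the three endpoints could be sketched with express and ioredis like this (route behaviour and key names follow the assumptions above, not a settled design):

```js
// Sketch of the coordinator API; all names are assumptions.
const express = require("express");
const Redis = require("ioredis");

const app = express();
const redis = new Redis();

// POST /lock — hand out the least-recently-synced unlocked user.
app.post("/lock", async (req, res) => {
  const candidates = await redis.zrangebyscore("users", "-inf", "+inf", "LIMIT", 0, 10);
  for (const username of candidates) {
    const ok = await redis.set(`lock:${username}`, "locked", "PX", 10000, "NX");
    if (ok === "OK") return res.json(username);
  }
  return res.status(404).json(null);
});

// POST /lock/:username — push the lock timeout 10 seconds into the future.
app.post("/lock/:username", async (req, res) => {
  const ok = await redis.set(`lock:${req.params.username}`, "locked", "PX", 10000, "XX");
  res.json(ok === "OK"); // false if the lock had already expired
});

// POST /lock/:username/complete — record the sync time and release the lock.
app.post("/lock/:username/complete", async (req, res) => {
  await redis.zadd("users", Date.now(), req.params.username);
  await redis.del(`lock:${req.params.username}`);
  res.json(true);
});

app.listen(3000);
```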

I think this could be written in node pretty easily using nodegit.
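
For the cloning itself, nodegit's promise-based clone would do the heavy lifting; a minimal sketch (the repository URL and local path are placeholders):

```js
// Sketch: clone a single repository with nodegit.
const Git = require("nodegit");

Git.Clone(
  "https://github.com/montyanderson/example.git",
  "./data/montyanderson/example"
).then(repo => {
  console.log("cloned into", repo.workdir());
}).catch(console.error);
```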

@super3 What are your thoughts on this?

@montyanderson Don't forget to document the API calls.