Parallelisation Proposal
While our blocking bash script has got us to over 5000 cloned users and over 5 TB of data, if we want to clone more repositories more quickly, we need a fundamentally different system.
I propose a system of stateless workers coordinated by a single Redis instance, relying on its atomic operations.
How?
- A `lock` key per username, and a sorted set of users representing each user's last successful update time (a seeding sketch follows the tables below).

Locks

key | expire
---|---

Membership Set

member | score
---|---
montyanderson | 0
super3 | 0
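For illustration, the membership set could be seeded with `ZADD`, with a score of 0 meaning "never synced". This is a sketch of the Redis state, not part of the proposal itself:

```
> zadd users 0 montyanderson
(integer) 1
> zadd users 0 super3
(integer) 1
```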
- A worker launches, finds the user with the lowest score whose lock has expired (or was never taken), and sets that user's lock key with a 10-second expiry (a Node sketch follows the commands below).
```
> zrange users 0 0
1) "montyanderson"
> set lock:montyanderson 'example value' NX EX 10
OK
```
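A minimal sketch of the acquire step in Node, assuming the `ioredis` client; the worker ID used as the lock value and the candidate batch size are illustrative choices, not part of the proposal:

```js
const Redis = require('ioredis');
const redis = new Redis();

// Try candidates in score order (least recently synced first) until a
// SET NX call wins; NX means the lock key must not already exist, so
// acquisition is atomic even with many workers racing.
async function acquireUser(workerId) {
  const candidates = await redis.zrange('users', 0, 9); // 10 lowest scores
  for (const username of candidates) {
    const ok = await redis.set(`lock:${username}`, workerId, 'EX', 10, 'NX');
    if (ok === 'OK') return username; // lock taken
  }
  return null; // everyone is locked; retry later
}
```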
- The worker calls the GitHub API for the user, downloading all of their repositories and associated data, and refreshes the lock every 5 seconds so it cannot expire mid-clone (see the sketch after the table).
Locks

key | expire
---|---
lock:montyanderson | now() ± 5 seconds
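Continuing the sketch above, the refresh could be a simple interval; `XX` (only set if the key already exists) guards against reviving a lock that has already expired out from under the worker:

```js
// Refresh the lock every 5 seconds while the clone is in progress.
const heartbeat = setInterval(() => {
  redis.set(`lock:${username}`, workerId, 'EX', 10, 'XX');
}, 5000);

// ... clone the user's repositories ...

clearInterval(heartbeat);
```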
- On completion, the worker copies the repositories to a data store (a central server or Storj) and the user's score is set to `now()` (sketched after the table).
Membership Set

member | score
---|---
montyanderson | now()
super3 | 0
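Completion then records the sync time and releases the lock; doing both in a `MULTI` block keeps the two keys consistent. Again continuing the same sketch:

```js
// Record the successful sync time and drop the lock atomically.
async function completeUser(username) {
  await redis
    .multi()
    .zadd('users', Date.now(), username)
    .del(`lock:${username}`)
    .exec();
}
```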
Why?
- In the event a worker drops out while cloning a user's repositories, a new worker can begin again on that user after the timeout period (10 seconds).
- Each worker could have its own API keys to increase rate limits.
- Scaling can happen horizontally in a near-limitless fashion.
API Calls
`POST /lock`

Returns a JSON-encoded username string representing an unsynced user. The user is locked for 10 seconds on request.

`POST /lock/:username`

Updates the lock timeout to 10 seconds from now; returns `true` if the lock has not expired.

`POST /lock/:username/complete`

Closes the lock upon a successful sync.
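For illustration, a worker's loop against this API might look like the following; the coordinator host, the response shapes, and the `cloneAllRepositories` routine are assumptions (Node 18+ global `fetch`):

```js
// Hypothetical coordinator URL; the paths come from the API above.
const API = 'http://coordinator:3000';

async function work() {
  // Acquire the least recently synced unlocked user.
  const res = await fetch(`${API}/lock`, { method: 'POST' });
  const username = await res.json(); // JSON-encoded username string

  // Refresh the lock every 5 seconds while cloning.
  const heartbeat = setInterval(
    () => fetch(`${API}/lock/${username}`, { method: 'POST' }),
    5000
  );

  try {
    await cloneAllRepositories(username); // assumed clone routine
    await fetch(`${API}/lock/${username}/complete`, { method: 'POST' });
  } finally {
    clearInterval(heartbeat);
  }
}
```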
I think this could be written in Node pretty easily using nodegit.
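For the clone step itself, nodegit's `Clone` covers the basic case; a minimal sketch, where the repository URL and target path are placeholders:

```js
const Git = require('nodegit');

// Clone a single repository to a local path; nodegit returns a Promise.
Git.Clone('https://github.com/montyanderson/example', './repos/example')
  .then(repo => console.log('cloned to', repo.workdir()))
  .catch(err => console.error(err));
```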
@super3 What are your thoughts on this?
@montyanderson Don't forget to document the API calls.
@super3 Done :)