super3/gitbackup

Parallelisation Proposal


While our blocking bash script has taken us past 5,000 cloned users and over 5 TB of data, cloning more repositories more quickly will require a fundamentally different system.

I propose a system of stateless workers coordinated by a single Redis instance, relying on its atomic operations.

How?

  • A lock key is kept per username, together with a sorted set of users whose scores record each user's last successful update time (a seeding sketch in Node follows the tables below).

Locks

| key | expire |
| --- | ------ |

Membership Set

| member        | score |
| ------------- | ----- |
| montyanderson | 0     |
| super3        | 0     |
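
As a rough sketch of how that state could be seeded from Node (the ioredis client and the key names `users` and `lock:<username>` are assumptions here, not settled choices):

```js
// Sketch only: seed the membership set; score 0 = never successfully synced.
const Redis = require("ioredis");
const redis = new Redis();

async function seed() {
  await redis.zadd("users", 0, "montyanderson");
  await redis.zadd("users", 0, "super3");
  // lock:<username> keys are only created while a worker holds that user.
}

seed().then(() => redis.quit());
```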
  • A worker launches, finds the user with the lowest score whose lock has expired (i.e. has not been set within the last 10 seconds), and sets that user's lock key with a 10-second expiry; a Node sketch of this claim step follows the commands below.
> zrangebyscore users -inf +inf LIMIT 0 1
1) "montyanderson"
> set lock:montyanderson 'example value' NX PX 10000
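
In Node with ioredis, the claim step could look roughly like this (the worker id and key names are assumptions):

```js
// Sketch: claim the least-recently-synced user whose lock has expired.
async function claimUser(redis, workerId) {
  // Lowest scores first = least recently synced users.
  const candidates = await redis.zrangebyscore(
    "users", "-inf", "+inf", "LIMIT", 0, 10
  );
  for (const username of candidates) {
    // NX + PX 10000: only succeeds if no unexpired lock exists for this user.
    const ok = await redis.set(`lock:${username}`, workerId, "PX", 10000, "NX");
    if (ok === "OK") return username;
  }
  return null; // everyone near the front of the queue is currently locked
}
```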
  • The worker calls the GitHub API for the user, downloading all of their repositories and associated data, and refreshes the lock every 5 seconds so it never expires mid-sync (refresh sketch below the table).

Locks

| key                | expire            |
| ------------------ | ----------------- |
| lock:montyanderson | now() ± 5 seconds |
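
A heartbeat along these lines could keep the lock alive while the clone is in flight (again a sketch; the 5-second interval and lock value are assumptions):

```js
// Sketch: refresh the lock every 5 seconds so it never hits the 10 s expiry.
function startHeartbeat(redis, username, workerId) {
  return setInterval(async () => {
    // XX: only refresh if the lock still exists (i.e. we still own the user).
    await redis.set(`lock:${username}`, workerId, "PX", 10000, "XX");
  }, 5000);
}

// const timer = startHeartbeat(redis, "montyanderson", "worker-1");
// ... clone the user's repositories ...
// clearInterval(timer);
```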
  • On completion, the worker copies the repositories to a data store (central server/Storj) and sets the user's score in the membership set to now() (completion sketch below the table).

Membership Set

| member        | score |
| ------------- | ----- |
| montyanderson | now() |
| super3        | 0     |
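
The completion step might then be as small as this (a sketch under the same assumed key layout):

```js
// Sketch: mark a user as synced and release their lock.
async function complete(redis, username) {
  // Record the last successful update time as the sorted-set score.
  await redis.zadd("users", Date.now(), username);
  // Drop the lock so the user can be picked up again on a later cycle.
  await redis.del(`lock:${username}`);
}
```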

Why?

  • If a worker drops out while cloning a user's repositories, its lock expires and a new worker can begin again on that user after the timeout period (10 seconds).

  • Each worker could have its own API keys to increase rate limits.

  • Because workers are stateless, scaling can happen horizontally in a near-limitless fashion.

API Calls

POST /lock

Returns a JSON-encoded username string representing an unsynced user.
The user is locked for 10 seconds on request.

POST /lock/:username

Updates the lock timeout to 10 seconds from now; returns true if the lock had not already expired.

POST /lock/:username/complete

Closes the lock upon a successful sync.
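
If the coordinator were exposed as a small HTTP service, the three endpoints could be sketched with express and ioredis like this (route behaviour and key names follow the assumptions above, not a settled design):

```js
// Sketch of the coordinator API; all names are assumptions.
const express = require("express");
const Redis = require("ioredis");

const app = express();
const redis = new Redis();

// POST /lock — hand out the least-recently-synced unlocked user.
app.post("/lock", async (req, res) => {
  const candidates = await redis.zrangebyscore("users", "-inf", "+inf", "LIMIT", 0, 10);
  for (const username of candidates) {
    const ok = await redis.set(`lock:${username}`, "locked", "PX", 10000, "NX");
    if (ok === "OK") return res.json(username);
  }
  return res.status(404).json(null);
});

// POST /lock/:username — push the lock timeout 10 seconds into the future.
app.post("/lock/:username", async (req, res) => {
  const ok = await redis.set(`lock:${req.params.username}`, "locked", "PX", 10000, "XX");
  res.json(ok === "OK"); // false if the lock had already expired
});

// POST /lock/:username/complete — record the sync time and release the lock.
app.post("/lock/:username/complete", async (req, res) => {
  await redis.zadd("users", Date.now(), req.params.username);
  await redis.del(`lock:${req.params.username}`);
  res.json(true);
});

app.listen(3000);
```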

I think this could be written in node pretty easily using nodegit.
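
For the cloning itself, nodegit's promise-based clone would do the heavy lifting; a minimal sketch (the repository URL and local path are placeholders):

```js
// Sketch: clone a single repository with nodegit.
const Git = require("nodegit");

Git.Clone(
  "https://github.com/montyanderson/example.git",
  "./data/montyanderson/example"
).then(repo => {
  console.log("cloned into", repo.workdir());
}).catch(console.error);
```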

@super3 What are your thoughts on this?

@montyanderson Don't forget to document the API calls.