ontohub/hets-agent

Distribute work to versioned workers

Closed this issue · 8 comments

We need to distribute work among workers that have at least version x. This is not trivially possible with RabbitMQ, so we need to find a way to do it.

I propose the following solution:

  • We have a publisher (backend) that requires a minimum worker version
  • We have a number of workers whose versions might be lower than, equal to, or greater than the required version
  • The workers listen to an exchange-backed queue (broadcast) min-worker-version, on which the publisher publishes the minimum required version
  • Each worker compares its version against that minimum.
    • If the version is greater than or equal to the required version, the worker subscribes to a queue worker-version-xxxxx where xxxxx is the minimum required version
    • Else it does nothing
  • The publisher publishes its work items on the queue worker-version-xxxxx to distribute the workload
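The worker-side decision in the list above could be sketched roughly as follows. This is only an illustration: it assumes dotted numeric version strings (the real hets version format may differ), and `parse_version`, `is_compatible` and `queue_to_subscribe` are hypothetical helper names, not part of hets-agent.

```python
from typing import Optional, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as "1.10.0" into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def is_compatible(worker_version: str, min_version: str) -> bool:
    """True if the worker version is greater than or equal to the minimum."""
    worker = parse_version(worker_version)
    minimum = parse_version(min_version)
    # Pad the shorter tuple with zeros so that "1.0" and "1.0.0" compare equal.
    length = max(len(worker), len(minimum))
    worker += (0,) * (length - len(worker))
    minimum += (0,) * (length - len(minimum))
    return worker >= minimum


def queue_to_subscribe(worker_version: str, min_version: str) -> Optional[str]:
    """Return the work queue name, or None if the worker is too old."""
    if is_compatible(worker_version, min_version):
        return f"worker-version-{min_version}"
    return None
```

Comparing integer tuples instead of raw strings avoids the trap where "1.10.0" sorts before "1.9.0".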

To include new workers into the queue, the queue must send the last published version to the new worker. This can be accomplished by using the rabbitmq-recent-history-exchange, which keeps a history of n (default: 20, but in our case I think one would be enough) messages that are sent to new subscribers.
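To make the intended behaviour concrete, here is a small in-memory stand-in for a broadcast that replays its most recent messages to late subscribers. It does not talk to RabbitMQ or the plugin; the class `RecentHistoryBroadcast` and its methods are hypothetical and only model the idea with a history size of one.

```python
from collections import deque
from typing import Callable, Deque, List


class RecentHistoryBroadcast:
    """Broadcast that remembers the last `history_size` messages."""

    def __init__(self, history_size: int = 1) -> None:
        # deque(maxlen=n) silently drops the oldest entry when full.
        self._history: Deque[str] = deque(maxlen=history_size)
        self._subscribers: List[Callable[[str], None]] = []

    def publish(self, message: str) -> None:
        """Remember the message and deliver it to all current subscribers."""
        self._history.append(message)
        for deliver in self._subscribers:
            deliver(message)

    def subscribe(self, deliver: Callable[[str], None]) -> None:
        """Replay the remembered messages, then register for future ones."""
        for message in self._history:
            deliver(message)
        self._subscribers.append(deliver)
```

A worker that starts after the minimum version was announced would still receive the latest announcement on subscribing, which is the property the proposal relies on.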

When the publisher is upgraded and requires a new worker version, a new minimum version is published to the min-worker-version queue and all the workers again compare their versions and subscribe to the new queue.

jelmd commented

IMHO, completely wrong. The publisher has a job, which needs to be done. It should NOT care about workers at all, and especially not how they are implemented or even which version they have, otherwise the whole MQ stuff is more or less useless, just overhead.

Well, the publisher cares about a minimum version of the workers (e.g. it needs features of the workers that are not implemented in earlier versions). The publisher needs to communicate to the workers what it needs.

What the publisher does not care about is how many workers there are, which version a specific worker is running or how the workers perform their task.

The best solution would be to not check for the worker version (or rather the hets version, which is essentially what it will come down to), but for supported worker features. I believe the solution could be applied here too, but it would require some additions to hets (report which features are supported).

What the publisher does now is just saying "Hey, I need workers with a minimum version of xyz." and every worker that is compatible is responsible for listening to messages for that publisher. That's all the publisher does. Announce a requirement and publish the messages; the rest is handled by RabbitMQ and the workers.

jelmd commented

Hmmm, I don't agree. The only thing the publisher needs to care about is properly specifying the job, which needs to be done, and possibly the expected outcome. The MQ system will publish the job description to an interested client, which just accepts or rejects it, depending on whether it thinks it can do it or is allowed to do it (I see an MQS more or less like a blackboard with some colored headlines/boxes).

Just remember, basically there might be several clients, implemented in different languages and all having a different version, because they were neither born on the same place/point in time nor have the same release cycle.

So tying the stuff together by worker version is the opposite of decoupling systems. As said, specifying the job correctly is IMHO the way to go. Whether you specify a min. "ABI", ehhhm API ;-) version to follow, or [probably better] feature tags is a different thing. I haven't heard many details yet about the "distribution" of such messages aka jobs, but usually the client is/should be able to reject jobs, too (i.e. if it detects that it can't fulfill the promises it made ...). However, handling these things is another POD. And apropos POD: Have you already thought about "load balancing"? It should be discussed as well, because all this IMHO influences which MQ system/protocol/workarounds/extensions to use or need to be written ...

PS: Wrt. hets: I think right now it doesn't matter whether hets spits out the feature tags or the client/controller does the mapping itself (maps a version string to a feature set). I think doing the mapping itself has the advantage of having more control over it, being able to simplify it and thus being able to use other clients, i.e. not being so tied to hets (perhaps in future there might be clients which can handle a portion of the "big" job that needs to be done - keyword: partitioning) ...
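A client-side mapping from version string to feature set, as suggested in the PS, might look like the sketch below. The version numbers and feature tags in `FEATURES_BY_VERSION` are made up for illustration, and `features_of` and `can_handle` are hypothetical names; nothing here reflects what hets actually reports.

```python
from typing import Dict, FrozenSet, Iterable

# Hypothetical table maintained by the client/controller, not by hets.
FEATURES_BY_VERSION: Dict[str, FrozenSet[str]] = {
    "1.0": frozenset({"parse"}),
    "1.1": frozenset({"parse", "prove"}),
}


def features_of(version: str) -> FrozenSet[str]:
    """Look up the feature tags of a version; unknown versions have none."""
    return FEATURES_BY_VERSION.get(version, frozenset())


def can_handle(version: str, required: Iterable[str]) -> bool:
    """True if every required feature tag is offered by the given version."""
    return set(required) <= features_of(version)
```

With such a table, a job only needs to name the feature tags it requires, and each client decides for itself whether it qualifies.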

If I understand you correctly, you propose the following:

The publisher sends a message describing the job that it wants done and its requirements. The worker that gets the job decides if it is able to complete the job and either accepts or declines the job.

I see two problems with this approach:

  1. Declining jobs is not possible with RabbitMQ afaik. The closest thing you can do is not acknowledging the job, but this will lead to RabbitMQ considering the worker as not available anymore. (Edit: this is not quite right. There is actually a way to decline messages and requeue them.)
  2. It would be quite inefficient to basically decline every message a worker gets (1), and let another worker try. If no worker supports the requirements, the message would bounce around workers indefinitely.

(1): we're talking about one backend that publishes messages, and that backend always has the same requirements for the workers (at least until the backend is updated, but this is handled in my proposed solution as well as multiple backends with different requirements)
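The second problem can be illustrated with a toy model in which a rejected job is simply offered to the next worker. The function name `deliveries_until_accepted`, the strict round-robin order and the `max_deliveries` cap are assumptions made for this sketch; the cap exists only so that the simulation terminates.

```python
from typing import Callable, Optional, Sequence


def deliveries_until_accepted(
    workers: Sequence[Callable[[str], bool]],
    job: str,
    max_deliveries: int,
) -> Optional[int]:
    """Offer a job to workers in turn, requeueing it after each rejection.

    Each worker is a callable that returns True to accept the job.
    Returns the number of deliveries needed, or None if the cap was hit.
    """
    if not workers:
        return None
    for delivery in range(max_deliveries):
        accepts = workers[delivery % len(workers)]
        if accepts(job):
            return delivery + 1
    # Without the cap, a job that nobody supports would circulate forever.
    return None
```

If no worker meets the requirements, every delivery is wasted work, which is the inefficiency described in point 2.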

In regards to load balancing: Jobs are published to queues and workers subscribe to those queues. The distribution of workload between workers subscribing to the same queue happens automatically by RabbitMQ.

You're right that we might some day have another worker implementation that does the same as hets, and that the versioning scheme won't work then, but honestly hets is so complex that I don't see that happening anytime soon, so we should not worry too much about it now. We could easily send the required features instead of the required version and let the workers decide if they support these features (and then subscribe to queue xyz).

The difference between my proposal and yours is actually not that big: Instead of sending the requirement for a job with the job, we send it ahead of time and create a queue exactly for jobs with this requirement (again: the requirement for jobs does not change while running the backend; it might change when deploying a new version of the backend though) and tell the workers supporting this requirement to listen to this queue (and they can listen to multiple queues, so multiple backends with different requirements are no problem).

jelmd commented

Job description: yes. Requirements wrt. outcome: ok (but these could be a silent/hidden contract to which both parties agree). It is how the real world works: If my car needs a TÜV, I just drive to pitstop and tell them I need a TÜV and optionally ask how long it will probably take. I don't tell them which mechanic has to do the check, or say that the mechanic needs to be at least 40 years old, or say that he needs this or that certificate [otherwise they would probably respond: yes man, wait, we'll get medical help for you ;-)]. I simply specify the job in a way they understand (get me the TÜV) and trust them that they can do it in a professional way, because they advertise it this way.

What you describe is actually very different: It is like someone advertising jobs for mechanics who have N years of work experience, this and that certificate, ... and who from time to time need to do exactly one and always the same job for him: making a TÜV check for a car. So, because they are otherwise lying around at home ;-) , in principle the advertiser is more or less their employer, and therefore he can hire them directly to be more flexible, get jobs done faster ...

So if you wanna do/tie the things as you describe above, an MQS is really just overhead (it makes things worse due to the required serialization) and doesn't make any sense. In this case a very simple service registry (e.g. https://www.npmjs.com/package/service-registry), where workers can register and the controller can ask it for availability, is all that is needed. The controller can then communicate with them directly, w/o any serialization overhead or complex queue/topic/bla selection algorithms, eliminating persistence concerns, etc. Of course, the controller needs to maintain a list of jobs offered to the workers anyway. And load balancing as RabbitMQ does it is simple as well: it just increments the cursor in its "list of workers" and moves it back to the start once the end has been reached - so pretty dumb, and IMHO a really bad approach taking e.g. the hets memory usage we've seen so far into account [to get this right, one either needs to write an extension for the MQ, or honor this fact by extending the com protocol between controller and workers].
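The cursor-based distribution described in the paragraph above amounts to plain round-robin. A minimal sketch, with `RoundRobin` and `next_worker` as hypothetical names and no claim about how RabbitMQ implements it internally:

```python
from typing import Generic, List, Sequence, TypeVar

T = TypeVar("T")


class RoundRobin(Generic[T]):
    """Hand out workers in order, wrapping around at the end of the list."""

    def __init__(self, workers: Sequence[T]) -> None:
        if not workers:
            raise ValueError("at least one worker is required")
        self._workers: List[T] = list(workers)
        self._cursor = 0

    def next_worker(self) -> T:
        """Return the worker under the cursor and advance the cursor by one."""
        worker = self._workers[self._cursor]
        self._cursor = (self._cursor + 1) % len(self._workers)
        return worker
```

Note that this scheme ignores how busy a worker is or how much memory it currently uses, which is the weakness pointed out above.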

Also you are right, RabbitMQ seems not to be the right solution, because it allows a client to reject a job, but unfortunately it just re-queues it and thus might bomb the same worker again and again. But AFAIK nobody said that this one has to be used (or that AMQP is sufficient for this purpose). BTW: I can be wrong, but IIRC RabbitMQ doesn't even have a "dead letter" queue ...

So basically the questions to answer are:

  • Do we really need a MQS?
  • What benefits do we get from it, i.e. what problem[s] exactly does it solve (why)?
  • What kind of complexity/overhead/problems/dependencies does it introduce to the application and environment (why)?
  • Alternatives?

Why would we benefit from a service registry? We would still need to

  • implement a queueing system (which we don't even need to implement with the current version using Sidekiq)
  • implement workload distribution
  • register and unregister workers
  • handle worker crashes
  • ...

For a benefit of what?

Message Queueing is exactly what we need. Asynchronous job handling with queues, workload distribution and automatic subscription/crash handling. I really don't see what overhead that would be for us. Using RabbitMQ is dead-simple, while implementing a custom solution that works reliably is not.

By the way: we will be using RabbitMQ for more than just this use case (e.g. sending back notifications about started and finished jobs, possible errors and as a general messaging system between the standalone backend components), so using something different would actually have more overhead (if there even is any with RabbitMQ, which I don't think is the case).

I'm honestly a bit puzzled why you say that my proposition is "completely wrong" when you don't know what our requirements are.

We have decided to use @phyrog's initial solution (however, the queues should be called something like parsing-version-xxxxx, in contrast to proving queues for specific provers, see also spechub/Hets#1688). Moreover, we will file two feature requests to RabbitMQ: 1) RabbitMQ should be able to compare versions, 2) RabbitMQ should be able to match complex service requirements/descriptions. Once this has been implemented in RabbitMQ, we will use that.

jelmd commented

Hmm, a queueing system is nothing but a simple list of more or less sophisticated objects, where one needs to maintain a cursor (for efficiency). Very easy to implement - I guess every student is able to do that.

Workload distribution in terms of RabbitMQ is just incrementing the cursor by one - not challenging at all, and smart is something different. If you wanna get proper load balancing you need to implement it yourself, so ...

[Un]register workers ... A very very simple lib/app like the one mentioned above can do that. You don't even need a full blown service registry, because there is just a single "server" aka controller - only the client side is relevant.

Not sure what "handle worker crashes" means. Just handling timeouts (the controller needs to do that anyway) and re-issuing the task to another worker? That's easy to accomplish as well, either by setting a timeout handler or, the very trivial way, by iterating through the list and comparing properties.
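The "very trivial way" of iterating through the list could be sketched as below. The `Assignment` record, its fields and the function `timed_out` are hypothetical; how the controller would actually track jobs is not specified in this thread.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Assignment:
    """A job that the controller has handed to a worker."""

    job_id: str
    worker: str
    started_at: float  # timestamp in seconds


def timed_out(
    assignments: Iterable[Assignment],
    now: float,
    timeout_seconds: float,
) -> List[Assignment]:
    """Return the assignments that have been running longer than the timeout."""
    return [
        assignment
        for assignment in assignments
        if now - assignment.started_at > timeout_seconds
    ]
```

A periodic timer could call `timed_out` and re-issue the returned jobs to other workers.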

So far, and taking into account what you've said before, I really can't see any advantage in using an MQS - in this case it just requires duplicating work which needs to be done anyway. And it unnecessarily complicates the setup/environment/maintenance. Also BTW, it requires resources which could be used otherwise. To get a little taste: some message systems state min. requirements of 100 GB HDD space and 1 GB dedicated memory! ....

And well, MQ is not a magic ball. You need to implement all this stuff on the client and the server if you want more than just trivial messages like "job xy timeout" or "client n/a" etc. Just using a timer which wakes up from time to time and checks the status of a client is a simple thing to accomplish as well, so nothing that requires an MQ either.

Wrt. puzzled: Yepp, I can only throw in the ideas/thoughts I get from the published materials. But the fact that on one side you want to use stuff "invented" to decouple things and on the other side want to tie server and client together as tightly as possible is a big contradiction IMHO.

So far it is just my advice/experience, meant to keep you from getting into a never-ending tunnel again. You can of course ignore it.