fedmsg should provide support for schemas
I did some digging and found fedmsg used to have schema support, but it was removed early on. I'd like to propose that we work towards adding them back. Here are some of the benefits I see for supporting schemas:
- We already have schemas, but they aren't explicit. Schemas are defined in the message processors by the series of try/except blocks sharded across many functions. Take, for example, the packages function. Buried in that scary set of blocks is the schema for all pkgdb messages relating to packages.
- When the schema of messages is validated before publishing and after receiving, you can be confident of the message structure and, when the structure changes, it's very clear to both the publisher and receiver. This helps developers avoid situations where they accidentally change their message format and break the world when they deploy the new version.
- There's a one-stop shop for all message formats and their history.
- We can make message processing much cleaner. Rather than huge try/except blocks, an interface is defined (basically the BaseProcessor) that each schema declaration implements. This way you have one class for a message type that has its schema, and how to get at the information without knowing much about its schema (if you so choose).
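To make the "validate before publishing and after receiving" point concrete, here is a minimal sketch using only the standard library. Everything in it is an assumption for illustration: PACKAGE_SCHEMA, SchemaError, validate and publish are hypothetical names, and the fields are not the real pkgdb message format. A library such as jsonschema could replace the hand-rolled check.

```python
# Hypothetical schema for a pkgdb-style message body. The field names are
# made up for illustration and are NOT the real pkgdb format.
PACKAGE_SCHEMA = {
    "package_name": str,
    "agent": str,
    "branches": list,
}


class SchemaError(ValueError):
    """Raised when a message body does not match its declared schema."""


def validate(body, schema):
    """Check that every declared field is present and has the declared type."""
    for field, expected_type in schema.items():
        if field not in body:
            raise SchemaError("missing field: %s" % field)
        if not isinstance(body[field], expected_type):
            raise SchemaError(
                "field %s should be %s, got %s"
                % (field, expected_type.__name__, type(body[field]).__name__)
            )
    return body


def publish(body, schema=PACKAGE_SCHEMA):
    """Stand-in for a publish call: validate first, then hand off."""
    validate(body, schema)
    return body  # a real publisher would send this to the bus here
```

With something like this, a publisher that drops a field fails loudly at publish time instead of breaking consumers later, and the schema lives in one place rather than being implied by try/except blocks.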
I realize this is a large architectural change, but I think it'll make working with fedmsg much easier for developers. What does everyone think?
I won't block this. I think I'm ... neutral on it. :) Some thoughts to bring up:
- For history, @abadger was the one who talked me out of using a schema in the first place. It was a while ago. Maybe he can remember the reasoning?
- FWIW, it might be worth polling some non-infra developers who consume fedmsg messages and ask if they think it is a problem in the first place.
- If you decide to go with this, I've heard good things about jsonschema from the pungi developers.
Cool. To expand upon this a little (and maybe this should be an entirely separate issue since schemas are just a part of it), the current interface is causing me pain in a few ways, but they mostly boil down to the fact that you can throw anything and everything onto the message bus (which, although nice, has a lot of downsides).
The activity I was engaged in when I filed this was trying to update the PkgdbProcessor to be aware of namespaces. Namespaces got added to messages pkgdb emits a while ago, but this occurred separately from the logic that handles the contents of the fedmsg.
The Process Today
The current process for updating a fedmsg format is:
- Update the code responsible for publishing the message. The message format has now changed \o/.
- Update the processor in fedmsg_meta_fedora_infrastructure. These implement an API defined in fedmsg.meta.base.BaseProcessor and there is a processor per application (e.g. there's an AnityaProcessor, a BodhiProcessor, etc).
- Hunt down and update any consumers that are directly accessing the fedmsg and aren't using the processor API.
- If you have to modify the processor API in any way, you have to hunt down the users of those APIs across all your applications and update them.
That's a lot of steps to go through and it's easy to forget one (as we all have seen on a regular basis) or miss a user. It also means that, as a consumer, I have to move in lockstep with the publishing application when it switches to the new format.
Potential New Process
What if, instead of having Processors, we have a Message class? To send a fedmsg, you need to provide a subclass of this Message. When you receive a message, you get this same Message object. The Message defines a similar interface to the Processor, but individual messages can add APIs, mark some as unsupported, etc. The Message is where you define your schema. Messages are mapped to topics.
This would lead to something like this when creating a new message:
- Define your Message subclass in your message repository (for the same reason our current formatting happens outside fedmsg and the projects - I don't want to install bodhi to consume its messages).
- In the publishing code, import your new Message, construct it, and hand it to fedmsg.
- In the consuming code, work with that Message object which will automatically be created for you by fedmsg based on the topic.
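The three steps above could look roughly like the sketch below. None of this exists in fedmsg: Message, register, publish, receive, UpdateRequested and the topic string are all hypothetical, and the publish/receive functions only stand in for whatever fedmsg would actually do on the wire.

```python
# Hypothetical sketch: none of these names exist in fedmsg today.
_REGISTRY = {}  # maps topic -> Message subclass


class Message(object):
    topic = None   # subclasses set the topic they are mapped to
    schema = {}    # field name -> expected type

    def __init__(self, body):
        self.validate(body)
        self.body = body

    @classmethod
    def validate(cls, body):
        """Reject bodies with missing or wrongly typed fields."""
        for field, expected in cls.schema.items():
            if not isinstance(body.get(field), expected):
                raise ValueError(
                    "%s: bad or missing field %r" % (cls.__name__, field))

    # Processor-like interface; subclasses override what they support.
    def summary(self):
        raise NotImplementedError


def register(cls):
    """Class decorator that maps a Message subclass to its topic."""
    _REGISTRY[cls.topic] = cls
    return cls


@register
class UpdateRequested(Message):
    topic = "org.example.bodhi.update.request"  # made-up topic
    schema = {"update": str, "agent": str}

    def summary(self):
        return "%s requested update %s" % (
            self.body["agent"], self.body["update"])


def publish(message):
    """Stand-in for handing a Message to fedmsg: returns the wire format."""
    return {"topic": message.topic, "msg": message.body}


def receive(wire):
    """Stand-in for the consumer side: pick the class based on the topic."""
    cls = _REGISTRY[wire["topic"]]
    return cls(wire["msg"])
```

The publisher constructs `UpdateRequested({...})` and hands it to `publish`; the consumer gets an `UpdateRequested` back from `receive` and calls `summary()` without touching the raw dictionary. Validation runs on both sides because it lives in the constructor.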
And when you need to update your message:
- Update your Message subclass in your message repository. Optionally make sure old APIs for messages continue to work so you can transition to your new message format.
- Update your publishing code so it can construct the new Message without violating the schema.
- If you've broken your Message API, you still have to update your consumers 😞.
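As a rough illustration of keeping old APIs working during a transition, here is a sketch loosely modelled on the namespace change mentioned earlier. PackageMessage, its fields, and the "rpms" default are assumptions made up for this example, not the real pkgdb format.

```python
import warnings


class PackageMessage(object):
    """Hypothetical message class; field names are illustrative only."""

    def __init__(self, body):
        if "name" not in body:
            raise ValueError("missing field: name")
        self.body = dict(body)
        # The new format adds a namespace. Defaulting it (to an assumed
        # value) lets messages from not-yet-updated publishers still pass.
        self.body.setdefault("namespace", "rpms")

    @property
    def qualified_name(self):
        """New API: namespace-aware package name."""
        return "%s/%s" % (self.body["namespace"], self.body["name"])

    @property
    def package(self):
        """Old API, kept so existing consumers work during the transition."""
        warnings.warn(
            "package is deprecated, use qualified_name", DeprecationWarning)
        return self.body["name"]
```

Consumers that still read `package` keep working but get a deprecation warning, so the publisher and consumers no longer have to change in lockstep.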
TL;DR
The basic problem is that we use fedmsg to let applications interface with each other, but there's nowhere to define, document, deprecate, etc. the API. It's really easy to get something out there, but it's really hard to maintain, refine, and improve your interface.
I'm not really breaking new ground here; this is a problem people recognized long before me, and they made tools like protocol buffers to handle it. Maybe we could leverage these tools. I haven't done an in-depth investigation to say whether that's worthwhile or not.
Anyway, those are my meandering thoughts. They're certainly in need of refinement, and quite possibly not worth acting upon.
TL;DR for the TL;DR
😢
> We can make message processing much cleaner.
Just a quick note on this: since we still have to support past messages, we may gain on processing newer messages, but we will need to keep the current code in place. So we may end up adding code for the new schemas without removing any/much.
One thing that @abompard mentioned once, and which is totally doable and likely fairly easy, is just to add a version to the message. This way (barring mistakes/bugs) we can easily bump the version in the producer and adjust the behaviour of the consumer accordingly.
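A minimal sketch of the version idea, assuming a top-level "version" key and treating messages without one as the old format. The function names and both message layouts are hypothetical.

```python
def make_message(name, namespace):
    """Producer side: bump 'version' whenever the body format changes."""
    return {"version": 2, "package": {"name": name, "namespace": namespace}}


def package_name(msg):
    """Consumer side: pick the parsing logic based on the declared version."""
    # Assumption: messages published before versioning carry no 'version'
    # key and use the old flat layout.
    version = msg.get("version", 1)
    if version == 1:
        return msg["package_name"]
    if version == 2:
        return "%s/%s" % (msg["package"]["namespace"], msg["package"]["name"])
    raise ValueError("unsupported message version: %r" % version)
```

This keeps the old parsing path around for past messages, which matches the point above that the existing code cannot simply be removed.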
I'm a +1 for having schemas on our messages. The protocol buffers thing looks nice.
I've done some investigation about how this API might get implemented. It was satisfying to see the pyzmq documentation recommend the approach I had in mind, but unfortunately there's a bit of a snag.
The problem is how fedmsg is abstracting the ZMQ underpinnings. ZMQ messages are published by fedmsg. However, the subscriber code (including the bits that would let us manipulate incoming messages prior to handing them to consumers) lives in moksha.
It seems (from my investigation of moksha) that we use it to support various messaging technologies besides ZMQ. However, the fedmsg documentation does not give any indication (that I can find) that this is a focus of fedmsg, and within Fedora Infrastructure we don't use it (as far as I know).
This leads me to ask a few questions:
- What's the focus of fedmsg? Do we care about supporting many messaging technologies?
- Does moksha offer anything else?
fedmsg now has code that allows using it with a message bus other than zmq; @ralphbean recently added some changes for this in #380 and #387, among others.
I guess what I'm driving at is: what does fedmsg want to be? Does it aim to be a high-level messaging library very much like kombu? Does it want to focus exclusively on ZMQ and make that experience very easy and clear? Something else?
I've used fedmsg quite a bit now and I've read the docs, but I don't know what fedmsg's goal is, exactly.