Particular/ServiceControl

Add forwarding of heartbeat & startup messages

marcselis opened this issue · 8 comments

We are a large company with a central operations team that monitors NServiceBus through a central ServicePulse/ServiceControl installation. However, some development teams prefer to do their own support, so we have set up a separate ServicePulse/ServiceControl installation for them. We configured their ServiceControl instance to forward all messages to the error & audit queues of the central ServiceControl so we can keep an overview of everything that is going on in the system.

This works fine, except for the heartbeat and startup messages. In the central ServicePulse instance, none of the instances of the endpoints monitored by the decentralized ServiceControl show up until a message has been sent by those endpoints. And after that, they show up as 'inactive' in the central ServicePulse.

If the decentralized ServiceControl instances forwarded the startup & heartbeat messages they receive to the central ServiceControl, these issues would be solved.

Hi @marcselis

Thanks a lot for sharing your usage of ServiceControl with us.
I have to be honest, I have never thought of using ServiceControl the way you describe. One concern I have is that, since you are in essence using multiple ServiceControl instances to monitor the same error messages, you can retry the same error message multiple times, which can cause duplication of data in your own system. Is this something you are aware of?

Regarding your requirements, the way we enable this scenario is through the ServiceControl events. Have you considered using these instead?

Hi @johnsimons

We have tested this setup and in essence everything works as expected. The successful retry of a message from one ServiceControl instance is also replicated to the other, so the error will disappear there as well. But there is indeed a short window in which the same error can be retried from different ServiceControl instances, if the message reporting the successful retry has not yet been received by the other.

We are aware of this risk and can live with it, as in our situation the support of the decentralized system is completely taken over by the development team. The operations team will never retry messages from those systems. Their role will be limited to monitoring that the errors get resolved and alerting the development team if they don't.

Thanks for the suggestion to look into ServiceControl events to solve our problem. Is it possible to let one ServiceControl instance directly subscribe to the events of another? Or do we need a bridge component between them?

The ServiceControl events are inadequate for our goals, as events are only fired when heartbeats stop and start, not for every heartbeat. We solved this issue by crafting a special version of the heartbeat plugin that can send messages to multiple ServiceControl instances.
This way both the team-specific & central ServiceControl instances receive the heartbeats.

This is not an ideal solution, as now both the endpoints and the team-specific ServiceControl have knowledge of the central ServiceControl, but at least it works.
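For readers following along, the fan-out idea described above can be sketched roughly as follows. This is only a minimal illustration in Python and not the actual NServiceBus heartbeat plugin: the class name, the queue names, and the in-memory transport are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical in-memory transport: queue name -> list of delivered messages.
# A real setup would use the actual message transport instead.
Transport = Dict[str, List[dict]]


@dataclass
class FanOutHeartbeatSender:
    """Sends every heartbeat to each configured ServiceControl queue."""

    endpoint_name: str
    target_queues: List[str]
    transport: Transport = field(default_factory=dict)

    def send_heartbeat(self, timestamp: float) -> None:
        message = {
            "type": "heartbeat",
            "endpoint": self.endpoint_name,
            "at": timestamp,
        }
        for queue in self.target_queues:
            # Each instance receives its own copy of the heartbeat.
            self.transport.setdefault(queue, []).append(dict(message))


# Assumed queue names, for illustration only.
sender = FanOutHeartbeatSender(
    endpoint_name="Sales",
    target_queues=["team.servicecontrol", "central.servicecontrol"],
)
sender.send_heartbeat(timestamp=1.0)
```

The drawback mentioned above is visible here: the endpoint itself has to carry the list of every ServiceControl queue it reports to.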

In an ideal solution we would have only one central ServiceControl instance and we would have a way to organize endpoints in some kind of hierarchy and assign teams (active directory groups) to a certain level so they can only monitor and manage messages from that and any lower level.

@marcselis really cool setup you guys are running.

So I think your last comment really highlights the requirement:

In an ideal solution we would have only one central ServiceControl instance and we would have a way to organize endpoints in some kind of hierarchy and assign teams (active directory groups) to a certain level so they can only monitor and manage messages from that and any lower level.

@Particular/servicecontrol-maintainers how do we manage this, raise it in plat dev?

@marcselis @johnsimons one possible workaround while we triage this feature could be to use the NServiceBus forwarding feature. If I recall correctly, that feature works only on the main input queue, which means that all the incoming heartbeats, custom checks and startup messages will be forwarded to the central ServiceControl as well. The issue this can introduce is that messages such as archive group or retry failed message are also forwarded to the central ServiceControl.

how do we manage this, raise it in plat dev?

Yes, it is another requirement that can be listed in the platform componentization effort, e.g. when talking about scale out/HA/bounded contexts.

This issue has been raised internally in plat dev.

@marcselis thanks a lot for starting this discussion. The way we manage these kinds of suggestions is by raising them internally in a private repo, where we prioritise and manage them from then on.
So with that in mind, I'll close this suggestion for now, and once we are ready to act on it we will reopen it. This does not mean we will not be working on it.

It seems this feature request was lost, so I am reopening it.