LabVIEW-Open-Source/LV-MQTT-Broker

Rerunning the same client sequentially can cause error 10

russellgsystems opened this issue · 17 comments

I have a client VI similar to the example (but it only waits for a specific topic and then stops and closes the client). If I run the VI, let it stop, and then run it again, I sometimes get error 10 (see screenshot below). I am not stopping the server at all. I am wondering if you have seen this issue and have ideas on how to address it.

image

Can you upload example code reproducing this issue? I'd like to check if the error is transmitted with the suback packet.

I assume you subscribe to the same topic again, not a new topic, and that this re-subscription is not happening in a fast loop.
If it only fails sometimes, it might be that the subscription engine does not clear the list of stale connections correctly, or in a timely manner.
In your connect packet, do you assign the same name to the client? What are the connect flags?

Attached is some example code showing the problem. To reproduce it, run Server.vi and Process.vi, then run startprocess.vi. It will run once and stop. Run it again, and keep repeating until you see the error; typically it only takes a couple of runs of startprocess.vi before error 10 is returned. The reason startprocess.vi only runs once is that the end goal is to include it as a TestStand step that sends a request such as "start process", waits for the response in the same step, and then moves on. This might not be the best way to do this, so I am open to your ideas on implementation.
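
For readers following along without the attached VIs, the intended shape of that step, publish a request and then wait for the matching response with a timeout, looks roughly like the sketch below in Python with paho-mqtt 1.x. The broker address, client id and timeout are illustrative placeholders; only the "Start"/"Started" topics come from the example.

```python
# Illustrative sketch only (not the attached LabVIEW code): a single
# "publish a request, then wait for the reply" step using paho-mqtt 1.x.
import threading
import paho.mqtt.client as mqtt

reply_received = threading.Event()

def on_message(client, userdata, msg):
    # Any publication on the "Started" topic counts as the reply.
    if msg.topic == "Started":
        reply_received.set()

client = mqtt.Client(client_id="startprocess-step")
client.on_message = on_message
client.connect("localhost", 1883)
client.loop_start()

client.subscribe("Started")    # subscribe first so the reply is not missed
client.publish("Start", "go")  # then send the request

if not reply_received.wait(timeout=5.0):
    print("Timed out waiting for the 'Started' reply")  # the symptom described here

client.loop_stop()
client.disconnect()
```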

Error 10.zip

@russellgsystems There are two things going on here.
The first is simple. There is an error in the Subscribe Packet class where the "getTopicSubscriptions" method uses a "Simple Error Dialog" instead of "Append Error". This is why you see a popup instead of the error being propagated on the wire!

image

This must be changed; however, it is not the root cause.
If you add a 10 ms delay after publishing the "Start" packet, the subscription to the "Started" message works fine... but this means there is a race condition in the subscription process, and I also noticed another problem when running in a loop:

Running this in a 10 ms loop, I noticed a very real lag in the responsiveness of the server, as if there were a growing array of objects (I suspect the subscriptions list) and the regex match takes more and more time to execute. This is not normal. When a client disconnects, its subscriptions are destroyed, but I suspect the server keeps them in the list, associated with an invalid subscription reference.
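
Conceptually, the suspected leak would look like the sketch below (plain Python for illustration, not the actual LabVIEW server code): if a disconnect only invalidates a client's entries without removing them, every publication scans an ever-growing list and matching slows down with each reconnect.

```python
# Conceptual sketch (plain Python, not the LabVIEW server code): a subscription
# table that keeps growing if entries of disconnected clients are never removed.
subscriptions = []  # entries: {"client_id": str, "filter": str, "alive": bool}

def on_client_disconnect(client_id):
    # Suspected behavior: entries are only invalidated...
    for sub in subscriptions:
        if sub["client_id"] == client_id:
            sub["alive"] = False
    # ...instead of being purged outright, e.g.:
    # subscriptions[:] = [s for s in subscriptions if s["client_id"] != client_id]

def matching_subscribers(topic):
    # Every publication scans the whole, ever-growing list (pattern matching
    # simplified to string equality here), so it slows down with each reconnect.
    return [s for s in subscriptions if s["alive"] and s["filter"] == topic]
```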

The lag grows exponentially and becomes critical in the range of ~150 to 250 iterations. At that point, it cannot keep up with the ~20 ms loop time of the whole code. If your TestStand script calls the code numerous times, it will definitely be a problem.

I knew that a lag would occur in this engine until I implemented a caching and prioritization strategy for pattern matching and frequent publication on the same subscription, but I thought it would only occur when a large number of concurrent connections was allowed. I did not anticipate it would be a problem with a single client that connects and reconnects.
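
For illustration, one common form of such a caching strategy (not necessarily what this broker will use) is to compile each topic filter to a regex once and reuse the compiled pattern for every publication:

```python
# Illustrative sketch of a pattern-matching cache (not the broker's actual code):
# compile each MQTT topic filter ("+" = one level, "#" = remaining levels) to a
# regex once, then reuse the compiled pattern for every publication.
# Spec edge cases are deliberately omitted.
import re
from functools import lru_cache

@lru_cache(maxsize=1024)
def compile_filter(topic_filter: str) -> re.Pattern:
    parts = []
    for level in topic_filter.split("/"):
        if level == "+":
            parts.append(r"[^/]+")   # single-level wildcard
        elif level == "#":
            parts.append(r".*")      # multi-level wildcard (must be last)
            break
        else:
            parts.append(re.escape(level))
    return re.compile("^" + "/".join(parts) + "$")

def matches(topic_filter: str, topic: str) -> bool:
    return compile_filter(topic_filter).match(topic) is not None

assert matches("sensors/+/temp", "sensors/oven1/temp")
assert matches("Started", "Started")
```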

My gut feeling is that I partially implemented the purge of the subscriptions but did not go all the way, because MQTT supports client reconnection without having to resubscribe (through the persistent session flag). This feature of MQTT is not yet implemented in this server and would likely change the way this purge is done. It's either that, or the stored-sessions handling is buggy.

Anyhow, I'll investigate further... The problem is in the server core engine. Unsubscribing does not seem to have an effect.

In the meantime, here's a possible workaround:

  • If you are not calling it 100 times in a row, you can simply add a 10 ms magic delay to allow the subscription process to finish before you call the Start function (see the sketch after this list). This will work, but of course it means the problem is still there and needs to be solved.
  • The only way I have found to remove the lag, so far, is to restart the server.
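
A minimal sketch of the more robust version of that magic delay: publish only once the broker has acknowledged the subscription. paho-mqtt 1.x is used here as a stand-in for the LabVIEW client; broker address, client id and timeout are placeholders.

```python
# Sketch: publish "Start" only once the "Started" subscription is acknowledged,
# instead of relying on a fixed delay.
import threading
import paho.mqtt.client as mqtt

suback_received = threading.Event()

def on_subscribe(client, userdata, mid, granted_qos):
    suback_received.set()  # broker has acknowledged the subscription

client = mqtt.Client(client_id="start-step")
client.on_subscribe = on_subscribe
client.connect("localhost", 1883)
client.loop_start()

client.subscribe("Started")
if suback_received.wait(timeout=1.0):  # replaces the 10 ms "magic delay"
    client.publish("Start", "go")
else:
    print("Subscription was never acknowledged")

client.loop_stop()
client.disconnect()
```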

@russellgsystems I found that the culprit is the client process, which does not stop correctly, leaving clients n = 2, 3, ... running stale. They wait forever to be stopped. The lag is in the management of clones, because their number increases linearly. I didn't have time to find the fix today, but it's on its way. Stay tuned.

@francois-normandin Thank you so much for looking into this issue so quickly, and I am so glad you figured out the source of the problem.

@francois-normandin Thank you again for your help on this issue. I put in the 10 ms delay as shown in the picture below, but I am still seeing error 10. I tried increasing it all the way to 10 seconds, but I still get the error. Should I put it in a different place?
image

For a temporary fix, you need to replace the error dialog node in the Subscribe.lvclass:getTopicSubscription.vi

as shown here: #151 (comment)

Delete the "Error Dialog" altogether. This error is, at the moment, ignored... but the Error Dialog is not supposed to be there. It is a mistake.

And sorry I have not yet built a new package... I'm pretty busy... :-(

Thank you, I don't get the popup now, but sometimes I still don't get a message back and my event structure times out. I think your fix will solve this problem. I appreciate all of your help.

@francois-normandin no rush on the fix for the issue where the messages sometimes aren't returned. I have a workaround. Since the issue is with the subscription of the topics, I made a separate functional global where I subscribe to all topics instead of subscribing to just the one I want to see, which avoids the multiple-subscription problem. This works for now. Thanks again for your help.
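
For reference, a rough text equivalent of that workaround (paho-mqtt 1.x as a stand-in; names are placeholders): one long-lived client subscribes to a wildcard filter once and caches the latest payload per topic, so the per-run subscribe/unsubscribe cycle disappears.

```python
# Rough equivalent of the workaround: a single long-lived client subscribes to
# everything once, and the latest payload per topic is cached, much like a
# functional global variable.
import paho.mqtt.client as mqtt

latest = {}  # topic -> last payload seen

def on_message(client, userdata, msg):
    latest[msg.topic] = msg.payload.decode()

client = mqtt.Client(client_id="all-topics-cache")
client.on_message = on_message
client.connect("localhost", 1883)
client.loop_start()
client.subscribe("#")  # one wildcard subscription instead of per-run subscribes

# Callers then just read the cache instead of subscribing and unsubscribing:
# started_payload = latest.get("Started")
```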

Great that you found a workaround for now. I expect to have a package built this weekend. I'm sure I can count on you as an early adopter ;-)

@francois-normandin Thank you so much! You can definitely count on me to be an early adopter. We are using this on an active project. I noticed there was a vipc and vipb on this site, but I didn't see a vip file. When you build the package, can you please distribute the vip file under packages like in the screenshot below?

image

The GitHub Packages section is for npm, Docker and the like.
I've kept the VIP files as part of the "Assets" associated with releases (attached to a tag).

image

1.0.2's package is there too.
I will likely distribute the future releases through VIPM Community as well.
Stay tuned!

image

@francois-normandin great that makes sense and I found the vip. Distributing it through VIPM Community would be great too. Again thank you for all your help!

@russellgsystems
Version 1.0.4 is released and fixes the reported bugs.
Your example brought to light a few performance improvements that were needed to operate in a high-speed connect-disconnect mode, which this version provides.

For the record, I've created a benchmark example code in the source.
The example recreates the demonstration you showed, but with some timing hooks.
I ran the following actions:

  • Create client
  • Connect to server (clean session flag = true)
  • Subscribe to "Started" topic
  • Publish "Start" topic
  • Wait for suback response
  • Unsubscribe from "Started" topic
  • Disconnect from server
  • Destroy client

These actions were run in a loop which achieved a 20 ms period (server on the same host machine).
I experienced a single reply failure in > 5000 messages. Based on the loop-time graph, it must have timed out (100 ms) waiting for the suback response. I assume that increasing the timeout would prevent this type of failure until such time as QoS = 1 and 2 are implemented.
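
For reference, the benchmarked sequence maps to something like the sketch below outside LabVIEW (paho-mqtt 1.x as a stand-in; broker address, payloads, timeout and iteration count are placeholders; it mirrors the step list above, not the actual benchmark VI).

```python
# Sketch mirroring the benchmarked steps above.
import time
import threading
import paho.mqtt.client as mqtt

def one_iteration() -> float:
    reply = threading.Event()

    def on_message(client, userdata, msg):
        if msg.topic == "Started":
            reply.set()

    t0 = time.perf_counter()
    client = mqtt.Client(client_id="bench", clean_session=True)  # create client, clean session flag = true
    client.on_message = on_message
    client.connect("localhost", 1883)   # connect to server
    client.loop_start()
    client.subscribe("Started")         # subscribe to "Started" topic
    client.publish("Start", "go")       # publish "Start" topic
    reply.wait(timeout=0.1)             # wait for the response (100 ms timeout)
    client.unsubscribe("Started")       # unsubscribe from "Started" topic
    client.disconnect()                 # disconnect from server
    client.loop_stop()                  # destroy client
    return (time.perf_counter() - t0) * 1000.0

loop_times_ms = [one_iteration() for _ in range(100)]
print(f"mean loop time: {sum(loop_times_ms) / len(loop_times_ms):.1f} ms")
```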

@francois-normandin thank you so much! I will test it and incorporate the new changes.