OokTech/TW5-Bob

Message Queue Timeout can cause Redundant Save Of Doom Loop

Closed this issue · 5 comments

In SharedFunctions.js, in Shared.checkMessageQueue, there is a 500ms timeout to resend any message that hasn't been acknowledged yet. Some messages (e.g. saving imported images) may easily take longer than 500ms to be acknowledged, esp. when the server is heavily loaded.

When this happens the timeout triggers a retransmission of the same slow-to-process message, which in turn fails to complete in 500ms, triggering yet another retransmission and further hammering the poor server. Hilarity ensues.

In my specific use case, raising the timeout to several seconds (allowing any reasonable sized upload to complete) "fixes" the problem. A better solution would be...better. Unfortunately, all of the obvious ideas seem to have their own drawbacks and I'm not having any clever ideas at the moment.

Also, in WebSockets/WebsocketAdaptor.js lines 191-196 there is a note about it getting stuck in an "infinite saving loop" that may actually be a result of this issue instead.

Thank you for the report! I have been trying to track down the cases where this happens.

In theory the server is supposed to stop retrying as soon as it receives an acknowledgement for any one of its attempts to send a message, but that may not be working. That wouldn't prevent all unneeded retransmissions, but it should at least keep it from being an infinite loop.
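A minimal sketch of that "stop on any acknowledgement" behaviour, with a retry cap added as an extra safeguard. All names here (`createQueue`, `enqueue`, `acknowledge`, `collectResends`, `MAX_ATTEMPTS`) are hypothetical and are not taken from Bob's SharedFunctions.js; the cap of 5 is an assumed value.

```javascript
// Hypothetical sketch; none of these names come from Bob's code.
const MAX_ATTEMPTS = 5; // assumed cap on total sends per message

function createQueue() {
  // message id -> { message, attempts, lastSent }
  return new Map();
}

function enqueue(queue, message, now) {
  queue.set(message.id, { message, attempts: 1, lastSent: now });
}

// An acknowledgement for an id clears the entry, no matter which
// of the (re)transmissions it is answering.
function acknowledge(queue, id) {
  return queue.delete(id);
}

// Returns the messages that are due for a resend. Entries that have
// already been sent MAX_ATTEMPTS times are dropped instead of resent.
function collectResends(queue, now, timeoutMs) {
  const due = [];
  for (const [id, entry] of queue) {
    if (now - entry.lastSent < timeoutMs) continue;
    if (entry.attempts >= MAX_ATTEMPTS) {
      queue.delete(id); // give up rather than loop forever
      continue;
    }
    entry.attempts += 1;
    entry.lastSent = now;
    due.push(entry.message);
  }
  return due;
}
```

With this shape, a late acknowledgement for the first attempt still removes the message from the queue, so a slow save is not resent again after it finally completes.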

One of the simpler-to-implement fixes would be to test the round-trip ping time and then adjust the time until retransmission based on that and the message size.

My hope is that between those two improvements this problem will at least be very rare.
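The ping-based adjustment could look roughly like the sketch below. The function name `retransmitTimeout` and every constant except the existing 500ms floor are assumptions chosen for illustration, not values from Bob.

```javascript
// Hypothetical helper; the constants are illustrative assumptions.
const MIN_TIMEOUT_MS = 500;       // the current fixed timeout, used as a floor
const MAX_TIMEOUT_MS = 60000;     // assumed upper bound
const RTT_MULTIPLIER = 4;         // assumed safety factor on the ping time
const ASSUMED_BYTES_PER_MS = 100; // assumed throughput used to budget for size

// Scale the resend timeout with the measured round trip and message size,
// clamped so it never drops below the floor or grows without bound.
function retransmitTimeout(rttMs, messageBytes) {
  const transferMs = messageBytes / ASSUMED_BYTES_PER_MS;
  const estimate = RTT_MULTIPLIER * rttMs + transferMs;
  return Math.min(MAX_TIMEOUT_MS, Math.max(MIN_TIMEOUT_MS, Math.round(estimate)));
}
```

Under these assumed constants a small message on a fast link keeps the 500ms timeout, while a 2 MB image with a 50ms ping would get roughly 20 seconds before the first resend.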

Another potential improvement would be to set a maximum message size and split anything larger than that size into chunks and send them individually so instead of resending the entire message each time it could send smaller messages. I am not certain that this actually solves the problem but it may be worth testing.
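A rough sketch of the chunking idea, assuming the message has already been serialized to a string. `splitIntoChunks` and `reassemble` are hypothetical names, and the chunk shape (`id`, `index`, `total`, `data`) is an assumption rather than an existing Bob message format.

```javascript
// Hypothetical sketch: split a serialized message into fixed-size chunks
// so that only a missing chunk has to be resent, not the whole message.
function splitIntoChunks(id, payload, maxChunkSize) {
  const total = Math.max(1, Math.ceil(payload.length / maxChunkSize));
  const chunks = [];
  for (let index = 0; index < total; index++) {
    chunks.push({
      id,
      index,
      total,
      data: payload.slice(index * maxChunkSize, (index + 1) * maxChunkSize),
    });
  }
  return chunks;
}

// Rebuild the payload from chunks received in any order.
// Returns null while at least one chunk is still missing.
function reassemble(chunks) {
  if (chunks.length === 0) return null;
  const parts = new Array(chunks[0].total).fill(null);
  for (const chunk of chunks) parts[chunk.index] = chunk.data;
  return parts.includes(null) ? null : parts.join('');
}
```

Each chunk would be acknowledged on its own, so a timeout would cost one chunk of bandwidth instead of the full upload.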

These are mostly notes for me.

Another potential cause of the delay instead of, or in addition to, network speed is checking for changes between the incoming message and an existing tiddler, if any.
To keep this from being a problem a hash of the tiddler content is attached to the transmitted messages and if there is an existing tiddler the hashes are compared instead of testing the content directly.
This may speed up the process.
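For illustration, the hash comparison might be sketched like this. `hashTiddler` and `needsSave` are hypothetical names, and SHA-256 over the sorted fields is an assumption; the thread does not say which hash Bob actually attaches.

```javascript
const crypto = require('crypto');

// Hypothetical: hash the tiddler fields in a stable key order so that
// two tiddlers with the same content always produce the same hash.
function hashTiddler(fields) {
  const stable = JSON.stringify(
    Object.keys(fields).sort().map((key) => [key, fields[key]])
  );
  return crypto.createHash('sha256').update(stable).digest('hex');
}

// Compare hashes instead of field-by-field content. existingHash is
// undefined when there is no existing tiddler with that title.
function needsSave(incomingHash, existingHash) {
  return existingHash === undefined || incomingHash !== existingHash;
}
```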

Another option this suggests: before sending a message that is over some size threshold, send a small message containing the hash of the larger message, and have the receiver respond saying whether it needs the update or not. I think that regardless of the current bug this could reduce the network throughput needed and help out with slower connections.
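A sketch of that hash-first exchange follows. The message types (`save`, `offer`, `offerReply`), the function names, and the 64 KB threshold are all hypothetical; nothing here reflects an existing Bob message type.

```javascript
// Hypothetical message shapes and threshold, for illustration only.
const OFFER_THRESHOLD_BYTES = 64 * 1024; // assumed cut-off

// Sender side: small payloads go out directly, large ones are first
// announced with only their title and hash.
function buildOutgoing(title, payload, hash) {
  if (payload.length <= OFFER_THRESHOLD_BYTES) {
    return { type: 'save', title, hash, payload };
  }
  return { type: 'offer', title, hash };
}

// Receiver side: ask for the full payload only when the stored hash
// for that title is missing or different.
function answerOffer(offer, knownHashes) {
  const wanted = knownHashes.get(offer.title) !== offer.hash;
  return { type: 'offerReply', title: offer.title, wanted };
}
```

The sender would transmit the full payload only after receiving a reply with `wanted: true`, so an unchanged image is never uploaded a second time.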

This has been happening to me repeatedly recently, with just the FileSystemPaths setting and Bob. When I check the box to display the Wiki tab in the sidebar, sometimes it will loop checking and unchecking infinitely, or when loading the wiki the tab will be there but the checkbox will be unchecked. Really not sure what's causing this... It might be specific to having the FileSystemPaths tiddler set up and thus *.tid files being stored in sub-folders, or a timing issue related to the bug in this thread.

Also, not getting the same behavior from the last Bob.exe.

Testing, I realized something. The 'Saved Tiddler' log messages were being doubled up. I have all of my personal data saved on a non-OS local logical partition drive (D:), and my OS and VSCode (bash terminal) are on C:. The errors disappeared when Bob.exe was run FROM the D: drive. On a hunch, I turned off the File System Watcher by setting "disableFileWatchers" to "yes" in settings.json, and re-launched through my bash terminal/VSCode.

The issues have not (yet) reoccurred. This might be related to the "networked drives on windows" issue where you recommend disabling that setting: #116 (comment)

Bob switched to use the core syncer, so the component causing this problem no longer exists.