catch+release errors in pruning loop

Question

catch+release errors in pruning loop

warner opened this issue 5 years ago · 0 comments

I noticed just now that the mailbox server ran into some sort of database problem about twelve hours ago. The symptom was that my munin graphs (including the wormhole_active one that just counts active nameplates) stopped producing data, and there were an awful lot of nameplates in use (about 28).

There were two kinds of errors. The first was a sqlite3.IntegrityError: FOREIGN KEY constraint failed. This happened inside Mailbox.close, in response to the websocket connection being closed. I believe this caused the in-memory "is this connection still open?" flag to remain set, causing subsequent attempts to prune the nameplate/mailboxes to skip those resources. This error happened three times.

The second error was after the prune loop deleted a mailbox and then tried to build a summary for writing into the usage DB. This encountered a mailbox for which there were no mailbox_sides entries, causing the started = times[0] in _summarize_mailbox to throw an IndexError. I think this killed the periodic task that does pruning, and after that point nameplates will just keep accumulating and never get removed.

Fortunately the error didn't take down the service: it just resulted in longer wormhole codes than usual for about 12 hours (two digits instead of just one). But the failed pruning loop meant the "last updated" timestamp in the DB wasn't being updated, so the munin plugins assumed the DB was useless and didn't report any data. In this case the relay.sqlite was actually still being updated (and it would have been nice to see the number of nameplates growing).

I need to study the IntegrityError and see if it was just random disk corruption (eek!) or caused by some normal operation (also eek). I'm guessing that the close() managed to delete some parts of the mailbox record (like the mailbox_sides entries) but not the rest, and the resulting incoherent state is what killed the prune loop.

The fixes I need to make are:

catch errors during the pruning loop, log them, and then pass, so we don't let those errors kill the loop
use a proper transaction around all operations, or at least the ones that run at close() and prune(), so we don't get inconsistent states like mailbox_sides being removed without also removing the mailbox entry
consider updating the "last updated" timestamp on every DB modification, not just inside the pruning loop
consider having the munin plugins continue to report results even if the "last updated" timestamp is old, perhaps only having one graph stop graphing when that happens