mhthies/QAQAbot

Synchronous game gets stalled under certain conditions

mhthies opened this issue · 3 comments

Report:

Beim ersten Mal machte der Bot nichts nachdem alle die erste Antwort geschrieben hatten und beim zweiten Mal nach der zweiten Antwort. Wir konnten also nicht weiterspielen obwohl alle Antworten gegeben waren.

Den status haben wir gefragt, er hat auf niemanden gewartet. Wir probieren gerade den Asynchronmodus, das scheint bisher besser zu funktionieren.

Und das passiert nur im synchronmodus

Here's the relevant extract from the log:

2020-05-10 23:36:45,189 - qaqa_bot.bot - INFO - Got message from <snip>: <snip>
2020-05-10 23:36:45,195 - qaqa_bot.bot - INFO - Got message from <snip>: <snip>
2020-05-10 23:36:45,198 - qaqa_bot.bot - INFO - Got message from <snip>: <snip>
2020-05-10 23:36:45,209 - qaqa_bot.game - INFO - Adding entry by user 3 to sheet 19 in game 6.
2020-05-10 23:36:45,209 - qaqa_bot.game - INFO - Adding entry by user 5 to sheet 18 in game 6.
2020-05-10 23:36:45,209 - qaqa_bot.game - INFO - Adding entry by user 6 to sheet 17 in game 6.
2020-05-10 23:36:45,249 - qaqa_bot.game - INFO - Checking if new round in synchronous game 6 should be triggered.
2020-05-10 23:36:45,262 - qaqa_bot.game - INFO - Checking if new round in synchronous game 6 should be triggered.
2020-05-10 23:36:45,273 - qaqa_bot.game - INFO - Checking if new round in synchronous game 6 should be triggered.

It seems, we have issues with concurrency here (Yeah …!): Probably, the SQLAlchemy session is not synced with the database after adding the entry and before checking if the game is finished. Thus, in all three threads, the next round is not triggered.

We should also check the MySQL ISOLATIONLEVEL setting. Probably, the transactions aren't even serialized by the database.

Okay, the problem is actually the SQL transaction isolation level: By default, MySQL uses REPEATABLE READS, which results in a write-after-read race condition, since each concurrent transaction's SELECT queries only see the state of the database from the beginning of the transaction. This means for concurrent execution of submit_text(), that none of the threads starts the new round as visible in the log above.

There are two ways to solve the problem:

a) using SERIALIZABLE transaction isolation

This means, SELECT queries lock the relevant tables for UPDATEs and INSERTS until the end of the transaction, so there is no possibility of a write-after-read race condition. Instead, concurrent actions will result in Deadlock/Rollback Exceptions. Thus, we need to deal with in this case and retry the actions which failed – which is a bit hacky to be database-agnostic and make sure, no Telegram Messages are sent twice. Additionally, we lose some concurrency.

b) Split transactions and do explicit database update when triggering new round

First, we need to deal with the write-after-read race condition: We can avoid it by committing the transaction after adding the new entry and start a new transaction to check and update the resulting game state. The second is to ensure, that a new round is not triggered twice, i.e. the entry request Telegram messages are not sent twice. This can be achieved by doing an explicit UPDATE to the database, which can only happen one time and failes when executing a second time concurrently. We could store the current round in the games table and update it, when the Telegram messages for the next round are sent.

However, we still have to deal with Deadlock/Rollback Exceptions due to concurrent UPDATEs. In most cases, they can be silently ignored, since the concurrent try to do the same change. However, in rare cases (such as configuring the game), we will actually want to serialize them by retrying the action method.

Conclusion

Currently, I'm in favor for a), since it will resolve concurrency issues more consistently. However it requires to double-check the action function's command order to avoid doubled Telegram Messages.

To make sure, no Telegram messages are sent repeatedly, we should probably switch back to returning the messages instead of sending them in the GamerServer internally. We currently don't receive the Exceptions which might happen while sending anyway, due to the queued sending. However, if we go for a), we would need to split the GameServer action functions which use a split transaction into two functions to allow retrying of both transactions individually.

Another possibility solution would be flushing() the session before sending messages to raise any Deadlock/Rockback Exceptions, that might have occured, every time before sending of messages.