BrighterCommand/Brighter

[Bug] Outbox sweeper is prohibitively expensive to use with a Dynamo DB outbox

dhickie opened this issue · 1 comment

Describe the bug

In both v9 and v10, each time the outbox sweeper runs it checks the outbox for outstanding messages past a certain age. The DynamoDB implementation does this with a Query operation on the Outstanding index, using a key condition expression that targets a particular shard for a given topic and restricts on the message's created time (so that only messages past a certain age are retrieved). It then applies a filter expression to exclude messages that have a dispatch time (and have therefore already been dispatched).

The issue here is that a filter expression is applied only after the items have been read. The filtering happens server side, so only the matching items are sent over the wire to the client, but the user is still charged for every item read. In this instance, that means every sweeper run reads the vast majority of the messages in the table.

By way of example, consider an outbox where:

  • 2000 messages are published per minute
  • The sweeper runs every 5 seconds
  • The archiver archives dispatched messages older than 1 hour

In this example, the outbox holds roughly 120k messages at any time (2000 per minute × 60 minutes), and the sweeper would read almost all of them every 5 seconds, even if none of them were actually outstanding. This results in a level of cost that makes it impractical to use a sweeper with a DynamoDB outbox.
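As a rough check on the figures in this example, the sketch below works through the arithmetic. The message rate, sweep interval and retention period come from the example; the 1 KB average item size is an assumption, and the estimate ignores pagination, sharding and the few messages newer than the sweeper's age threshold. It relies on the DynamoDB rule that a Query is charged for the total size of the items read (rounded up to the next 4 KB) before any filter expression is applied, with eventually consistent reads costing half a read unit per 4 KB.

```python
import math

# Figures taken from the example in the issue.
MESSAGES_PER_MINUTE = 2000
SWEEP_INTERVAL_SECONDS = 5
RETENTION_MINUTES = 60  # dispatched messages are archived after 1 hour

# Assumption (not from the issue): average size of an item in the index.
ASSUMED_ITEM_SIZE_BYTES = 1024


def items_read_per_sweep() -> int:
    """Approximate number of items the key condition matches on each sweep."""
    return MESSAGES_PER_MINUTE * RETENTION_MINUTES


def sweeps_per_hour() -> int:
    return 3600 // SWEEP_INTERVAL_SECONDS


def read_units_per_sweep(item_size_bytes: int = ASSUMED_ITEM_SIZE_BYTES) -> float:
    """Estimated read units for one sweep, charged on items read, not items returned."""
    total_bytes = items_read_per_sweep() * item_size_bytes
    four_kb_blocks = math.ceil(total_bytes / 4096)
    return four_kb_blocks / 2  # eventually consistent reads cost half


if __name__ == "__main__":
    print("items read per sweep:", items_read_per_sweep())
    print("items read per hour:", items_read_per_sweep() * sweeps_per_hour())
    print("estimated read units per sweep:", read_units_per_sweep())
```

Under these assumptions the sweeper reads 120,000 items per sweep and 86.4 million items per hour, regardless of how many messages are actually outstanding.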

Possible fix

What we really want for the Outstanding index is a sparse index: one in which every message found in the index is yet to be dispatched, removing the need for a filter expression at all. One way to do this would be to make the sort key on the index a simple boolean indicating that the message is yet to be dispatched, but this would remove the ability to query only for messages past a certain age. Instead, we should add a new numerical attribute called OutstandingCreatedTime to each message, populated only while the message is yet to be dispatched. This attribute would be the sort key on the Outstanding index, so messages without an OutstandingCreatedTime attribute would not be part of the index.

The overall flow for a message would therefore be:

  1. The message is added to the outbox, with both the CreatedTime and OutstandingCreatedTime attributes populated
  2. The sweeper runs and queries the Outstanding index, only retrieving messages for which OutstandingCreatedTime is populated
  3. The sweeper publishes the outstanding message, and then populates the DeliveryTime attribute and removes the OutstandingCreatedTime attribute from the record, removing it from the index
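The flow above can be illustrated with a small in-memory model. This is only a toy stand-in for the table and its sparse index, not Brighter's implementation: the class names, the `topic_shard` field and the integer timestamps are all hypothetical. The point it shows is that an item drops out of the index as soon as its sort key attribute is removed, so a query over the index never touches dispatched messages.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class OutboxItem:
    message_id: str
    topic_shard: str  # hypothetical partition key for the index
    created_time: int
    outstanding_created_time: Optional[int] = None  # sparse index sort key
    delivery_time: Optional[int] = None


class ToyOutbox:
    """In-memory model of an outbox table with a sparse Outstanding index."""

    def __init__(self) -> None:
        self.items: Dict[str, OutboxItem] = {}

    def add(self, message_id: str, topic_shard: str, now: int) -> None:
        # Step 1: both CreatedTime and OutstandingCreatedTime are populated.
        self.items[message_id] = OutboxItem(
            message_id, topic_shard, created_time=now, outstanding_created_time=now
        )

    def outstanding_index(self) -> List[OutboxItem]:
        # A sparse index only contains items that have the index key attribute.
        return [i for i in self.items.values() if i.outstanding_created_time is not None]

    def query_outstanding(self, topic_shard: str, older_than: int) -> List[OutboxItem]:
        # Step 2: key condition only, no filter expression needed.
        matches = [
            i
            for i in self.outstanding_index()
            if i.topic_shard == topic_shard and i.outstanding_created_time < older_than
        ]
        return sorted(matches, key=lambda i: i.outstanding_created_time)

    def mark_dispatched(self, message_id: str, now: int) -> None:
        # Step 3: set DeliveryTime and remove OutstandingCreatedTime.
        item = self.items[message_id]
        item.delivery_time = now
        item.outstanding_created_time = None


if __name__ == "__main__":
    outbox = ToyOutbox()
    for n in range(3):
        outbox.add(f"msg-{n}", "orders_0", now=100 + n)
    outbox.mark_dispatched("msg-0", now=110)
    outbox.mark_dispatched("msg-1", now=111)
    print([i.message_id for i in outbox.query_outstanding("orders_0", older_than=200)])
```

In this model the number of items a sweep reads equals the number of outstanding messages, however many dispatched messages remain in the table awaiting archival.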

This is a breaking change to the outbox table structure, which we could make wholesale in v10. Users upgrading would therefore need to create a new table, as the key schema of a GSI cannot be edited after creation.

Given that the current implementation makes a DynamoDB outbox effectively unusable, the change also needs to be made to v9. My suggestion here would be to add a boolean property to DynamoDbConfiguration called SparseOutstandingIndex, which defaults to false. If the flag is false, the current implementation continues to be used. If it is true, the approach above is used instead. This would essentially allow users of v9 to "opt in" to the cheaper, more performant table structure if they're doing a new implementation or are willing to perform a migration.
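To make the two code paths concrete, here is a minimal sketch of how the requests might differ depending on the flag, written as plain Python functions that build DynamoDB request parameters rather than as Brighter's actual C# code. The index name and the CreatedTime, OutstandingCreatedTime and DeliveryTime attributes come from the description above; the `TopicShard` and `MessageId` key names and both function names are assumptions made for illustration.

```python
from typing import Any, Dict


def build_outstanding_query(
    table_name: str,
    topic_shard: str,
    cutoff: int,
    sparse_outstanding_index: bool = False,
) -> Dict[str, Any]:
    """Build Query parameters for the Outstanding index."""
    params: Dict[str, Any] = {
        "TableName": table_name,
        "IndexName": "Outstanding",
        "ExpressionAttributeValues": {
            ":ts": {"S": topic_shard},
            ":cutoff": {"N": str(cutoff)},
        },
    }
    if sparse_outstanding_index:
        # Sparse index: everything in the index is outstanding, so no filter.
        params["KeyConditionExpression"] = (
            "TopicShard = :ts AND OutstandingCreatedTime < :cutoff"
        )
    else:
        # Current behaviour: read by created time, then filter out dispatched items.
        params["KeyConditionExpression"] = "TopicShard = :ts AND CreatedTime < :cutoff"
        params["FilterExpression"] = "attribute_not_exists(DeliveryTime)"
    return params


def build_mark_dispatched_update(
    table_name: str,
    message_id: str,
    now: int,
    sparse_outstanding_index: bool = False,
) -> Dict[str, Any]:
    """Build UpdateItem parameters for marking a message as dispatched."""
    expression = "SET DeliveryTime = :now"
    if sparse_outstanding_index:
        # Removing the sort key attribute drops the item from the sparse index.
        expression += " REMOVE OutstandingCreatedTime"
    return {
        "TableName": table_name,
        "Key": {"MessageId": {"S": message_id}},  # hypothetical table key
        "UpdateExpression": expression,
        "ExpressionAttributeValues": {":now": {"N": str(now)}},
    }
```

Keeping the flag as a default-false parameter mirrors the opt-in behaviour proposed for v9: existing tables keep working unchanged, and only tables created with the new index key schema would set it to true.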

@dhickie This went into V10 as well, yes? Just asking so that I can close.