[Bug] Outbox sweeper is prohibitively expensive to use with a DynamoDB outbox
dhickie opened this issue · 1 comments
Describe the bug
In both v9 and v10, when the outbox sweeper runs it checks the outbox for any outstanding messages past a certain age. For the DynamoDB implementation, it does this by performing a query operation on the Outstanding index, with a key condition expression that targets a particular shard for a given topic and the created time of the message (so that only messages past a certain age are retrieved). It then applies a filter expression to exclude messages that have a dispatch time (and have therefore already been dispatched).
The issue here is that filter expressions are applied server side after the items have been read from the index, and only then is the filtered data sent over the wire to the client. Even though the data is filtered server side, the user is still charged for reading every item that matched the key condition. In this instance, this means that every time the sweeper runs, it reads the vast majority of the messages in the table.
By way of example, consider an outbox where:
- 2000 messages are published per minute
- The sweeper runs every 5 seconds
- The archiver archives dispatched messages older than 1 hour
In this example, each sweeper run would read roughly 120k messages (2000 per minute, retained for 60 minutes), and it would do so every 5 seconds, even if none of those messages were actually outstanding. This results in a level of cost that makes it impractical to use a sweeper with a DynamoDB outbox.
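To put rough numbers on the example above, here is a back-of-the-envelope sketch. The message rate, retention, and sweep interval come from the example; the 1 KB average item size is purely an assumption for illustration, and the read-capacity figure uses the standard rate for eventually consistent reads (0.5 RCU per 4 KB), ignoring per-page rounding.

```python
# Rough cost arithmetic for the example outbox. Not a billing calculator.
MESSAGES_PER_MINUTE = 2000
RETENTION_MINUTES = 60        # archiver removes dispatched messages after 1 hour
SWEEP_INTERVAL_SECONDS = 5
ASSUMED_ITEM_SIZE_KB = 1      # hypothetical average size of an item in the index

# Every sweep reads (almost) everything still held in the table.
items_read_per_sweep = MESSAGES_PER_MINUTE * RETENTION_MINUTES
sweeps_per_day = 24 * 60 * 60 // SWEEP_INTERVAL_SECONDS
items_read_per_day = items_read_per_sweep * sweeps_per_day

# Eventually consistent reads cost 0.5 RCU per 4 KB of data read.
rcu_per_sweep = items_read_per_sweep * ASSUMED_ITEM_SIZE_KB / 4 * 0.5

print(f"items read per sweep: {items_read_per_sweep:,}")
print(f"sweeps per day:       {sweeps_per_day:,}")
print(f"items read per day:   {items_read_per_day:,}")
print(f"approx RCU per sweep: {rcu_per_sweep:,.0f}")
```

Under these assumptions that is around two billion item reads per day, all of which are billed regardless of how many messages the filter expression lets through.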
Possible fix
What we really want for the Outstanding index is a sparse index: one in which every message found in the index is one that is yet to be dispatched, removing the need for a filter expression at all. One way to do this would be to make the sort key on the index a simple boolean indicating that the message is yet to be dispatched; however, this would remove the ability to query only for messages past a certain age. Instead, we should add a new numerical attribute called OutstandingCreatedTime to each message, which is only populated if the message is yet to be dispatched. This attribute will be the sort key on the Outstanding index, meaning messages that don't have an OutstandingCreatedTime attribute will not be part of the Outstanding index.
The overall flow for a message would therefore be:
- The message is added to the outbox, with both the CreatedTime and OutstandingCreatedTime attributes populated
- The sweeper runs and queries the Outstanding index, only retrieving messages for which OutstandingCreatedTime is populated
- The sweeper publishes the outstanding message, and then populates the DeliveryTime attribute and removes the OutstandingCreatedTime attribute from the record, removing it from the index
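The flow above can be sketched with a small in-memory model. This is not Brighter code and does not talk to DynamoDB; the class and method names are hypothetical, and the "index" is simply the set of items that still carry the sort-key attribute, which is how a sparse index behaves.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Message:
    message_id: str
    created_time: int
    outstanding_created_time: Optional[int] = None  # sort key of the sparse index
    delivery_time: Optional[int] = None


class InMemoryOutbox:
    """Toy stand-in for the outbox table plus its Outstanding index."""

    def __init__(self) -> None:
        self.items: Dict[str, Message] = {}

    def add(self, message_id: str, now: int) -> None:
        # Step 1: both CreatedTime and OutstandingCreatedTime are populated.
        self.items[message_id] = Message(
            message_id, created_time=now, outstanding_created_time=now
        )

    def outstanding_index(self) -> List[Message]:
        # Only items that have the sort-key attribute appear in the index.
        entries = [
            m for m in self.items.values() if m.outstanding_created_time is not None
        ]
        return sorted(entries, key=lambda m: m.outstanding_created_time)

    def query_outstanding(self, older_than: int) -> List[Message]:
        # Step 2: key condition on the sort key only, no filter expression.
        return [
            m for m in self.outstanding_index()
            if m.outstanding_created_time < older_than
        ]

    def mark_dispatched(self, message_id: str, now: int) -> None:
        # Step 3: set DeliveryTime and remove OutstandingCreatedTime,
        # which drops the item out of the sparse index.
        message = self.items[message_id]
        message.delivery_time = now
        message.outstanding_created_time = None


outbox = InMemoryOutbox()
outbox.add("a", now=100)
outbox.add("b", now=200)
outbox.mark_dispatched("a", now=150)

swept = [m.message_id for m in outbox.query_outstanding(older_than=250)]
print(swept)
```

After "a" is dispatched, the table still holds two items but the index holds only "b", so the sweeper's query touches only work that is actually outstanding.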
This is a breaking change to the outbox table structure, which we could make wholesale in v10. Users upgrading would therefore need to create a new table (or drop and recreate the index), as the key schema of a GSI cannot be changed after creation.
Given that the current implementation makes a DynamoDB outbox effectively unusable, the change also needs to be made to v9. My suggestion here would be to add a boolean property to DynamoDbConfiguration called SparseOutstandingIndex, which defaults to false. If the flag is false, then continue with the current implementation. If the flag is true, then use the approach above. This would essentially allow users of v9 to "opt in" to the cheaper, more performant table structure if they're doing a new implementation or are willing to perform a migration.
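As a rough illustration of how such a flag could switch behaviour, the sketch below builds the parameters for a DynamoDB Query request in both modes. It is Python rather than the library's own language, the configuration class is only a stand-in for DynamoDbConfiguration, and the table name and the TopicShard partition key name are assumptions. CreatedTime, DeliveryTime, and OutstandingCreatedTime are the attributes discussed above.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class DynamoDbConfiguration:
    """Stand-in for the real configuration class; field names are hypothetical."""
    table_name: str = "brighter_outbox"
    outstanding_index_name: str = "Outstanding"
    sparse_outstanding_index: bool = False  # opt-in, defaults to current behaviour


def build_outstanding_query(
    config: DynamoDbConfiguration, topic_shard: str, cutoff: int
) -> Dict[str, Any]:
    """Build Query parameters for outstanding messages created before `cutoff`."""
    sort_key = (
        "OutstandingCreatedTime" if config.sparse_outstanding_index else "CreatedTime"
    )
    params: Dict[str, Any] = {
        "TableName": config.table_name,
        "IndexName": config.outstanding_index_name,
        "KeyConditionExpression": f"TopicShard = :shard AND {sort_key} < :cutoff",
        "ExpressionAttributeValues": {
            ":shard": {"S": topic_shard},
            ":cutoff": {"N": str(cutoff)},
        },
    }
    if not config.sparse_outstanding_index:
        # Current behaviour: dispatched messages are read, billed, then filtered out.
        params["FilterExpression"] = "attribute_not_exists(DeliveryTime)"
    return params


legacy = build_outstanding_query(DynamoDbConfiguration(), "orders_0", 1_700_000_000)
sparse = build_outstanding_query(
    DynamoDbConfiguration(sparse_outstanding_index=True), "orders_0", 1_700_000_000
)
print(legacy["KeyConditionExpression"], "|", legacy.get("FilterExpression"))
print(sparse["KeyConditionExpression"], "|", sparse.get("FilterExpression"))
```

The only differences between the two modes are the sort key used in the key condition and the presence of the filter expression, which is what keeps the opt-in path small.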