kaloianm/workscripts

Defragmentation script stuck on overlapping range deleter task

Closed this issue · 3 comments

Nothing prevents the script from moving a range back and forth between two shards. If this happens, the second migration gets stuck waiting for the range deleter of the first migration to complete. Since the default range deleter task execution delay is 15 minutes, the script stalls for 15 minutes.

Imagine the following scenario, with an old chunk size of 1MB and a new target chunk size of 4MB:

Shard0: [0, 10], [30, 40]
Shard1: [10, 20]
Shard2: [20, 30]

Action 1 move [0, 10] Shard0 -> Shard1
Note: for some reason we don't move [30, 40]; this could be because, at this point, Shard2 is overloaded with respect to Shard0.

Shard0: [30, 40]
Shard1: [0, 20]
Shard2: [20, 30]

Action 2 move [0, 20] Shard1 -> Shard2

Shard0: [30, 40]
Shard1: []
Shard2: [0, 30]

Action 3 move [0, 30] Shard2 -> Shard0

The problem is that Shard0 still needs to range-delete [0, 10], so the migration of [0, 30] gets stuck for 15 minutes.
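
The three actions above can be replayed with a small model to see where the overlap arises. This is only an illustrative sketch, not code from the defragmentation script: the `Shard` class, the `move` helper, and the half-open treatment of ranges are assumptions made for the example.

```python
from dataclasses import dataclass, field

Range = tuple[int, int]  # assumed half-open [min, max) range on the shard key


def overlaps(a: Range, b: Range) -> bool:
    """Return True when two half-open ranges share at least one key."""
    return a[0] < b[1] and b[0] < a[1]


@dataclass
class Shard:
    name: str
    # Ranges this shard has donated but not yet range-deleted.
    pending_deletes: list[Range] = field(default_factory=list)


def move(rng: Range, donor: Shard, recipient: Shard) -> list[Range]:
    """Model one migration and return the recipient's pending deletes that block it."""
    blockers = [p for p in recipient.pending_deletes if overlaps(p, rng)]
    # The donor keeps the moved documents as orphans until its delayed
    # range deleter task runs.
    donor.pending_deletes.append(rng)
    return blockers


shard0, shard1, shard2 = Shard("Shard0"), Shard("Shard1"), Shard("Shard2")

blocked_1 = move((0, 10), shard0, shard1)  # Action 1
blocked_2 = move((0, 20), shard1, shard2)  # Action 2
blocked_3 = move((0, 30), shard2, shard0)  # Action 3

print(blocked_1, blocked_2, blocked_3)  # [] [] [(0, 10)]
```

Only the third move reports a blocker: Shard0 donated [0, 10] in Action 1 and has not cleaned it up yet, so the incoming [0, 30] overlaps that pending deletion.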

As a workaround, we can set the orphanCleanupDelaySecs parameter to 0 to eliminate the delay.
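
A rough sketch of applying the workaround from Python follows. It assumes `shard_primary` is a `pymongo.MongoClient` connected directly to a shard's primary `mongod` (not to `mongos`) and that the parameter can be changed at runtime on the server version in use. The helper name and the host in the usage comment are hypothetical.

```python
def set_orphan_cleanup_delay(shard_primary, delay_secs=0):
    """Set orphanCleanupDelaySecs on one shard primary via setParameter.

    shard_primary is assumed to be a pymongo.MongoClient (or any object
    exposing the same `.admin.command(...)` call) connected to that mongod.
    """
    return shard_primary.admin.command(
        "setParameter", 1, orphanCleanupDelaySecs=delay_secs
    )


# Hypothetical usage, repeated for every shard in the cluster:
#   from pymongo import MongoClient
#   set_orphan_cleanup_delay(MongoClient("mongodb://shard0-primary.example:27018"))
```

The delay exists to give in-progress queries on the donor time to finish before the documents are deleted, so a value of 0 is probably best limited to the duration of the defragmentation run.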