AzureCosmosDB/data-migration-desktop-tool

Throttling issue on Cosmos DB sink

codingbandit opened this issue · 2 comments

Ingesting JSON data quickly (from an Azure VM into Cosmos DB) causes throttling even with the retry policy in place, and the failures are silent: not all of the data is ingested, but no error is reported. I was unable to reproduce this locally under the debugger because of the added latency.

Bumping the throughput to 2000 RU (the default is 1000) resolved the missing-data issue.
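For a sense of scale, here is a rough capacity estimate of why the default throughput throttles a fast bulk load. The ~7 RU cost per insert is an assumption (a typical figure for ~1 KB documents), not something measured in this issue; the real cost depends on document size and indexing:

```python
# Back-of-the-envelope estimate of sustainable insert rate at a given RU/s.
# The ru_per_insert default of 7.0 is an assumed typical cost for a ~1 KB
# document, not a value reported in this issue.
def max_inserts_per_second(provisioned_ru: int, ru_per_insert: float = 7.0) -> float:
    return provisioned_ru / ru_per_insert

for ru in (1000, 2000):
    print(f"{ru} RU/s -> ~{max_inserts_per_second(ru):.0f} inserts/s")
```

Anything faster than that sustained rate on the source side has to be absorbed by retries.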

JSON Source: https://cosmosdbcosmicworks.blob.core.windows.net/cosmic-works-small/customer.json

Configuration:

{
  "Source": "JSON",
  "Sink": "Cosmos-nosql",
  "Operations": [
    {
      "SourceSettings": {
        "FilePath": "https://cosmosdbcosmicworks.blob.core.windows.net/cosmic-works-small/product.json"
      },
      "SinkSettings": {
        "ConnectionString": "{{cosmosConnectionString}}",
        "Database":"database",
        "Container":"product",
        "PartitionKeyPath":"/categoryId",
        "RecreateContainer": false,
        "BatchSize": 100,
        "ConnectionMode": "Gateway",
        "MaxRetryCount": 5,
        "InitialRetryDurationMs": 200,
        "CreatedContainerMaxThroughput": 1000,
        "UseAutoscaleForCreatedContainer": true,
        "WriteMode": "InsertStream",
        "IsServerlessAccount": false
      }
    },
    {
      "SourceSettings": {
        "FilePath": "https://cosmosdbcosmicworks.blob.core.windows.net/cosmic-works-small/customer.json"
      },
      "SinkSettings": {
        "ConnectionString": "{{cosmosConnectionString}}",
        "Database":"database",
        "Container":"customer",
        "PartitionKeyPath":"/customerId",
        "RecreateContainer": false,
        "BatchSize": 100,
        "ConnectionMode": "Gateway",
        "MaxRetryCount": 5,
        "InitialRetryDurationMs": 200,
        "CreatedContainerMaxThroughput": 1000,
        "UseAutoscaleForCreatedContainer": true,
        "WriteMode": "InsertStream",
        "IsServerlessAccount": false
      }
    }
  ]
}

Test to see whether all 'customer'-type records are ingested. Expecting 100 from this query:
SELECT COUNT(1) FROM c WHERE c.type = 'customer'

Ran into this exact issue this week. Missed a lot of records when being throttled :(

Luckily, after I increased the retry count and wait time, re-running the sync (with Insert/InsertStream) eventually got all records copied over.
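In config terms, that means raising the two retry settings in SinkSettings; the exact values below are illustrative, not the ones I actually used:

```json
"MaxRetryCount": 20,
"InitialRetryDurationMs": 1000
```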

@bowencode your PR would be very handy to see how much was missed, but is there any other guidance that could help reduce this problem?

In general there are a lot of variables, so it's hard to specify numbers that will always work. If you assume the input data can be read infinitely fast, then no matter how high the throughput is on the output side, there will be failed requests at some point. This is mitigated by the fact that records are written serially and retried before the following records are written.

This means that even with very low RU settings, increasing the retry count and duration will eventually allow all records to transfer. The tradeoff is a decrease in performance that may be unacceptable for other scenarios, so choosing default values is a balancing act. Accurately reporting failures should at least let jobs without sufficient bandwidth surface a clear notification of the problem.
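As a sketch of that serial write-and-retry behavior: the 30% throttle rate, the doubling back-off, and the parameter names below mirror the config in this issue but are otherwise assumptions for illustration, not the tool's actual implementation:

```python
import random

def write_with_retry(max_retry_count: int = 5, initial_retry_ms: int = 200,
                     throttle_rate: float = 0.3,
                     rng: random.Random = random.Random(42)) -> bool:
    """Simulate writing one record serially: retry on a simulated 429 with a
    doubling back-off delay (no actual sleep here), and report whether the
    record eventually lands within the retry budget."""
    delay_ms = initial_retry_ms
    for _ in range(max_retry_count + 1):   # first attempt plus retries
        if rng.random() >= throttle_rate:  # simulated non-throttled response
            return True
        delay_ms *= 2                      # assumed exponential back-off
    return False

# With retries, effectively every record lands even under heavy throttling...
lost_with_retries = sum(not write_with_retry() for _ in range(100))
# ...while a single attempt silently drops roughly throttle_rate of them.
lost_without = sum(not write_with_retry(max_retry_count=0) for _ in range(100))
print(lost_with_retries, lost_without)
```

The point of the sketch: each failed attempt is retried before the next record starts, so a higher retry budget converts silent data loss into slower (but complete) transfer.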