Azure/azure-cosmosdb-spark

Optimal configurations for bulk import of a batch of 100,000 to 1,000,000 documents

Sandy247 opened this issue · 1 comment

Could anyone help me with the optimal configurations for the connector when inserting documents from DataFrames ranging in size from 100,000 to 1,000,000 rows? The Spark cluster can autoscale to 20 worker nodes, and the Cosmos DB database can autoscale to about 4,000 RU/s (shared throughput between containers, although the container this data is going to is the hottest). Also, what throughput setting for Cosmos DB and what node count for Spark would you recommend if I wanted to insert 1,000,000 records in around 5-10 minutes? I appreciate any help on this. Thanks.

What is the size of each document? Is this a one-time process or a daily process?
Before you start, you should probably read this article if you haven't already: https://docs.microsoft.com/en-us/azure/cosmos-db/optimize-cost-reads-writes#optimizing-writes
A few things that are important:

  1. Size of each document
  2. The choice of partition key
  3. Indexing

1,000,000 documents in 10 minutes works out to roughly 1,667 docs per second. I will assume each doc is under 1 KB. The Microsoft docs tell us that writing a 1 KB doc with no indexing costs around 5.5 RU. If you need to query the data later you will probably want indexing on, but restrict it to specific paths; by default Cosmos DB indexes every property. Even at 5.5 RU per write you will need 1,667 × 5.5 RU/s, which is around 9K RU/s. If your docs are larger than 1 KB and/or you keep full indexing on, you will need more than this.
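To make that arithmetic easy to rerun with your own numbers, here is a minimal sketch. The 5.5 RU charge and the 10-minute window are the assumptions from the paragraph above, and `requiredRuPerSecond` is just a hypothetical helper; substitute the request charge you actually observe for your documents (the `x-ms-request-charge` response header from a test write).

```scala
// Back-of-the-envelope sizing: sustained RU/s needed to land docCount documents
// in the given number of minutes, assuming a fixed RU charge per write.
object RuEstimate {
  def requiredRuPerSecond(docCount: Long, minutes: Double, ruPerWrite: Double): Double =
    (docCount / (minutes * 60.0)) * ruPerWrite

  def main(args: Array[String]): Unit = {
    // 1,000,000 docs in 10 minutes at ~5.5 RU each comes to roughly 9,167 RU/s
    println(f"${requiredRuPerSecond(1000000L, 10.0, 5.5)}%.0f RU/s sustained")
  }
}
```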
When adjusting the RU/s, I would suggest keeping an eye on the Metrics blade to see whether you are hitting the provisioned peak (throttled/429 requests) and adjusting accordingly.
Assuming this is a one-time or batch process, bring the RU/s back down afterwards to avoid extra costs.
The speed of ingestion also depends on how many physical partitions you have; if you set the RU/s to more than 10K, Cosmos DB will automatically start splitting partitions.
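For the connector side of the question, a minimal write sketch in the style of the repo README is below. Treat the endpoint, key, database/collection names, the staging path, and the `WritingBatchSize` value as placeholders to tune, not recommendations: start with a modest batch size, watch the portal Metrics for throttling, and scale batch size, Spark node count, and RU/s together.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}
import com.microsoft.azure.cosmosdb.spark._
import com.microsoft.azure.cosmosdb.spark.schema._   // enables df.write.cosmosDB(...)
import com.microsoft.azure.cosmosdb.spark.config.Config

val spark = SparkSession.builder().appName("cosmos-bulk-import").getOrCreate()
val df = spark.read.parquet("/path/to/staging")       // your 100K-1M row DataFrame

// Write configuration -- every value below is a placeholder.
val writeConfig = Config(Map(
  "Endpoint"         -> "https://<your-account>.documents.azure.com:443/",
  "Masterkey"        -> "<your-key>",
  "Database"         -> "<your-database>",
  "Collection"       -> "<your-collection>",
  "Upsert"           -> "true",    // makes re-runs of the job idempotent
  "WritingBatchSize" -> "1000"     // docs per batch per task; lower it if you see 429s
))

df.write.mode(SaveMode.Append).cosmosDB(writeConfig)
```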