Very slow writing performance
SandyChapman opened this issue · 6 comments
Would it be possible to share a sample dataset, the Cosmos DB config, and the code snippet being used, so we can look into this?
Any updates here?
This should be an important issue, since we can't use this library in production with such poor performance.
I can't share my dataset, but it has fewer than 200 items, and it is taking minutes to upsert those ~200 items.
Here is the write configuration:
```python
writeConfig = {
    "Endpoint": "",
    "Masterkey": "",
    "Database": "",
    "Collection": "",
    "Upsert": "true"
}

df.write.format("com.microsoft.azure.cosmosdb.spark").mode("append").options(**writeConfig).save()
```
Cosmos DB config:
- Throughput: 1000 RU/s
- Partition key: /date (yyyyMMdd)
- Consistency level: Session
Something that helped me a bit was to manually `.repartition()` the data to the number of workers. I'm writing around 1.5M data points with `WritingBatchSize = 1000` and `ConnectionMaxPoolSize = 100`. It reduced the writing time from 13 minutes to 9 minutes, and the RU consumption is also much more constant; without the `.repartition()` I see spikes from time to time.
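For reference, a minimal sketch of that setup, assuming the option names described above for the azure-cosmosdb-spark connector; the placeholder credentials and the worker count are assumptions, not values from this thread:

```python
# Sketch only: endpoint/key/database/collection are placeholders.
writeConfig = {
    "Endpoint": "https://<account>.documents.azure.com:443/",
    "Masterkey": "<master-key>",
    "Database": "<database>",
    "Collection": "<collection>",
    "Upsert": "true",
    "WritingBatchSize": "1000",       # documents sent per bulk write request
    "ConnectionMaxPoolSize": "100",   # connection pool size per executor
}

num_workers = 8  # assumption: set this to the number of workers/executor cores

(df.repartition(num_workers)          # spread writes evenly across executors
   .write
   .format("com.microsoft.azure.cosmosdb.spark")
   .mode("append")
   .options(**writeConfig)
   .save())
```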
I ended up just using the Cosmos Python library and did a foreachPartition to leverage parallel execution on the cluster.
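A hedged sketch of that approach, assuming the azure-cosmos (v4) Python SDK; the endpoint, key, database, and container names are placeholders:

```python
from azure.cosmos import CosmosClient

ENDPOINT = "https://<account>.documents.azure.com:443/"
KEY = "<master-key>"

def upsert_partition(rows):
    # One client per partition, so each executor task opens its own connection.
    client = CosmosClient(ENDPOINT, credential=KEY)
    container = (client
                 .get_database_client("<database>")
                 .get_container_client("<collection>"))
    for row in rows:
        # Each document needs an "id" field and the partition key property.
        container.upsert_item(body=row.asDict())

df.foreachPartition(upsert_partition)
```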
- Spark is not a good tool for handling small data. A few minutes per request, independent of the data size, is something to expect; it should scale better with larger data, but a few minutes for writing is still normal.
- For others, I recommend looking at the max RU/s setting, as it defaults to 4000 (I think) and could become the bottleneck. To confirm whether it is, look at the Throttled Requests (429s) and Normalized RU Consumption (max) metrics in the Azure portal.
Has anyone found a solution using Spark?