Very slow writing performance
SandyChapman opened this issue · 6 comments
Would it be possible to share a sample dataset, the Cosmos DB config, and the code snippet being used, so we can look into this?
Any updates here?
This should be an important issue, since we can't use this library in production with such poor performance.
I can't share my dataset, but it has fewer than 200 items, and it is taking minutes to upsert those ~200 items.
Here is the write configuration:
```python
writeConfig = {
    "Endpoint": "",
    "Masterkey": "",
    "Database": "",
    "Collection": "",
    "Upsert": "true"
}

df.write.format("com.microsoft.azure.cosmosdb.spark").mode("append").options(**writeConfig).save()
```
Cosmos DB config:
- Throughput: 1000 RU/s
- Partition key: /date (yyyyMMdd)
- Consistency level: Session
Something that helped me a bit was to manually `.repartition()` the data to the number of workers. I'm writing around 1.5M data points with `WritingBatchSize = 1000` and `ConnectionMaxPoolSize = 100`. It reduced the writing time from 13 minutes to 9 minutes, and the RU consumption is also much more constant; without the `.repartition()` I see spikes from time to time.
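For reference, a minimal sketch of that setup, assuming the option names described above for the azure-cosmosdb-spark connector; the placeholder credentials and the worker count are assumptions, not values from this thread:

```python
# Sketch only: endpoint/key/database/collection are placeholders.
writeConfig = {
    "Endpoint": "https://<account>.documents.azure.com:443/",
    "Masterkey": "<master-key>",
    "Database": "<database>",
    "Collection": "<collection>",
    "Upsert": "true",
    "WritingBatchSize": "1000",       # documents sent per bulk write request
    "ConnectionMaxPoolSize": "100",   # connection pool size per executor
}

num_workers = 8  # assumption: set this to the number of workers/executor cores

(df.repartition(num_workers)          # spread writes evenly across executors
   .write
   .format("com.microsoft.azure.cosmosdb.spark")
   .mode("append")
   .options(**writeConfig)
   .save())
```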
I ended up just using the Cosmos Python library and did a foreachPartition to leverage parallel execution on the cluster.
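A hedged sketch of that approach, assuming the azure-cosmos (v4) Python SDK; the endpoint, key, database, and container names are placeholders:

```python
from azure.cosmos import CosmosClient

ENDPOINT = "https://<account>.documents.azure.com:443/"
KEY = "<master-key>"

def upsert_partition(rows):
    # One client per partition, so each executor task opens its own connection.
    client = CosmosClient(ENDPOINT, credential=KEY)
    container = (client
                 .get_database_client("<database>")
                 .get_container_client("<collection>"))
    for row in rows:
        # Each document needs an "id" field and the partition key property.
        container.upsert_item(body=row.asDict())

df.foreachPartition(upsert_partition)
```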
- Spark is not a good tool for handling small data. A few minutes per request, independent of the data size, is something to expect; it should scale better with larger data, but a few minutes for writing is still normal.
- For others, I recommend looking at the max RU/s setting, as it defaults to 4000 (I think) and could become the bottleneck. To confirm whether it is, look at the Throttled Requests (429s) and Normalized RU Consumption (max) metrics in the Azure portal.
Has anyone found a solution using Spark?