audienceproject/spark-dynamodb

WCUs constantly dropping after a while


I am writing data from an S3 file to DynamoDB, and I would expect the write rate to stay constant once the Spark job establishes the connection and starts writing. In my case, however, the consumed write capacity starts dropping after a certain time. I have run the job multiple times and see the same behavior each time.

Here is the CloudWatch metric for the consumed WCUs on the table:
[CloudWatch screenshot: consumed WCUs over time]

I have tried setting a constant throughput using the "throughput" parameter and also tried reducing the number of worker nodes in my Spark cluster to just two. I still see the same behavior in both cases. Does the write throughput change dynamically during the write?
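For context, the write is set up roughly like this (a sketch; the S3 path, table name, and throughput value are placeholders):

```scala
import org.apache.spark.sql.SparkSession
import com.audienceproject.spark.dynamodb.implicits._

val spark = SparkSession.builder().appName("s3-to-dynamodb").getOrCreate()

// Read the source data from S3 (path is a placeholder).
val df = spark.read.parquet("s3://my-bucket/input/")

// Write to DynamoDB with a fixed throughput hint for the connector
// (the "throughput" option mentioned above; table name is a placeholder).
df.write
  .option("throughput", "1000")
  .dynamodb("my-table")
```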

Hi @muneerulhudha,

I suspect this is caused by a combination of two issues. First, the throughput calculations are made very early, before the actual work of writing to DynamoDB begins. Second, it looks like you have unbalanced partitions. If you have 2 partitions and one of them is significantly larger than the other, each partition is allocated 50% of the throughput, so the table only consumes 100% of the target while both partitions are still writing; once the smaller partition finishes, consumption drops to 50% until the larger partition is done.

partition 1 (small): -----                 50%
partition 2 (large): --------------------- 50%
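One way to check this is to count the rows per Spark partition before writing, for example (a sketch, assuming `df` is the DataFrame you are writing):

```scala
import org.apache.spark.sql.functions.spark_partition_id

// Row count per Spark partition; a skewed distribution means some write
// tasks finish early, and the consumed WCUs drop when they do.
df.groupBy(spark_partition_id().alias("partition"))
  .count()
  .orderBy("partition")
  .show(df.rdd.getNumPartitions, truncate = false)
```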

I hope this makes sense. If it does, please confirm; otherwise I will try to explain it another way.

So if I repartition the DataFrame before writing to DynamoDB, will that help?

Yes, if you repartition in such a way that you end up with similarly sized partitions, I believe you won’t have a problem anymore.
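For example (a sketch, assuming `df` is the DataFrame being written and the table name and throughput value are the same placeholders as above):

```scala
import com.audienceproject.spark.dynamodb.implicits._

// Round-robin repartition into similarly sized partitions before the write
// (200 is just an example count; tune it to your cluster and data volume).
val balanced = df.repartition(200)

balanced.write
  .option("throughput", "1000")
  .dynamodb("my-table")
```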

Hi @muneerulhudha, have you had success working around the issue?

I did a repartition, but that did not help either. I haven't had time to investigate further. I am writing slightly more than 3 million rows; all rows are approximately the same size and have a pretty decent partition key, so I don't think partition size is the problem here. When I have more time next week, I will investigate further. It's probably something to do with the way my Spark cluster handles it rather than an issue with the library. I will keep this issue updated with my findings.