RedisLabs/spark-redis

DataFrame write for a wide table with millions of records is much slower than expected

jakesmart-drizly opened this issue

The problem: I am attempting to write a 2.5MM-row table with 500 columns to Redis running on an r5.xlarge (~25 GB of RAM), from a 20-node EMR cluster, using the following command in PySpark:
user_recs.write.format("org.apache.spark.sql.redis").option("table", "recs:user").option("key.column", "user_id").option("ttl", redisKeyTtlSeconds).mode("append").save()

and it is taking 2.5 hours, which is slower than I would expect given the data volume. Is there something happening under the hood of spark-redis that is slowing this down? I don't think it's a provisioning or memory issue, as adding more cores, RAM, etc. hasn't sped the job up.
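In case it helps narrow things down, my (possibly wrong) reading of the spark-redis docs is that with the default hash persistence model every row becomes a Redis hash with one field per column, so a 500-column table fans out into roughly rows × columns individual field writes. If that is right, a variation I could try looks like the sketch below; the `model` and `max.pipeline.size` options are taken from the docs, and whether they actually help this workload is an assumption on my part.

```python
# Sketch only -- "model" and "max.pipeline.size" are write options listed in the
# spark-redis docs; whether they help here is an assumption, not something I've verified.
(user_recs.write.format("org.apache.spark.sql.redis")
    .option("table", "recs:user")
    .option("key.column", "user_id")
    .option("ttl", redisKeyTtlSeconds)
    # "binary" serializes each row as a single value instead of one hash field per column
    .option("model", "binary")
    # batch more commands per pipeline round trip (the documented default appears to be 100)
    .option("max.pipeline.size", 1000)
    .mode("append")
    .save())
```

My understanding is that data written with the binary model can only be read back through spark-redis rather than as plain hashes, so that trade-off would matter for any other consumers of these keys.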

Job configuration:
This command is running in:

  • Spark 2.4.5
  • com.redislabs:spark-redis:2.4.0

on an EMR cluster:

  • 20 × c5.xlarge (4 cores, 8 GB RAM each)

with:
  • --conf spark.executor.instances=39
  • --conf spark.yarn.executor.memoryOverhead=1024
  • --conf spark.executor.memory=2g
  • --conf spark.yarn.driver.memoryOverhead=1024
  • --conf spark.driver.memory=2g
  • --conf spark.executor.cores=1
  • --conf spark.driver.cores=1
  • --conf spark.default.parallelism=780
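For completeness, here is a sketch of how the same settings could be expressed when building the session directly. The spark.redis.* connection keys come from the spark-redis docs; the host and port shown are placeholders, not my actual values.

```python
from pyspark.sql import SparkSession

# Placeholder host/port; the other settings mirror the --conf flags above.
spark = (SparkSession.builder
    .appName("user-recs-redis-write")  # hypothetical app name
    .config("spark.executor.instances", "39")
    .config("spark.yarn.executor.memoryOverhead", "1024")
    .config("spark.executor.memory", "2g")
    .config("spark.yarn.driver.memoryOverhead", "1024")
    .config("spark.driver.memory", "2g")
    .config("spark.executor.cores", "1")
    .config("spark.driver.cores", "1")
    .config("spark.default.parallelism", "780")
    .config("spark.redis.host", "redis.example.internal")  # placeholder
    .config("spark.redis.port", "6379")                    # placeholder
    .getOrCreate())
```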

Looking through the Spark logs, it looks like the data is well partitioned:
each task covers approximately 10K rows, which works out to roughly 5MM individual field writes per task given the 500-column width.
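Spelling out the arithmetic behind those numbers (assuming the hash model, i.e. one hash field written per column of each row):

```python
# Back-of-the-envelope figures for the job described above.
rows_total = 2_500_000
columns = 500
rows_per_task = 10_000

fields_per_task = rows_per_task * columns  # 5,000,000 field writes per task
fields_total = rows_total * columns        # 1,250,000,000 field writes overall
print(f"{fields_per_task:,} per task, {fields_total:,} overall")
```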

I have tried a different cluster configuration (17 executors with 6 cores and 30 GB of memory each, on 3 nodes with 32 cores and 256 GB of RAM apiece), and it takes roughly the same amount of time.

Any help would be much appreciated.