YotpoLtd/metorikku

performance tuning with partitionBy?

tooptoop4 opened this issue · 0 comments

I have an input.yaml that defines 5 data frames, each read from a different Parquet path on S3; the total size across all 5 combined is 35 GB. The metric YAML has a single step: a SQL query that joins the 5 data frames together and finally writes the result to Parquet on S3, with partitionBy defined on 2 columns (leading to a total of 80 partitions).

Right now this takes 5 hours to run (using executor memory of 90 GB, 16 cores per executor, and total-executor-cores of 80). Can you recommend any options to make it run faster? I was thinking there might be some shuffle/cache/repartition type options I should use.
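To make it concrete, here is a stripped-down sketch of what I mean (table, column, and path names are placeholders, not my real ones, and the field names follow the metric-file schema as I understand it). The DISTRIBUTE BY line is the kind of change I am wondering about: it is a standard Spark SQL clause, and my assumption is that clustering the shuffle on the same 2 columns used in partitionBy would avoid a second shuffle and a flood of small files at write time.

```yaml
# Hypothetical metric YAML approximating my setup; the real query
# joins all 5 data frames, only 2 are shown here for brevity.
steps:
  - dataFrameName: joined
    sql: |-
      SELECT a.*, b.some_col
      FROM df1 a
      JOIN df2 b ON a.id = b.id
      -- Cluster the shuffle output on the same columns used by
      -- partitionBy below (an idea I'm considering, not something
      -- I've confirmed helps in metorikku).
      DISTRIBUTE BY a.part_col1, a.part_col2

output:
  - dataFrameName: joined
    outputType: Parquet
    outputOptions:
      saveMode: Overwrite
      path: s3://my-bucket/output/
      partitionBy:
        - part_col1
        - part_col2
```

Since DISTRIBUTE BY introduces a shuffle, I assume spark.sql.shuffle.partitions would control its parallelism; the Spark default of 200 may not be the right number for 35 GB across 80 cores.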