PhVHoang/TIL

[Revise] Data Shuffling


Shuffling in distributed compute is almost always the cause of slow queries!

What is shuffle?
Shuffle is when data that's split across machines needs to be reorganized based on some sort of key. For example, GROUP BY user_id causes a shuffle to guarantee that all the data for each user_id ends up on the same machine!
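To make that concrete, here's a toy sketch in plain Python (not Spark itself) of what a shuffle does: each row is routed to a machine chosen by hashing its key, so all rows for one user_id end up together.

```python
# Toy sketch of shuffle routing for GROUP BY user_id (hypothetical data).
from collections import defaultdict

def shuffle_by_key(rows, num_machines):
    """Send each (user_id, value) row to the machine picked by hashing the key."""
    machines = defaultdict(list)
    for user_id, value in rows:
        machines[hash(user_id) % num_machines].append((user_id, value))
    return machines

rows = [("alice", 1), ("bob", 2), ("alice", 3), ("carol", 4), ("bob", 5)]
partitions = shuffle_by_key(rows, num_machines=4)

# Every row for a given user_id lands on the same machine, so each machine
# can now aggregate its keys locally without talking to the others.
for machine, part in partitions.items():
    print(machine, part)
```

The expensive part in a real cluster is that this routing step moves data over the network between every machine.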

Why is shuffle slow?
The higher the cardinality of your key, the longer the shuffle is going to take. Shuffling can also cause skew to show up in your data set. For example, Beyoncé gets a lot more messages on Instagram than you do. The one machine that gets all of her messages is going to struggle.

This struggle is called skew. And if the skew is bad enough, the machine just gives up and cries. This is called an out-of-memory exception (i.e. OOM). OOM is one of the most painful DE errors to troubleshoot!
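Here's a toy sketch (plain Python, hypothetical data) of how one hot key skews a shuffle: the celebrity account's rows all hash to the same machine, which then carries vastly more load than its peers.

```python
# Toy sketch of skew: one hot key dominates a single partition.
from collections import Counter

# Hypothetical data: 1,000 messages for one celebrity, a few for everyone else.
messages = [("beyonce", i) for i in range(1000)] + [(f"user{i}", 0) for i in range(10)]

num_machines = 4
load = Counter(hash(user) % num_machines for user, _ in messages)

# The machine that draws the hot key processes ~100x more rows than the rest.
# In a real cluster, that's the machine headed toward an OOM.
print(load.most_common())
```

Because all rows for a key must land on one machine, adding more machines does not help; the hot key's partition stays just as big.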

How to fix slow shuffle?
There are a few ways to fix this.

  • remove shuffling
    This is possible through techniques like bucket join and broadcast join. Bucket join means that you write all of your data into user_id buckets ahead of time so you don’t have to shuffle again later. Bucket joins are very fast for large data sets.
    Broadcast join works only when one side of the data is small: so small that you can just send an entire copy to each machine. In practice this means the small data set should be under about 1 GB.
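The broadcast idea can be sketched in plain Python (the tables and slices below are hypothetical): the small side is copied whole to every machine, so the big side never moves and no shuffle happens.

```python
# Toy sketch of a broadcast join: small side copied to every machine.
small_side = {"u1": "alice", "u2": "bob"}  # hypothetical dimension table, far under 1 GB

# Each machine already holds one slice of the big fact table.
big_side_partitions = [
    [("u1", 10), ("u2", 20)],  # machine 0's slice
    [("u1", 30), ("u3", 40)],  # machine 1's slice
]

joined = []
for partition in big_side_partitions:
    local_copy = dict(small_side)  # the "broadcast": a full copy on this machine
    for user_id, amount in partition:
        if user_id in local_copy:  # inner join; u3 has no match and is dropped
            joined.append((user_id, local_copy[user_id], amount))

print(joined)
```

Each machine joins its own slice locally against its copy of the small table, which is why broadcast joins skip the shuffle entirely.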

  • fix skew
    You can fix skew in two ways.

  1. Remove the outlier records and process them separately.
  2. Enable adaptive query execution (AQE) in Spark (spark.sql.adaptive.enabled), which can detect and split skewed partitions automatically!
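Option 1 can be sketched in plain Python (hypothetical data and key names): pull the outlier key out, aggregate it on its own path, then union the results back together.

```python
# Toy sketch of handling a hot key separately from the normal shuffle path.
from collections import Counter

messages = [("beyonce", 1)] * 1000 + [("alice", 1)] * 3 + [("bob", 1)] * 2
hot_keys = {"beyonce"}  # outliers identified ahead of time (hypothetical)

# Normal path: shuffle-and-aggregate everything except the hot keys.
normal = Counter(k for k, _ in messages if k not in hot_keys)

# Separate path: aggregate the hot key on its own. In Spark you might
# broadcast-join or salt this slice instead of shuffling it.
hot = Counter(k for k, _ in messages if k in hot_keys)

# Union the two results: same answer as a plain GROUP BY, without the
# single overloaded machine.
counts = normal + hot
print(counts)
```

The result is identical to aggregating everything in one shuffle; the difference is that no single machine has to hold the entire hot key's data during the normal path.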

Sources

Zach Wilson

References

https://www.junaideffendi.com/blog/solving-data-skewness-in-spark/