YotpoLtd/metorikku

is there a way to ensure csv in input.yaml is read by multiple executors?

tooptoop4 opened this issue · 1 comment

In input.yaml we can define source files, e.g. abc.csv, that get loaded into a DataFrame. If this is a big file, say 300GB, will it all be read on a single executor? Or is there a way to set shuffle partitions, etc.?
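For context, a minimal sketch of the kind of input definition being asked about (the source name `mySource` and the path are made up for illustration; check the Metorikku README for the exact schema of the job configuration file):

```yaml
# Hypothetical Metorikku job configuration (input.yaml)
# declaring a single large CSV file as an input DataFrame.
inputs:
  mySource:
    file:
      path: s3://my-bucket/abc.csv   # assumed path, ~300GB in the question
```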

If you have only a single 300GB file that you load in Metorikku, you can add a first step in the metric file that uses DISTRIBUTE BY or CLUSTER BY to spread the data across partitions.
Check out the documentation here:
https://docs.databricks.com/spark/latest/spark-sql/language-manual/select.html
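A sketch of what such a first step could look like in the metric file (the step name, input name `mySource`, and key column are assumptions for illustration, not taken from this issue):

```yaml
# Hypothetical Metorikku metric file: the first step redistributes
# the freshly loaded input across executors before further processing.
steps:
  - dataFrameName: repartitioned_input
    sql: >-
      SELECT *
      FROM mySource          -- name assumed to match the input definition
      DISTRIBUTE BY some_key -- spreads rows across shuffle partitions by key
```

Subsequent steps would then read from `repartitioned_input` instead of `mySource`, so they operate on the redistributed data.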

But in any case, the first read will be done by a single executor.