
Cache vs. no cache when unioning tables


Context: unioning many tables across multiple time frames.
=> Should I cache the unioned table at every step, or let Spark do the work and cache only at the end (after breaking out of the loop)?
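For reference, a minimal sketch of the two strategies under discussion, assuming Spark 1.6's SQLContext and DataFrame.unionAll; the paths and date list are hypothetical stand-ins:

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="union-cache-test")
sqlContext = SQLContext(sc)

# Hypothetical time frames and path pattern (not the real HDFS locations).
dates = ["20171105", "20171106", "20171107"]
path = "/tmp/tc_histories_hcm_{}"

# Option A: cache after every union step.
unioned_a = None
for d in dates:
    df = sqlContext.read.parquet(path.format(d))
    unioned_a = df if unioned_a is None else unioned_a.unionAll(df)
    unioned_a = unioned_a.cache()  # nothing is materialized until an action runs

# Option B: keep the unions lazy and cache once after the loop.
unioned_b = None
for d in dates:
    df = sqlContext.read.parquet(path.format(d))
    unioned_b = df if unioned_b is None else unioned_b.unionAll(df)
unioned_b = unioned_b.cache()
```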

The count job runs for an hour and can't finish.
The workload is absurd.

(Attachment: pyspark-shell - Details for Job 10755.pdf)

The process failed due to an excessively long run time.

Continuing on this issue, sampling HCM data from 20171105 to 20171118:
Using the small pyspark kernel, no tasks failed. The driver is still alive, but GC overhead occurred when viewing job details. No executor abnormalities detected.
(Running the parquet union right after continuous_sampling.)

(Screenshot: 2018-01-30, 11:28:17 PM)

I don't cache at every union step, so the number of steps and the plan complexity just keep increasing.
Remember, this is a problem with the count() command; I might need to eliminate it.
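A hedged sketch of why count() hurts here, reusing the names from the sketch above: without a cache, every count() re-reads and re-unions everything accumulated so far, so the per-iteration cost keeps growing.

```python
# Reuses sqlContext, dates and path from the earlier sketch (illustrative only).
unioned = None
for d in dates:
    df = sqlContext.read.parquet(path.format(d))
    unioned = df if unioned is None else unioned.unionAll(df)
    # unioned.count()  # a per-iteration count recomputes the whole growing lineage

# Cheaper: materialize once at the end and count once, or drop the count entirely.
unioned.cache()
print(unioned.count())
```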

(Screenshots: 2018-01-31, 10:31:51 AM / 10:32:01 AM / 10:32:10 AM)

Trying to read the parquet files and write directly.

Reading files takes no time at all.
(Screenshot: 2018-01-31, 10:36:40 AM)

But the write takes a short delay to initialize its tasks. Also, the number of tasks is relatively small compared to the sum over the individual file reads:
Write job: 46,237 tasks.
Each of the counts above: ~20,000 tasks on average.
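A sketch of that read-then-write test (paths illustrative, reusing the names above): the reads are lazy and return immediately, and the single write action triggers all of the actual work.

```python
# Reuses sqlContext, dates and path from the sketches above.
dfs = [sqlContext.read.parquet(path.format(d)) for d in dates]  # lazy, returns instantly

out = dfs[0]
for df in dfs[1:]:
    out = out.unionAll(df)

# One action: the write kicks off the union of every input file in a single job.
out.write.mode("overwrite").parquet("/tmp/tc_histories_hcm_unioned")
```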

The DAG for the union is typical.
But 15 files like this might be the limit for the small kernel.
(Screenshot: 2018-01-31, 10:40:37 AM)

The job starts really slowly: 6.1 minutes in and no finished step recorded.

(Screenshot: 2018-01-31, 10:43:07 AM)

This delay is spent parsing the logical plan:
(Screenshot: 2018-01-31, 11:15:21 AM)
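To see how large the accumulated plan has grown before kicking off the job, it can be printed up front; in Spark 1.6, explain(True) shows the parsed, analyzed, and optimized logical plans plus the physical plan (using the `out` DataFrame from the sketch above).

```python
# extended=True prints every plan stage, not just the physical plan.
out.explain(True)
```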

But once the jobs start, it goes really fast.
Notice the task times in the stats: the 50th percentile is 32 s, the 75th percentile is 48 s, and the max is 2 s. This is a pretty OK result; it's just a bit of a pity that I can't really see which tasks take 2 s.

(Screenshot: 2018-01-31, 10:51:56 AM)

But at the end, the write above hits GC Overhead Limit Exceeded.

Is this a problem with the cluster itself?
To be fair, the file size of this sample is twice that of the 4WK file and 6 times that of the 2WK file from Oct:

216.5 M  649.6 M  /user/hdfs/ds_school/batch02/thangnt/tc_histories_hcm_20171105_18
66.0 M   197.9 M  /user/hdfs/ds_school/batch02/thangnt/tc_histories_hcm_rndsubs_18k
129.6 M  388.7 M  /user/hdfs/ds_school/batch02/thangnt/tc_histories_hcm_rndsubs_18k_4wk

Reran the write command and the tasks succeeded; the output file size is up to 790 M. This result is a little suspicious, given that tc_call_histories in Nov shows no significant change compared to Oct.

Repartitioning at each stage causes a lot of jobs to fail, so only repartition at the end.
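A minimal sketch of that pattern, repartitioning only once right before the final write (the partition count and output path are illustrative):

```python
# No repartition inside the union loop; a single repartition just before writing.
final = out.repartition(200)
final.write.mode("overwrite").parquet("/tmp/tc_histories_hcm_final")
```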

count() alone doesn't trigger Spark to collect any data to the driver when you cache(); use show() as well.
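The pattern implied here, assuming the same DataFrame name as above: run both actions after cache() so the cached data is materialized and a few rows can actually be eyeballed.

```python
out.cache()
out.count()   # scans every partition but returns only a number to the driver
out.show(5)   # also brings a few rows back so the content can be checked
```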

See how the DataFrame is cached in the Spark UI (Storage tab):
(Screenshot: 2018-07-09, 1:36:35 PM)