
Cache vs. no cache when unioning tables


Context: unioning many tables across multiple time frames.
=> Should I cache the unioned table at every step, or let Spark do the work and cache only at the end (after breaking out of the loop)?
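For reference, a minimal sketch of the two strategies under discussion, assuming Spark 1.6's SQLContext and DataFrame.unionAll; the paths and date list are hypothetical stand-ins:

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="union-cache-test")
sqlContext = SQLContext(sc)

# Hypothetical time frames and path pattern (not the real HDFS locations).
dates = ["20171105", "20171106", "20171107"]
path = "/tmp/tc_histories_hcm_{}"

# Option A: cache after every union step.
unioned_a = None
for d in dates:
    df = sqlContext.read.parquet(path.format(d))
    unioned_a = df if unioned_a is None else unioned_a.unionAll(df)
    unioned_a = unioned_a.cache()  # nothing is materialized until an action runs

# Option B: keep the unions lazy and cache once after the loop.
unioned_b = None
for d in dates:
    df = sqlContext.read.parquet(path.format(d))
    unioned_b = df if unioned_b is None else unioned_b.unionAll(df)
unioned_b = unioned_b.cache()
```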

The count job runs for an hour and can't finish.
The workload is absurd.

(Attachment: pyspark-shell - Details for Job 10755.pdf)

The process failed due to an excessively long run time.

Continuing on this issue, sampling HCM data from 20171105 to 20171118:
Using the small pyspark kernel, no tasks failed. The driver is still alive, but GC overhead occurred when viewing job details. No executor abnormalities detected.
(Running the parquet union right after continuous_sampling.)

(Screenshot: 2018-01-30, 11:28:17 PM)

I don't cache at every union step, so the number of steps and the plan complexity just keep increasing.
Remember, this is a problem with the count() command; I might need to eliminate it.
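A hedged sketch of why count() hurts here, reusing the names from the sketch above: without a cache, every count() re-reads and re-unions everything accumulated so far, so the per-iteration cost keeps growing.

```python
# Reuses sqlContext, dates and path from the earlier sketch (illustrative only).
unioned = None
for d in dates:
    df = sqlContext.read.parquet(path.format(d))
    unioned = df if unioned is None else unioned.unionAll(df)
    # unioned.count()  # a per-iteration count recomputes the whole growing lineage

# Cheaper: materialize once at the end and count once, or drop the count entirely.
unioned.cache()
print(unioned.count())
```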

(Screenshots: 2018-01-31, 10:31:51 AM / 10:32:01 AM / 10:32:10 AM)

Trying to read the parquet files and write directly.

Reading files takes no time at all.
(Screenshot: 2018-01-31, 10:36:40 AM)

But the write takes a short delay to initialize its tasks. Also, the number of tasks is relatively small compared to the sum over the individual file reads:
Write job: 46,237 tasks.
Each of the counts above: ~20,000 tasks on average.
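A sketch of that read-then-write test (paths illustrative, reusing the names above): the reads are lazy and return immediately, and the single write action triggers all of the actual work.

```python
# Reuses sqlContext, dates and path from the sketches above.
dfs = [sqlContext.read.parquet(path.format(d)) for d in dates]  # lazy, returns instantly

out = dfs[0]
for df in dfs[1:]:
    out = out.unionAll(df)

# One action: the write kicks off the union of every input file in a single job.
out.write.mode("overwrite").parquet("/tmp/tc_histories_hcm_unioned")
```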

The DAG for the union is typical.
But 15 files like this might be the limit for the small kernel.
(Screenshot: 2018-01-31, 10:40:37 AM)

The job starts really slowly: 6.1 minutes in and no finished step recorded.

(Screenshot: 2018-01-31, 10:43:07 AM)

This delay is spent parsing the logical plan:
(Screenshot: 2018-01-31, 11:15:21 AM)
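To see how large the accumulated plan has grown before kicking off the job, it can be printed up front; in Spark 1.6, explain(True) shows the parsed, analyzed, and optimized logical plans plus the physical plan (using the `out` DataFrame from the sketch above).

```python
# extended=True prints every plan stage, not just the physical plan.
out.explain(True)
```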

But once the jobs start, it goes really fast.
Notice the task times in the stats: the 50th percentile is 32 s, the 75th percentile is 48 s, and the max is 2 s. This is a pretty OK result; it's just a bit of a pity that I can't really see which tasks take 2 s.

(Screenshot: 2018-01-31, 10:51:56 AM)

But at the end, the write above hits GC Overhead Limit Exceeded.

Is this a problem with the cluster itself?
To be fair, the file size of this sample is twice that of the 4WK file and 6 times that of the 2WK file from Oct:

216.5 M  649.6 M  /user/hdfs/ds_school/batch02/thangnt/tc_histories_hcm_20171105_18
66.0 M   197.9 M  /user/hdfs/ds_school/batch02/thangnt/tc_histories_hcm_rndsubs_18k
129.6 M  388.7 M  /user/hdfs/ds_school/batch02/thangnt/tc_histories_hcm_rndsubs_18k_4wk

Reran the write command and the tasks succeeded; the output file size is up to 790 M. This result is a little suspicious, given that tc_call_histories in Nov shows no significant change compared to Oct.

Repartitioning at each stage causes a lot of jobs to fail, so only repartition at the end.
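A minimal sketch of that pattern, repartitioning only once right before the final write (the partition count and output path are illustrative):

```python
# No repartition inside the union loop; a single repartition just before writing.
final = out.repartition(200)
final.write.mode("overwrite").parquet("/tmp/tc_histories_hcm_final")
```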

count() alone doesn't trigger Spark to collect any data to the driver when you cache(); use show() as well.
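The pattern implied here, assuming the same DataFrame name as above: run both actions after cache() so the cached data is materialized and a few rows can actually be eyeballed.

```python
out.cache()
out.count()   # scans every partition but returns only a number to the driver
out.show(5)   # also brings a few rows back so the content can be checked
```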

See how the DataFrame is cached in the Spark UI (Storage tab):
(Screenshot: 2018-07-09, 1:36:35 PM)