cache vs not cache when union tables
Closed this issue · 9 comments
Context: unioning many tables that cover multiple time frames.
=> Should I cache the unioned table at every step, or let the loop run and cache once at the end (after breaking out of the loop)?
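A minimal sketch of the "cache once at the end" option, assuming the tables are PySpark DataFrames with identical schemas. The helper name `union_then_cache` is hypothetical; only the `.union()` and `.cache()` methods are used, both of which are lazy, so nothing is computed until the first action runs.

```python
from functools import reduce


def union_then_cache(frames):
    """Union every frame lazily, then mark only the final result for caching.

    Assumption: `frames` is a non-empty list of PySpark DataFrames whose
    columns line up by position (union() matches columns by position).
    """
    # union() only extends the logical plan; no data is read in this fold.
    combined = reduce(lambda left, right: left.union(right), frames)
    # cache() is lazy as well: the data is stored when the first action runs.
    return combined.cache()
```

Caching inside the loop would instead keep every intermediate union in memory, which is the trade-off the question is about.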
The count job ran for an hour and could not finish.
The workload is absurd.
The process failed because of the excessively long run time!
Trying to read the parquet files and write them out directly
Reading files takes no time at all.
But writing takes a little while to initialize the task. Also, the number of steps is relatively small compared to the sum of the steps for reading each file.
Write task: 46237
Whereas each of the counts above took ~20000 steps on average.
DAG for union is typical.
But 15 files like this might be the limit for a small kernel.
The job starts really slowly: 6.1 min in and no finished step recorded.
This delay is spent parsing the logical plan:
But once it starts the jobs, it goes really fast.
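One commonly suggested workaround for slow plan parsing is to union the tables pairwise, so the plan tree is about log2(n) levels deep instead of n. The sketch below is untested on this cluster and only an assumption about what might help; `balanced_union` is a hypothetical helper, and the inputs are assumed to be PySpark DataFrames with identical schemas.

```python
def balanced_union(frames):
    """Union frames pairwise so the plan depth grows as log2(n), not n.

    Assumption: each item exposes a PySpark-style .union(other) method.
    Row order of the inputs is preserved (left to right).
    """
    if not frames:
        raise ValueError("need at least one frame")
    level = list(frames)
    while len(level) > 1:
        paired = []
        # Union neighbours: (0, 1), (2, 3), ...
        for i in range(0, len(level) - 1, 2):
            paired.append(level[i].union(level[i + 1]))
        # An odd leftover is carried up to the next level unchanged.
        if len(level) % 2 == 1:
            paired.append(level[-1])
        level = paired
    return level[0]
```

For 15 inputs this builds a tree 4 unions deep, compared with 14 for a left-to-right loop.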
Notice the task time in the stats: 50% is 32s, 75% is 48s and max is 2s. This is a pretty OK result; it's just a pity that I can't really see which tasks take 2s.
But in the end the write above fails with GC Overhead Limit exceeded.
Is this a problem with the cluster itself?
To be fair, the file size of this sample is twice that of the 4WK file and 6 times that of the 2WK file in Oct:
216.5 M 649.6 M /user/hdfs/ds_school/batch02/thangnt/tc_histories_hcm_20171105_18
66.0 M 197.9 M /user/hdfs/ds_school/batch02/thangnt/tc_histories_hcm_rndsubs_18k
129.6 M 388.7 M /user/hdfs/ds_school/batch02/thangnt/tc_histories_hcm_rndsubs_18k_4wk
Reran the write command and the task succeeded. The file size is up to 790M, which is a little suspicious given that tc_call_histories in Nov shows no significant change compared to Oct.
Repartitioning at each stage causes a lot of jobs to fail, so only repartition at the end.
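A rough sketch of repartitioning once, just before the write. The 128 MB target per output file is an assumption (it matches the default HDFS block size in Hadoop 2 and later), and both helper names are hypothetical. `df` is assumed to be the final unioned PySpark DataFrame.

```python
import math


def target_partitions(total_mb, target_mb=128.0):
    """Partition count so each output file is about target_mb or smaller.

    Assumption: target_mb=128 is only a rule of thumb, not a tuned value.
    """
    return max(1, math.ceil(total_mb / target_mb))


def write_with_single_repartition(df, path, total_mb):
    """Repartition exactly once, right before writing parquet."""
    n = target_partitions(total_mb)
    df.repartition(n).write.parquet(path)
    return n
```

With the 790M output mentioned above, `target_partitions(790)` gives 7 partitions.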