cache input?
I read a 1.3 GB input csv.gz (25 GB uncompressed, 86M rows) and then run 4 metric SQLs on it. This takes 16 hours.
Would an option to add df.cache at line 30 below work?
I see this line 18 times in the Spark application log:
INFO FileScanRDD: Reading File path: s3a://redact/.csv.gz, range: 0-1345455747, partition values: [empty row]
(A .csv.gz is not splittable, so each of those scans is a single-partition pass over the entire file.)
You can do this inside the metric; just add a step with a lazy cache:
- dataFrameName: df_cached
  sql:
    CACHE LAZY TABLE table_name
You can read more about it here:
https://docs.databricks.com/spark/latest/spark-sql/language-manual/cache-table.html
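For context, here is a minimal sketch of a metric file with the lazy cache step placed before the actual queries. The input_table, metric_1, and output details are hypothetical placeholders (not names from this issue), loosely following the metric file layout used in Metorikku's examples:

steps:
# The cache step produces an empty result; it only marks input_table for lazy caching.
- dataFrameName: df_cached
  sql:
    CACHE LAZY TABLE input_table
# Subsequent queries over input_table read from the cache once it is materialized.
- dataFrameName: metric_1
  sql:
    SELECT some_key, COUNT(*) AS cnt
    FROM input_table
    GROUP BY some_key
output:
- dataFrameName: metric_1
  outputType: Parquet
  outputOptions:
    saveMode: Overwrite
    path: metric_1.parquet

Because the cache is lazy, the first action over input_table materializes it; the later metric SQLs should then read from the cache instead of re-scanning S3.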
@giladw after using your suggestion I see only a single read from S3, which is good, but the overall time is still the same. I see a lot of time spent in https://github.com/YotpoLtd/metorikku/blob/v0.0.47/src/main/scala/com/yotpo/metorikku/metric/Metric.scala#L86. How do I turn off instrumentation? I asked in #76.
To turn off counting and instrumentation, add the following to your job file:
cacheCountOnOutput: false
Please note that this was added in version 0.0.50.
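For reference, a sketch of where the flag sits in a job file; the metric path and input mapping are placeholders, and the exact job-file schema may vary between versions:

metrics:
- my_metric.yaml
inputs:
  input_table: s3a://bucket/input.csv.gz
# Per the comment above, disables the counting/instrumentation done on each output,
# which would otherwise trigger another full pass over the data.
cacheCountOnOutput: false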
It's not a problem; it's not doing anything special, just logging that the metric started.