YotpoLtd/metorikku

cache input?

Closed this issue · 5 comments

I read a 1.3 GB input csv.gz (25 GB uncompressed, 86 million rows) and then run 4 metric SQLs on it. It takes 16 hours.

Would an option to add df.cache at line 30 below work?

https://github.com/YotpoLtd/metorikku/blob/master/src/main/scala/com/yotpo/metorikku/input/readers/file/FilesInput.scala#L30
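Roughly, what I have in mind is something like the sketch below (not the actual FilesInput code, just an illustration of where a cache() call could go; the reader options are assumptions):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Illustrative only: a simplified reader that caches the DataFrame right after
// the read, so the gzipped file is scanned once instead of once per metric SQL.
def readInput(spark: SparkSession, path: String): DataFrame = {
  val df = spark.read
    .option("header", "true") // assumption: the csv.gz has a header row
    .csv(path)                // Spark decompresses .csv.gz transparently
  df.cache()                  // proposed addition; cache() returns the same DataFrame
}
```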

I see this line 18 times in the Spark application log:

```
INFO FileScanRDD: Reading File path: s3a://redact/.csv.gz, range: 0-1345455747, partition values: [empty row]
```

You can do this inside the metric: just add a step with a lazy cache:

```yaml
- dataFrameName: df_cached
  sql:
    CACHE LAZY TABLE table_name
```

You can read more about it here:
https://docs.databricks.com/spark/latest/spark-sql/language-manual/cache-table.html
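
For example, a metric file could look roughly like this, with the lazy-cache step placed before the real queries (this assumes the usual `steps:` list of `dataFrameName`/`sql` entries; all names below are placeholders):

```yaml
steps:
  # df_cached, metric_1, table_name and some_column are placeholder names
  - dataFrameName: df_cached
    sql:
      CACHE LAZY TABLE table_name
  - dataFrameName: metric_1
    sql:
      SELECT some_column, COUNT(*) AS total
      FROM table_name
      GROUP BY some_column
```

The `CACHE LAZY TABLE` step doesn't trigger a read by itself; the table is materialized into the cache the first time a later step actually uses it.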

@giladw after using your suggestion I see only a single read from S3, which is good, but the overall time is still the same. A lot of the time is spent in https://github.com/YotpoLtd/metorikku/blob/v0.0.47/src/main/scala/com/yotpo/metorikku/metric/Metric.scala#L86 - how do I turn off instrumentation? I asked on #76.

To turn off counting and instrumentation, add the following to your job file:

```yaml
cacheCountOnOutput: false
```

Please note that this was added in version 0.0.50.

But instrumentation will still happen in places like:

```scala
job.instrumentationClient.gauge(name="timer", value=elapsedTimeInNS, tags=Map("metric" -> metric.metricName))
```

right? @lyogev

It's not a problem; it's not doing anything special, just notifying that the metric started.