How to cache scramble tables in Spark?

Question

How to cache scramble tables in Spark?

Opened this issue 5 years ago · 2 comments

Answer 1 · 2019-04-03T14:22:46.000Z

The standard caching statement [1] should work when prefixed with bypass. For example:

verdict.sql('bypass cache table schema.scramble_table')

Disclaimer: We have not tested this yet, so I am not 100% certain.

[1] https://docs.databricks.com/spark/latest/spark-sql/language-manual/cache-table.html
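
To check independently of VerdictDB whether a table actually ended up in Spark's cache, something like the following minimal sketch might help. It uses only plain Spark APIs. The local SparkSession, the object name CacheCheck, and the temp view scramble_table are assumptions made for illustration; they stand in for the real cluster session and scramble table.

```scala
import org.apache.spark.sql.SparkSession

object CacheCheck {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session for illustration; use the existing session on a real cluster.
    val spark = SparkSession.builder()
      .appName("cache-check")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical stand-in for a scramble table.
    spark.range(1000).toDF("id").createOrReplaceTempView("scramble_table")

    // CACHE TABLE is eager unless LAZY is specified, so the table is materialized here.
    spark.sql("CACHE TABLE scramble_table")

    // Ask the catalog whether Spark considers the table cached.
    println(s"cached: ${spark.catalog.isCached("scramble_table")}")

    spark.sql("UNCACHE TABLE scramble_table")
    spark.stop()
  }
}
```

If the catalog reports the table as cached but query times do not change, the query may not be reading from the cached relation, or the scan may not be the bottleneck.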

Answer 2 · 2019-04-07T14:44:21.000Z

Sorry, it does not seem to work. I cached the lineitem scramble table as well as the verdictdbmeta table. The Spark UI shows that the tables are cached; however, TPC-H Q1 still takes the same amount of time as when the tables are not cached ...

Here's my code:

  verdict.setDefaultSchema(schema) // tpch1g
  verdict.sql("bypass cache table lineitem")
  verdict.sql("bypass cache table orders")
  verdict.sql("bypass cache table verdictdbmeta.verdictdbmeta")
  verdict.sql("bypass cache table lineitem_scramble")
  verdict.sql("bypass cache table orders_scramble")
  val q_verdict = spark.sparkContext.getConf.get("spark.verdictdb.query") // Q1, Q6, or Q14
  val rs_verdict = verdict.sql(q_verdict)
  rs_verdict.print()