databricks/simr

Data not cached in memory

jerrylam opened this issue · 1 comment

We are planning to use Spark to replace some of the computations we currently implement in Hadoop MR, and we would like to show the performance difference between Spark and Hadoop MR.

The test case is very simple:

val data = sc.textFile("/tmp/testdata.txt")
data.filter(line => line.contains("SOME_TEXT_STRING")).count
data.persist

I expected that if I ran the filter multiple times, the data would be cached and subsequent runs would be much faster. Unfortunately, we are not able to cache the data in memory, as shown in the logs below. Can anyone shed some light on this?

2014-02-03 14:44:28,997 INFO org.apache.spark.storage.MemoryStore: ensureFreeSpace(179432365) called with curMem=538649985, maxMem=658625003
2014-02-03 14:44:28,997 INFO org.apache.spark.storage.MemoryStore: Will not store rdd_1_35 as it would require dropping another block from the same RDD
2014-02-03 14:44:28,997 INFO org.apache.spark.storage.BlockManager: Dropping block rdd_1_35 from memory
2014-02-03 14:44:28,997 WARN org.apache.spark.storage.BlockManager: Block rdd_1_35 could not be dropped from memory as it does not exist
2014-02-03 14:44:29,000 INFO org.apache.spark.storage.BlockManagerMaster: Updated info of block rdd_1_35
2014-02-03 14:44:29,003 INFO org.apache.spark.storage.BlockManagerMaster: Updated info of block rdd_1_35
2014-02-03 14:44:29,158 INFO org.apache.spark.executor.Executor: Serialized size of result for 607 is 474
2014-02-03 14:44:29,158 INFO org.apache.spark.executor.Executor: Sending result for 607 directly to driver
2014-02-03 14:44:29,159 INFO org.apache.spark.executor.Executor: Finished task ID 607
2014-02-03 14:44:29,163 INFO org.apache.spark.executor.CoarseGrainedExecutorBackend: Got assigned task 614
2014-02-03 14:44:29,163 INFO org.apache.spark.executor.Executor: Running task ID 614
2014-02-03 14:44:29,166 INFO org.apache.spark.storage.BlockManager: Found block broadcast_0 locally
2014-02-03 14:44:29,169 INFO org.apache.spark.CacheManager: Partition rdd_1_42 not found, computing it

Try doing:

val data = sc.textFile("/tmp/testdata.txt").cache()
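For what it's worth, a minimal sketch of the intended pattern (same input path as above): cache()/persist() only mark an RDD for caching and are lazy, so they must be called before the first action; partitions are actually stored the first time an action computes them.

// Mark the RDD for caching *before* the first action.
val data = sc.textFile("/tmp/testdata.txt").cache()

// The first count reads from the file system and populates the cache.
data.filter(line => line.contains("SOME_TEXT_STRING")).count()

// Later actions on the same RDD read the cached partitions instead of recomputing.
data.filter(line => line.contains("SOME_TEXT_STRING")).count()

Note also that the MemoryStore line in your log suggests a second problem: curMem=538649985 plus the requested 179432365 bytes is about 718 MB, which exceeds maxMem=658625003 (about 659 MB). As the log says, Spark will not drop another block of the same RDD to make room, so it recomputes the partition instead. If the whole dataset needs to stay cached, you may need more executor memory, or a storage level that spills to disk, e.g. data.persist(StorageLevel.MEMORY_AND_DISK).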
