OOM at spark shell in local mode
bzz opened this issue · 9 comments
Hi, thank you for open sourcing this project.
I tried to run it with Spark 1.5.2 in local mode from the spark-shell on two datasets:
- a 300 MB .gz text file (2.1 GB uncompressed)
I consistently got an OOM (Java heap space); it does not matter whether the input is the single non-splittable .gz or the uncompressed text file:
15/12/15 12:27:14 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-0,5,main]
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2271)
at java.lang.StringCoding.safeTrim(StringCoding.java:79)
at java.lang.StringCoding.access$300(StringCoding.java:50)
at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:305)
at java.lang.StringCoding.encode(StringCoding.java:344)
at java.lang.StringCoding.encode(StringCoding.java:387)
at java.lang.String.getBytes(String.java:956)
at $line20.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:21)
at $line20.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:21)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
- an 8 MB text file: the same data as above, but only the first 200,000 lines
Every run on the 8 MB sample OOMs as well, this time with "GC overhead limit exceeded":
15/12/15 12:36:55 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 4, localhost): java.lang.OutOfMemoryError: GC overhead limit exceeded
at edu.berkeley.cs.succinct.buffers.SuccinctBuffer.constructNPA(SuccinctBuffer.java:445)
at edu.berkeley.cs.succinct.buffers.SuccinctBuffer.construct(SuccinctBuffer.java:307)
at edu.berkeley.cs.succinct.buffers.SuccinctBuffer.<init>(SuccinctBuffer.java:81)
at edu.berkeley.cs.succinct.buffers.SuccinctFileBuffer.<init>(SuccinctFileBuffer.java:30)
at edu.berkeley.cs.succinct.buffers.SuccinctIndexedFileBuffer.<init>(SuccinctIndexedFileBuffer.java:31)
at edu.berkeley.cs.succinct.buffers.SuccinctIndexedFileBuffer.<init>(SuccinctIndexedFileBuffer.java:42)
at edu.berkeley.cs.succinct.SuccinctRDD$.createSuccinctPartition(SuccinctRDD.scala:288)
The code I tried was:

import edu.berkeley.cs.succinct._

// read the text file and turn each line into a byte array
val wikiData = sc.textFile("....").map(_.getBytes)
// build the Succinct data structures in memory
val wikiSuccinctData = wikiData.succinct
// or persist them directly: wikiData.saveAsSuccinctFile(...)
I also tried adding the same parameters you use for spark-submit, like

./bin/spark-shell --executor-memory 1G --driver-memory 1G --packages amplab:succinct:0.1.6

with no success yet.
Please advise!
Hi,
Thanks for trying out Succinct Spark! I tried running this on my own 8MB dataset, and it works fine for me. I suspect running it for a 2.1GB dataset will require a larger heap size for the pre-processing step, so a 1G heap size might not suffice.
Can you try increasing the heap size for the executor to 2G and see if it works? We have a more optimized pre-processing algorithm in the 0.1.7 branch, and we plan to push it out to spark packages soon.
Thanks for the advice!
I was able to run 0.1.6 on the 8 MB dataset with Spark 1.5.x and a config like

./bin/spark-shell --conf "spark.storage.memoryFraction=0.2" --executor-memory 2G --driver-memory 2G --packages amplab:succinct:0.1.6
The resulting in-memory size in my case is 71 MB; the Spark UI storage tab reports:

RDD Name: MapPartitionsRDD | Storage Level: Memory Deserialized 1x Replicated | Cached Partitions: 2 | Fraction Cached: 100% | Size in Memory: 71.5 MB

Is that an expected result?
I glanced through the paper but could not find the relevant section about the data structure size; I'd appreciate it if you could point it out.
Thanks again for the great project; looking forward to 0.1.7!
The memory footprint reported by Spark isn't very accurate due to the way we store Succinct data structures and how Spark measures the memory footprint; there is going to be better support for computing the memory footprint of complex data structures in Spark 1.6.0, which we will leverage in the next release. For now, the best way to get the size of the data structures is to save the RDD to disk:
succinctRDD.save("...")
and look at the size of the serialized data structures.
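For instance, a minimal sketch of that measurement from the spark-shell (the output path here is a placeholder, and totaling the output directory with the Hadoop FileSystem API is my assumption about how one would do it):

import org.apache.hadoop.fs.{FileSystem, Path}

// Persist the Succinct data structures to disk (placeholder path).
succinctRDD.save("/tmp/succinct-out")

// Total the bytes written under the output directory;
// getContentSummary sums file sizes recursively.
val fs = FileSystem.get(sc.hadoopConfiguration)
val bytes = fs.getContentSummary(new Path("/tmp/succinct-out")).getLength
println(f"Serialized Succinct size: ${bytes / (1024.0 * 1024.0)}%.1f MB")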
The size of the data structures varies with the dataset (see the evaluation section in the paper), but for common datasets we typically see 2-4x the input size.
Thanks again for the feedback!
Great, thanks for the explanation; definitely looking forward to the next release :)
The size of the 8 MB dataset on the filesystem is 53 MB for me.
If you want, I can close this now, or I can report the result of the experiment with the 2 GB dataset here a bit later.
That's very interesting... I'll be curious to know the results for the 2GB dataset.
Is there any way I can take a look at the dataset?
Sorry for the delay. Here is the dataset I was using:

s3cmd get s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2015-40/indexes/cdx-00000.gz
gzcat cdx-00000.gz | head -20000 > 8mb.txt
gunzip -ck cdx-00000.gz > 2gb.txt
I ran into the same problem. The dataset I am using has a size of 2.3GB. I am running into OOM even with 24GB memory.
bin/spark-shell --jars succinct-spark-0.1.7-SNAPSHOT-jar-with-dependencies.jar --driver-memory 24g --executor-memory 24g
I am writing the dataset out as a Succinct file:
import edu.berkeley.cs.succinct._
val data = sc.textFile("/path/to/input").map(_.getBytes)
data.saveAsSuccinctFile("/path/to/output")
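One thing I may try next (just a guess on my part, not something verified here): since the stack trace above shows the Succinct structures are built one partition at a time (createSuccinctPartition), repartitioning the input into smaller chunks should lower the peak heap needed per partition. A minimal sketch, with an arbitrary partition count:

import edu.berkeley.cs.succinct._

// Hypothetical workaround: more (hence smaller) partitions mean each
// per-partition Succinct construction works on less data at once.
// 64 partitions is an arbitrary choice for a ~2.3GB input.
val data = sc.textFile("/path/to/input").repartition(64).map(_.getBytes)
data.saveAsSuccinctFile("/path/to/output")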
Closing this issue since it's resolved with the latest version of Succinct. Feel free to re-open if the problem re-surfaces.