amplab/succinct

OOM in spark-shell in local mode

bzz opened this issue · 9 comments

bzz commented

Hi, thank you for open sourcing this project.

I tried to run it with Spark 1.5.2 in local mode from the spark-shell on two datasets:

  1. A 300 MB .gz file (2.1 GB of uncompressed text).

    I consistently got an OOM (Java heap space); it did not matter whether the input was a single non-splittable .gz or an uncompressed text file:

    15/12/15 12:27:14 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-0,5,main]
    java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2271)
    at java.lang.StringCoding.safeTrim(StringCoding.java:79)
    at java.lang.StringCoding.access$300(StringCoding.java:50)
    at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:305)
    at java.lang.StringCoding.encode(StringCoding.java:344)
    at java.lang.StringCoding.encode(StringCoding.java:387)
    at java.lang.String.getBytes(String.java:956)
    at $line20.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:21)
    at $line20.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:21)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    
  2. An 8 MB text file: the same data as above, but only the first 200,000 lines.

Every time on the 8 MB sample I got an OOM again, GC overhead limit exceeded this time:

15/12/15 12:36:55 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 4, localhost): java.lang.OutOfMemoryError: GC overhead limit exceeded
at edu.berkeley.cs.succinct.buffers.SuccinctBuffer.constructNPA(SuccinctBuffer.java:445)
at edu.berkeley.cs.succinct.buffers.SuccinctBuffer.construct(SuccinctBuffer.java:307)
at edu.berkeley.cs.succinct.buffers.SuccinctBuffer.<init>(SuccinctBuffer.java:81)
at edu.berkeley.cs.succinct.buffers.SuccinctFileBuffer.<init>(SuccinctFileBuffer.java:30)
at edu.berkeley.cs.succinct.buffers.SuccinctIndexedFileBuffer.<init>(SuccinctIndexedFileBuffer.java:31)
at edu.berkeley.cs.succinct.buffers.SuccinctIndexedFileBuffer.<init>(SuccinctIndexedFileBuffer.java:42)
at edu.berkeley.cs.succinct.SuccinctRDD$.createSuccinctPartition(SuccinctRDD.scala:288)

The code I tried was

import edu.berkeley.cs.succinct._
val wikiData = sc.textFile("....").map(_.getBytes)
val wikiSuccinctData = wikiData.succinct
//or wikiData.saveAsSuccinctFile(...)

I also tried adding the same parameters as you use for spark-submit, like

./bin/spark-shell --executor-memory 1G --driver-memory 1G --packages amplab:succinct:0.1.6

with no success yet.

Please advise!

Hi,

Thanks for trying out Succinct Spark! I tried running this on my own 8MB dataset, and it works fine for me. I suspect running it for a 2.1GB dataset will require a larger heap size for the pre-processing step, so a 1G heap size might not suffice.

Can you try increasing the heap size for the executor to 2G and see if it works? We have a more optimized pre-processing algorithm in the 0.1.7 branch, and we plan to push it out to spark packages soon.
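
As a generic sanity check (nothing Succinct-specific), something like the following confirms from inside the shell that the settings took effect; note that in local mode the driver and executors share a single JVM, so --driver-memory is what actually controls the heap:

// Inside spark-shell, where sc is predefined.
println(sc.getConf.getOption("spark.driver.memory"))
println(sc.getConf.getOption("spark.executor.memory"))
println(s"Max heap: ${Runtime.getRuntime.maxMemory / (1024 * 1024)} MB")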

bzz commented

Thanks for the advice!

I was able to run 0.1.6 on the 8 MB dataset with Spark 1.5.x and a config like

./bin/spark-shell --conf "spark.storage.memoryFraction=0.2" --executor-memory 2G --driver-memory 2G --packages amplab:succinct:0.1.6

The resulting in-memory size in my case is 71 MB:

MapPartitionsRDD    Memory Deserialized 1x Replicated   2   100%    71.5 MB

Is that an expected result?
I glanced through the paper but could not find the relevant section about the data structure size; I would appreciate it if you could point me to it.

Thanks again for the great project, looking forward to 0.1.7.

The memory footprint reported by Spark isn't very accurate, due to the way we store Succinct data structures and the way Spark measures the memory footprint. Spark 1.6.0 will have better support for computing the memory footprint of complex data structures, which we will leverage in the next release. For now, the best way to get the size of the data structures is to save the RDD to disk:

succinctRDD.save("...")

and look at the size of the serialized data structures.
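
For example, a minimal sketch (the output path is just a placeholder, and getContentSummary is the standard Hadoop FileSystem API rather than anything Succinct-specific):

import org.apache.hadoop.fs.{FileSystem, Path}

// Save the Succinct-encoded RDD, then sum the size of the files it wrote.
succinctRDD.save("/tmp/succinct-out")
val fs = FileSystem.get(sc.hadoopConfiguration)
val bytes = fs.getContentSummary(new Path("/tmp/succinct-out")).getLength
println(s"Serialized size: ${bytes / (1024.0 * 1024.0)} MB")

(Or simply check the size of the output directory with your usual filesystem tools.)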

The size of the data structures generally varies with the dataset (see the evaluation section in the paper), but in general we see 2-4x the input size for common datasets (so, for instance, an 8 MB input would typically produce roughly 16-32 MB of serialized structures).

Thanks again for the feedback!

bzz commented

Great, thanks for the explanation, definitely looking forward to the next release :)

The saved size on the filesystem for the 8 MB dataset is 53 MB for me.

If you want, I can close this now, or I can report the result of the experiment with the 2 GB dataset here a bit later.

That's very interesting... I'd be curious to know the results for the 2 GB dataset.

Is there any way I can take a look at the dataset?

bzz commented

Sorry for the delay. Here is the dataset I was using:

s3cmd get s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-2015-40/indexes/cdx-00000.gz
gzcat cdx-00000.gz | head -20000 > 8mb.txt
gzcat cdx-00000.gz > 2gb.txt

I ran into the same problem. The dataset I am using has a size of 2.3GB. I am running into OOM even with 24GB memory.

 bin/spark-shell --jars succinct-spark-0.1.7-SNAPSHOT-jar-with-dependencies.jar --driver-memory 24g --executor-memory 24g

I am writing the dataset out as a Succinct file:

import edu.berkeley.cs.succinct._
val data = sc.textFile("/path/to/input").map(_.getBytes)
data.saveAsSuccinctFile("/path/to/output")

Closing this issue since it's resolved with the latest version of Succinct. Feel free to re-open if the problem resurfaces.