JohnSnowLabs/spark-nlp-workshop

Fix: com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel. : java.lang.OutOfMemoryError: Java heap space while running BERT Embedding

veilupt opened this issue · 1 comment

Fix OOM on Java heap space while running BERT embeddings on PySpark

Questions:

  1. What is the recommended machine configuration for running BERT embeddings?
  2. How do I set up the Java heap space?

Steps to Reproduce

Code:

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.embeddings import *
from pyspark.ml import Pipeline  # required for the Pipeline used below
data = [
  ("New York is the greatest city in the world", 0),
  ("The beauty of Paris is vast", 1),
  ("The Centre Pompidou is in Paris", 1)
]
df = spark.createDataFrame(data, ["text", "label"])  # assumes an existing SparkSession named `spark` (creation not shown)
document_assembler = DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"])\
  .setOutputCol("token")
word_embeddings = BertEmbeddings.pretrained('bert_base_cased', 'en')\
  .setInputCols(["document", "token"])\
  .setOutputCol("embeddings")
bert_pipeline = Pipeline().setStages(
  [
    document_assembler,
    tokenizer,
    word_embeddings
  ]
)
df_bert = bert_pipeline.fit(df).transform(df)
display(df_bert)

Error Log

Approximate size to download 389.2 MB
Download done! Loading the resource.
2020-08-11 03:43:45.487324: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-08-11 03:43:45.493870: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300000000 Hz
2020-08-11 03:43:45.494190: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f3b55fa2960 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-08-11 03:43:45.494235: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
20/08/11 03:43:47 WARN MemoryStore: Not enough space to cache broadcast_5 in memory! (computed 417.4 MB so far)
20/08/11 03:43:47 WARN BlockManager: Persisting block broadcast_5 to disk instead.
20/08/11 03:47:03 WARN BlockManager: Block broadcast_5 could not be removed as it was not found on disk or in memory
[OK!]
Traceback (most recent call last):
File "", line 2, in
File "/home/pt4_gcp/spark-nlp/anaconda3/envs/sparknlp/lib/python3.6/site-packages/sparknlp/annotator.py", line 1846, in pretrained
return ResourceDownloader.downloadModel(BertEmbeddings, name, lang, remote_loc)
File "/home/pt4_gcp/spark-nlp/anaconda3/envs/sparknlp/lib/python3.6/site-packages/sparknlp/pretrained.py", line 41, in downloadModel
j_obj = _internal._DownloadModel(reader.name, name, language, remote_loc, j_dwn).apply()
File "/home/pt4_gcp/spark-nlp/anaconda3/envs/sparknlp/lib/python3.6/site-packages/sparknlp/internal.py", line 176, in init
super(_DownloadModel, self).init("com.johnsnowlabs.nlp.pretrained."+validator+".downloadModel", reader, name, language, remote_loc)
File "/home/pt4_gcp/spark-nlp/anaconda3/envs/sparknlp/lib/python3.6/site-packages/sparknlp/internal.py", line 129, in init
self._java_obj = self.new_java_obj(java_obj, *args)
File "/home/pt4_gcp/spark-nlp/anaconda3/envs/sparknlp/lib/python3.6/site-packages/sparknlp/internal.py", line 139, in new_java_obj
return self._new_java_obj(java_class, *args)
File "/home/pt4_gcp/spark-nlp/anaconda3/envs/sparknlp/lib/python3.6/site-packages/pyspark/ml/wrapper.py", line 67, in _new_java_obj
return java_obj(*java_args)
File "/home/pt4_gcp/spark-nlp/anaconda3/envs/sparknlp/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in call
File "/home/pt4_gcp/spark-nlp/anaconda3/envs/sparknlp/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/home/pt4_gcp/spark-nlp/anaconda3/envs/sparknlp/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel.
: java.lang.OutOfMemoryError: Java heap space
at java.nio.file.Files.read(Files.java:3099)
at java.nio.file.Files.readAllBytes(Files.java:3158)
at com.johnsnowlabs.ml.tensorflow.TensorflowWrapper.writeObject(TensorflowWrapper.scala:173)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1154)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
at org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:140)
at org.apache.spark.serializer.SerializerManager.dataSerializeStream(SerializerManager.scala:174)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1$$anonfun$apply$7.apply(BlockManager.scala:1174)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1$$anonfun$apply$7.apply(BlockManager.scala:1172)
at org.apache.spark.storage.DiskStore.put(DiskStore.scala:69)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1172)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:914)
at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:1481)
at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:123)
at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:88)
at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)

Environment

  • Spark-NLP version: 2.5.4
  • Apache Spark version: 2.4.4
  • Java version: openjdk version "1.8.0_265"
  • Operating System and version: Ubuntu 18.04 (Google VM)
  • VM Machine: 4 CPU, 15 GB RAM, 30 GB SSD
  • Deployment (Docker, Jupyter, Scala, pip, conda, etc.): Jupyter

What is the recommended machine configuration for running BERT embeddings?

It also depends on the size of your dataset, but contextualized embeddings such as BERT, ALBERT, XLNet, and USE perform best on a GPU with at least 16 GB of memory (and this requirement grows with the size of the dataset).
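If you're unsure how much driver memory your current session actually got, you can read the setting back (a minimal sketch, assuming spark is your active SparkSession; the "1g" fallback is Spark's own default when the option was never set):

# Prints the configured driver memory, or "1g" if it was never set explicitly
print(spark.sparkContext.getConf().get("spark.driver.memory", "1g"))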

How do I set up the Java heap space?

It depends on how you start the SparkSession. With the pyspark shell you pass spark.driver.memory (or --driver-memory) on the command line; with SparkSession.builder you set the same option via .config(); or you can use our sparknlp.start(). You didn't show that part of your code, i.e. how you start the SparkSession.
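For example, here is a minimal sketch of starting the session with more driver memory (the 12g value is an assumption sized for your 15 GB VM; adjust it to your machine). On the command line, the equivalent is passing --driver-memory 12g to pyspark or spark-submit:

from pyspark.sql import SparkSession

# spark.driver.memory only takes effect if it is set before the driver JVM
# starts, so configure it at session creation, not on a running session.
spark = SparkSession.builder \
    .appName("Spark NLP") \
    .master("local[*]") \
    .config("spark.driver.memory", "12g") \
    .config("spark.kryoserializer.buffer.max", "1000M") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.4") \
    .getOrCreate()

Alternatively, sparknlp.start() builds an equivalent session with Spark NLP's recommended defaults.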