tensorflow/ecosystem

spark-tensorflow-distributor: RAM overflow when running ResNet152

wobfan opened this issue · 1 comment

Hello!

I am using the spark-tensorflow-distributor package to run TensorFlow jobs on our 3-node Spark-on-YARN cluster. We also run a second cluster with the exact same specs, but using TensorFlow's native distribution instead of Spark-on-YARN. Both clusters feature 64-core CPUs with 188 GB of usable RAM, and 12 GPUs with 10 GB of RAM each.

Both clusters run Python 3.7.3 with tensorflow==2.4.1. The Spark cluster additionally has spark-tensorflow-distributor==0.1.0 installed.

To get some insight into the performance differences, we ran the ResNet152 network on the CIFAR-10 dataset on both of them, since both are included out of the box in the TF packages. I'll attach the code below.

Although we are using the exact same code on both clusters, with the same dataset and the same network, the Spark run eats WAY more RAM than the one distributed by TF itself: the Spark run starts out at around 137 GB of RAM and stays there most of the time (with peaks of 148 GB), while the TF-distributed run starts at about 17 GB and peaks at only 28 GB.

Everything else we compared (GPU memory and utilization, CPU usage, network I/O, etc.) looks roughly comparable, but the RAM usage differs drastically. With a bigger dataset, the Spark run even overflows the RAM at some point during training, causing an EOF exception, while the natively distributed one only uses about 50 GB of RAM and finishes smoothly.
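
(For scale, a back-of-the-envelope number, not a measurement: the up-front resize in the code below turns the 50,000 CIFAR-10 training images into 224x224x3 float32 tensors, i.e. roughly 50,000 * 224 * 224 * 3 * 4 bytes ≈ 30 GB that every process running the train function holds in memory.)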

This is the code I am using:

from spark_tensorflow_distributor import MirroredStrategyRunner

def train():
  import tensorflow as tf
  import numpy as np

  model = tf.keras.applications.ResNet152(
      include_top=True, weights='imagenet', input_tensor=None,
      input_shape=None, pooling=None, classes=1000)
  
  model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

  (train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.cifar10.load_data()
  # note: this resizes all 50,000 training images to 224x224 up front,
  # materializing them in memory as a single float32 tensor
  train_images = tf.image.resize(train_images, (224,224))

  dataset = tf.data.Dataset.from_tensor_slices((train_images, train_labels))
  dataset = dataset.shuffle(100)
  dataset = dataset.batch(512)

  model.fit(dataset, epochs=3)

MirroredStrategyRunner(num_slots=12, use_gpu=True).run(train)
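
For comparison, here is a sketch of the same training code with the resize moved into the tf.data pipeline, so batches are resized on the fly instead of materializing the whole resized array up front. This is untested on my side and the resize_batch helper name is just for illustration; everything else is kept from the code above.

from spark_tensorflow_distributor import MirroredStrategyRunner

def train():
  import tensorflow as tf

  model = tf.keras.applications.ResNet152(
      include_top=True, weights='imagenet', input_tensor=None,
      input_shape=None, pooling=None, classes=1000)
  model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

  (train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.cifar10.load_data()

  # resize each batch on the fly instead of resizing the whole array up front
  def resize_batch(images, labels):
    return tf.image.resize(images, (224, 224)), labels

  dataset = tf.data.Dataset.from_tensor_slices((train_images, train_labels))
  dataset = dataset.shuffle(100)
  dataset = dataset.batch(512)
  dataset = dataset.map(resize_batch, num_parallel_calls=tf.data.experimental.AUTOTUNE)
  dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)

  model.fit(dataset, epochs=3)

MirroredStrategyRunner(num_slots=12, use_gpu=True).run(train)

This way only the raw 32x32 uint8 images (about 150 MB) stay resident, and 512 images are resized at a time.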

Any clue on this strange behaviour, or what causes it?

Many thanks in advance! :-)

any update on this?