NVIDIA/tensorflow

With the same code and configuration, nvidia-tensorflow OOMs on an A30 when reuse=True, but stock TensorFlow 1.14 works fine on a T4.

BingWin789 opened this issue · 0 comments


System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 20.04.2 LTS
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: No
  • TensorFlow installed from (source or binary): pip
  • TensorFlow version (use command below): 1.15.5+nv22.8
  • Python version: 3.8.3
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version: V11.1.105/
  • GPU model and memory: A30


Describe the current behavior
I train my model on two datasets alternately; the two branches share common weights, like this:

import tensorflow as tf

with tf.variable_scope('mymodel', reuse=False):
    pred1 = model(dataset1)
with tf.variable_scope('mymodel', reuse=True):  # second branch reuses the weights
    pred2 = model(dataset2)

Describe the expected behavior
Training with a batch size of 12 should not run out of GPU memory: stock TensorFlow 1.14 trains fine on a T4, so NVIDIA TensorFlow should likewise train on an A30.

Code to reproduce the issue
The snippet under "Describe the current behavior" reproduces the problem with the same model and batch size on both cards.

Other info / logs