TGAN crashing at Epoch 1
nabarunaguha opened this issue · 3 comments
Hi,
I have been facing this issue for some time and have not been able to fix it.
- Python version: 3.7
- Operating System: Linux
- TensorFlow version: 1.14.0
- CUDA version: 10.0
Description
I keep getting this warning and then the execution crashes at Epoch 1.
What I Did
import tensorflow as tf
if tf.test.gpu_device_name():
    print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))
else:
    print("Please install GPU version of TF")
And it shows that TensorFlow is using the GPU fine:
2019-10-03 13:11:01.720688: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-10-03 13:11:01.768834: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2596780000 Hz
2019-10-03 13:11:01.771431: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x56157647a930 executing computations on platform Host. Devices:
2019-10-03 13:11:01.771460: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): <undefined>, <undefined>
2019-10-03 13:11:01.772877: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-10-03 13:11:04.249822: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:04:00.0
2019-10-03 13:11:04.250926: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:05:00.0
2019-10-03 13:11:04.251999: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 2 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:09:00.0
2019-10-03 13:11:04.253103: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 3 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:0a:00.0
2019-10-03 13:11:04.254193: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 4 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:85:00.0
2019-10-03 13:11:04.255276: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 5 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:86:00.0
2019-10-03 13:11:04.255566: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-10-03 13:11:04.256938: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-10-03 13:11:04.258142: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-10-03 13:11:04.258427: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-10-03 13:11:04.260019: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-10-03 13:11:04.261283: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-10-03 13:11:04.265096: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-10-03 13:11:04.277832: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1, 2, 3, 4, 5
2019-10-03 13:11:04.277873: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-10-03 13:11:04.284987: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-10-03 13:11:04.285005: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0 1 2 3 4 5
2019-10-03 13:11:04.285013: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N Y Y Y N N
2019-10-03 13:11:04.285018: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1: Y N Y Y N N
2019-10-03 13:11:04.285023: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 2: Y Y N Y N N
2019-10-03 13:11:04.285028: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 3: Y Y Y N N N
2019-10-03 13:11:04.285033: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 4: N N N N N Y
2019-10-03 13:11:04.285040: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 5: N N N N Y N
2019-10-03 13:11:04.293727: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:0 with 7647 MB memory) -> physical GPU (device: 0, name: Tesla M60, pci bus id: 0000:04:00.0, compute capability: 5.2)
2019-10-03 13:11:04.296282: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:1 with 7647 MB memory) -> physical GPU (device: 1, name: Tesla M60, pci bus id: 0000:05:00.0, compute capability: 5.2)
2019-10-03 13:11:04.298803: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:2 with 7647 MB memory) -> physical GPU (device: 2, name: Tesla M60, pci bus id: 0000:09:00.0, compute capability: 5.2)
2019-10-03 13:11:04.301310: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:3 with 7647 MB memory) -> physical GPU (device: 3, name: Tesla M60, pci bus id: 0000:0a:00.0, compute capability: 5.2)
2019-10-03 13:11:04.303979: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:4 with 7647 MB memory) -> physical GPU (device: 4, name: Tesla M60, pci bus id: 0000:85:00.0, compute capability: 5.2)
2019-10-03 13:11:04.306456: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:5 with 7647 MB memory) -> physical GPU (device: 5, name: Tesla M60, pci bus id: 0000:86:00.0, compute capability: 5.2)
2019-10-03 13:11:04.310204: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x56157ab4cab0 executing computations on platform CUDA. Devices:
2019-10-03 13:11:04.310223: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): Tesla M60, Compute Capability 5.2
2019-10-03 13:11:04.310229: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (1): Tesla M60, Compute Capability 5.2
2019-10-03 13:11:04.310234: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (2): Tesla M60, Compute Capability 5.2
2019-10-03 13:11:04.310239: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (3): Tesla M60, Compute Capability 5.2
2019-10-03 13:11:04.310244: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (4): Tesla M60, Compute Capability 5.2
2019-10-03 13:11:04.310249: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (5): Tesla M60, Compute Capability 5.2
2019-10-03 13:11:04.314251: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:04:00.0
2019-10-03 13:11:04.315484: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:05:00.0
2019-10-03 13:11:04.316567: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 2 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:09:00.0
2019-10-03 13:11:04.317632: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 3 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:0a:00.0
2019-10-03 13:11:04.318705: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 4 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:85:00.0
2019-10-03 13:11:04.319780: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 5 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:86:00.0
2019-10-03 13:11:04.319806: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-10-03 13:11:04.319820: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-10-03 13:11:04.319833: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-10-03 13:11:04.319846: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-10-03 13:11:04.319859: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-10-03 13:11:04.319872: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-10-03 13:11:04.319885: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-10-03 13:11:04.332488: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1, 2, 3, 4, 5
2019-10-03 13:11:04.332811: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-10-03 13:11:04.332823: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0 1 2 3 4 5
2019-10-03 13:11:04.332830: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N Y Y Y N N
2019-10-03 13:11:04.332835: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1: Y N Y Y N N
2019-10-03 13:11:04.332840: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 2: Y Y N Y N N
2019-10-03 13:11:04.332845: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 3: Y Y Y N N N
2019-10-03 13:11:04.332850: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 4: N N N N N Y
2019-10-03 13:11:04.332856: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 5: N N N N Y N
2019-10-03 13:11:04.340711: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:0 with 7647 MB memory) -> physical GPU (device: 0, name: Tesla M60, pci bus id: 0000:04:00.0, compute capability: 5.2)
2019-10-03 13:11:04.341796: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:1 with 7647 MB memory) -> physical GPU (device: 1, name: Tesla M60, pci bus id: 0000:05:00.0, compute capability: 5.2)
2019-10-03 13:11:04.342889: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:2 with 7647 MB memory) -> physical GPU (device: 2, name: Tesla M60, pci bus id: 0000:09:00.0, compute capability: 5.2)
2019-10-03 13:11:04.343989: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:3 with 7647 MB memory) -> physical GPU (device: 3, name: Tesla M60, pci bus id: 0000:0a:00.0, compute capability: 5.2)
2019-10-03 13:11:04.345103: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:4 with 7647 MB memory) -> physical GPU (device: 4, name: Tesla M60, pci bus id: 0000:85:00.0, compute capability: 5.2)
2019-10-03 13:11:04.346189: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:5 with 7647 MB memory) -> physical GPU (device: 5, name: Tesla M60, pci bus id: 0000:86:00.0, compute capability: 5.2)
Default GPU Device: /device:GPU:0
I set the gpu argument of TGANModel to '/GPU:0' and also tried '/device:GPU:0'.
Either way, I get the same warning, and the crash happens while running the first epoch.
I also uninstalled and reinstalled tensorflow-gpu and TGAN, just to check, but it made no difference.
Regards,
Nabaruna
Would you mind sharing a short code snippet that shows the exact arguments that you use when creating the TGAN instance and calling the fit and sample methods?
We will then try to reproduce the error to be able to assist you better.
Also, regarding the GPU usage, please check this other issue: #34
So, basically, the gpu argument is now being ignored, and all that matters with regard to GPU usage is whether you have installed tensorflow or tensorflow-gpu.
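As a rough illustration of the point above, the sketch below reports which of the two TensorFlow distributions is installed and shows one common way to limit which GPUs a process can see. The helper names installed_tf_packages and restrict_visible_gpus are hypothetical and not part of TGAN or TensorFlow. CUDA_VISIBLE_DEVICES is the standard CUDA environment variable and only takes effect if it is set before tensorflow is imported. On Python 3.7 the import falls back to the importlib_metadata backport, which is assumed to be installed.

```python
import os

try:
    from importlib import metadata  # standard library on Python 3.8+
except ImportError:
    # Assumption: on Python 3.7 the importlib_metadata backport is installed.
    import importlib_metadata as metadata


def installed_tf_packages():
    """Return {distribution name: version or None} for both TensorFlow packages."""
    found = {}
    for name in ("tensorflow", "tensorflow-gpu"):
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def restrict_visible_gpus(gpu_ids):
    """Expose only the given GPU indices to CUDA.

    Hypothetical helper: it must be called before tensorflow is imported,
    otherwise the setting has no effect on the already-initialized runtime.
    """
    value = ",".join(str(i) for i in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


if __name__ == "__main__":
    print(installed_tf_packages())
    print(restrict_visible_gpus([0]))
```

If both distributions show a version, the two packages may be shadowing each other, so it can be worth uninstalling both and reinstalling only the one you intend to use.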
Hi @csala,
Sure, here are my arguments:
from tgan.model import TGANModel
tgan = TGANModel(
    continuous_columns, output='output', gpu='/device:GPU:0',
    max_epoch=5, steps_per_epoch=150, save_checkpoints=False,
    restore_session=False, batch_size=50, z_dim=50, noise=0.2,
    l2norm=0.00001, learning_rate=0.001, num_gen_rnn=100,
    num_gen_feature=100, num_dis_layers=1, num_dis_hidden=100,
    optimizer='AdamOptimizer')
tgan.fit(data)
model_path = '/home/naguha/ModelSave/ModelCheck.pkl'
num_samples = 20868
samples = tgan.sample(num_samples)
export_csv = samples.to_csv(r'/home/naguha/Samples_TGAN.csv', index=None, header=True)
And I installed tensorflow-gpu==1.14
Hello, any news on this issue?