TensorFlow and PyTorch both report the GPU and CUDA as available, but the code runs on the CPU only.

Question

Closed this issue 2 years ago · 1 comments

I. Device: RTX 3090, driver version: 515.86.01, CUDA version: 11.7, Python version: 3.8.16.
For TensorFlow, I installed nvidia-tensorflow==1.15.5+nv22.05,
along with torch==1.13.0, torchvision==0.14.0, and torchaudio==0.13.0.

II. When I run print(tf.test.is_gpu_available()), it returns True,
and print(torch.cuda.is_available()) also returns True.

III. However, when I run python TranSG.py --dataset KGBD --probe probe in VS Code,
it only runs on the CPU.
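
One way to confirm whether the run in step III really stays on the CPU is to sample GPU utilization while the script is training. The sketch below is only an assumption about how that could be checked, not something taken from TranSG: it shells out to nvidia-smi with its --query-gpu and --format options, and the helper names parse_gpu_stats and sample_gpu are made up for illustration.

```python
import shutil
import subprocess

# Fields requested from nvidia-smi: GPU utilization (%) and memory used (MiB).
QUERY = "utilization.gpu,memory.used"


def parse_gpu_stats(csv_text):
    """Parse 'util, mem' CSV rows (one row per GPU) into a list of dicts.

    Assumes both fields are plain integers; some devices may report
    non-numeric values, which this sketch does not handle.
    """
    stats = []
    for line in csv_text.strip().splitlines():
        util, mem = (field.strip() for field in line.split(","))
        stats.append({"util_percent": int(util), "memory_used_mib": int(mem)})
    return stats


def sample_gpu():
    """Return current per-GPU stats, or None if nvidia-smi is not on PATH."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_stats(result.stdout)
```

Calling sample_gpu() from a second terminal while the batch losses are printing gives a rough picture: utilization near 0% with little memory in use would be consistent with CPU-only execution, while a busy GPU would suggest the bottleneck is somewhere else.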

My output is below. Thank you all for reading and for your help!

python TranSG.py --dataset KGBD --probe probe
2023-03-31 14:48:35.072934: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
WARNING:root:Limited tf.compat.v2.summary API due to missing TensorBoard installation.
WARNING:tensorflow:From TranSG.py:19: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING:tensorflow:From TranSG.py:19: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

----- Model hyperparams -----
f (sequence length): 6
H (embedding size): 128
SGT Layers: 2
FR heads: 8
alpha: 0.5
beta: 0.5
lambda: 0.5
a (structure): 10
b (trajectory): 2
t1: 0.07
t2: 14
batch_size: 256
lr: 0.00035
patience: 60
Mode: Train
----- Dataset Information -----
Dataset: KGBD
Probe: probe
WARNING:tensorflow:From TranSG.py:241: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From TranSG.py:241: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From TranSG.py:254: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

WARNING:tensorflow:From TranSG.py:254: The name tf.AUTO_REUSE is deprecated. Please use tf.compat.v1.AUTO_REUSE instead.

WARNING:tensorflow:From TranSG.py:261: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.

WARNING:tensorflow:From TranSG.py:269: The name tf.random_normal is deprecated. Please use tf.random.normal instead.

concat_features (Spatial) Tensor("TranSG/TranSG/concat_13:0", shape=(256, 6, 20, 128), dtype=float32)
WARNING:tensorflow:From TranSG.py:324: The name tf.losses.absolute_difference is deprecated. Please use tf.compat.v1.losses.absolute_difference instead.

WARNING:tensorflow:From TranSG.py:324: The name tf.losses.absolute_difference is deprecated. Please use tf.compat.v1.losses.absolute_difference instead.

WARNING:tensorflow:From TranSG.py:416: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

WARNING:tensorflow:From TranSG.py:417: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.

WARNING:tensorflow:From TranSG.py:417: The name tf.local_variables_initializer is deprecated. Please use tf.compat.v1.local_variables_initializer instead.

WARNING:tensorflow:From TranSG.py:419: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2023-03-31 14:48:38.133041: I tensorflow/core/platform/profile_utils/cpu_utils.cc:109] CPU Frequency: 3187200000 Hz
2023-03-31 14:48:38.133485: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1dcd43d0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2023-03-31 14:48:38.133496: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2023-03-31 14:48:38.133969: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2023-03-31 14:48:38.171983: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1082] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-31 14:48:38.172172: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1dd4e680 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-03-31 14:48:38.172187: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): NVIDIA GeForce RTX 3090, Compute Capability 8.6
2023-03-31 14:48:38.172296: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1082] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-31 14:48:38.172352: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1666] Found device 0 with properties:
name: NVIDIA GeForce RTX 3090 major: 8 minor: 6 memoryClockRate(GHz): 1.755
pciBusID: 0000:01:00.0
2023-03-31 14:48:38.172363: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2023-03-31 14:48:38.172390: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2023-03-31 14:48:38.185206: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2023-03-31 14:48:38.185354: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2023-03-31 14:48:38.185606: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.11
2023-03-31 14:48:38.185940: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2023-03-31 14:48:38.185963: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2023-03-31 14:48:38.185999: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1082] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-31 14:48:38.186064: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1082] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-31 14:48:38.186099: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1794] Adding visible gpu devices: 0
2023-03-31 14:48:38.186112: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2023-03-31 14:48:38.287833: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1206] Device interconnect StreamExecutor with strength 1 edge matrix:
2023-03-31 14:48:38.287856: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] 0
2023-03-31 14:48:38.287860: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1225] 0: N
2023-03-31 14:48:38.288022: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1082] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-31 14:48:38.288108: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1082] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-31 14:48:38.288158: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1351] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 19367 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:01:00.0, compute capability: 8.6)
2023-03-31 14:48:40.127634: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2023-03-31 14:48:40.467065: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
TranSG.py:610: UserWarning: This overload of addmm_ is deprecated:
addmm_(Number beta, Number alpha, Tensor mat1, Tensor mat2)
Consider using one of the following signatures instead:
addmm_(Tensor mat1, Tensor mat2, *, Number beta, Number alpha) (Triggered internally at ../torch/csrc/utils/python_arg_parser.cpp:1420.)
dist_m.addmm_(1, -2, a, b.t())
[0] Batch num: 0 | STPR Loss: 0.10349 | GPC Loss: 200.80025 |
[0] Batch num: 20 | STPR Loss: 0.09568 | GPC Loss: 83.03362 |
[0] Batch num: 40 | STPR Loss: 0.09093 | GPC Loss: 55.42593 |
[0] Batch num: 60 | STPR Loss: 0.08801 | GPC Loss: 41.22432 |
[0] Batch num: 80 | STPR Loss: 0.08391 | GPC Loss: 34.67684 |
[0] Batch num: 100 | STPR Loss: 0.08058 | GPC Loss: 29.46987 |
[0] Batch num: 120 | STPR Loss: 0.07752 | GPC Loss: 26.16352 |
ReID_Models/KGBD/probe_f_6_layers_2_heads_8_alpha_0.5_beta_0.5_lambda_0.5/best.ckpt
[Probe Evaluation] KGBD - probe | Top-1: 0.3280 (0.3280) | Top-5: 0.5291 (0.5291) | Top-10: 0.6189 (0.6189) | mAP: 0.0492 (0.0492) |
0.3280-0.5291-0.6189-0.0492
[1] Batch num: 0 | STPR Loss: 0.07678 | GPC Loss: 119.31063 |
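
The log above can also be checked mechanically for signs that TensorFlow registered the GPU. The following is a minimal sketch, assuming the console output was saved as text; the function name created_gpu_devices is hypothetical, and the pattern is modeled only on the "Created TensorFlow device" line shown in this log, so other TensorFlow versions may word it differently.

```python
import re

# Modeled on the "Created TensorFlow device (...)" line in the log above.
# This wording is an assumption drawn from this one log, not a stable format.
DEVICE_RE = re.compile(
    r"Created TensorFlow device \((?P<device>\S*device:GPU:\d+) with "
    r"(?P<memory_mb>\d+) MB memory\)"
)


def created_gpu_devices(log_text):
    """Return (device_name, memory_mb) for each GPU device TensorFlow created."""
    return [
        (match.group("device"), int(match.group("memory_mb")))
        for match in DEVICE_RE.finditer(log_text)
    ]
```

Applied to the output above, this finds one device, GPU:0 with 19367 MB, which suggests TensorFlow did create a GPU device in this run. It says nothing about where the PyTorch call reported at TranSG.py:610 executes.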

Answer 1 · 2023-06-29T13:27:33.000Z

I also want to know.