Current bug with GPU binding.
JdotX commented
Bug description
For the RDMA version of TensorFlow, the current problem appears when the parameter server is bound to CPUs instead of GPUs. If we run the program like:
CUDA_VISIBLE_DEVICES="" python AutoencoderRunner.py --job_name="ps" --task_index=0 >> $dir/output-ps1 &
and start the workers with the correct options, the parameter servers (as far as I tested, a random one of them) report:
Check failed: (buffer_size == size_ && rm.data_type_ != DT_STRING) || (buffer_size <= size_ && rm.data_type_ == DT_STRING) tensor and buffer size do not agree! buffer_size = 709 requested tensor size = 593Tensor<type: int64 shape: [0,1] values: >
The complete log is attached below.
However, if we allow the parameter servers to use GPUs, the bug disappears and the program runs normally. The same binding, with CPU-bound parameter servers and a group of GPU-bound workers, was tested on official TF 1.0 and worked.
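For context, here is a minimal sketch of the kind of distributed setup the launch command above implies. The actual AutoencoderRunner.py is not shown in this issue, so the cluster layout below is taken from the attached log, and the protocol string is an assumption; the RDMA build in question may register a different identifier.

```python
# Minimal sketch of the assumed distributed setup; the real AutoencoderRunner.py may differ.
import tensorflow as tf

tf.app.flags.DEFINE_string("job_name", "", "Either 'ps' or 'worker'")
tf.app.flags.DEFINE_integer("task_index", 0, "Index of the task within the job")
FLAGS = tf.app.flags.FLAGS

# Cluster layout taken from the log below: one ps task and two worker tasks.
cluster = tf.train.ClusterSpec({
    "ps": ["localhost:12300"],
    "worker": ["10.40.199.203:12200", "10.40.199.203:12201"],
})

# The protocol string is an assumption; the RDMA build may use a name other
# than "grpc+verbs".
server = tf.train.Server(cluster,
                         job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index,
                         protocol="grpc+verbs")

if FLAGS.job_name == "ps":
    # With CUDA_VISIBLE_DEVICES="" this process sees no GPUs, so its
    # variables land on CPU; this is the configuration that triggers the
    # crash reported above.
    server.join()
else:
    # Workers place variables on the ps job and compute ops on their own device.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % FLAGS.task_index,
            cluster=cluster)):
        pass  # build the autoencoder model here
```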
Output when the bug appears
This is the output on the parameter servers. The workers' output looks normal, except that it does not show whether they actually started working.
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
E tensorflow/stream_executor/cuda/cuda_driver.cc:509] failed call to cuInit: CUDA_ERROR_NO_DEVICE
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: ip-192-168-2-203
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: ip-192-168-2-203
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: 375.26.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:363] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 375.26 Thu Dec 8 18:36:43 PST 2016
GCC version: gcc version 4.9.2 (Debian 4.9.2-10)
"""
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 375.26.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:300] kernel version seems to match DSO: 375.26.0
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job ps -> {0 -> localhost:12300}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job worker -> {0 -> 10.40.199.203:12200, 1 -> 10.40.199.203:12201}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:241] Started server with target: grpc://localhost:12300
I tensorflow/core/distributed_runtime/rdma/rdma_mgr.cc:38] connecting to remote node /job:worker/replica:0/task:1
I tensorflow/core/distributed_runtime/rdma/rdma.cc:515] channel already connected
I tensorflow/core/distributed_runtime/rdma/rdma_mgr.cc:38] connecting to remote node /job:worker/replica:0/task:0
I tensorflow/core/distributed_runtime/rdma/rdma.cc:515] channel already connected
F tensorflow/core/distributed_runtime/rdma/rdma.cc:765] Check failed: (buffer_size == size_ && rm.data_type_ != DT_STRING) || (buffer_size <= size_ && rm.data_type_ == DT_STRING) tensor and buffer size do not agree! buffer_size = 709 requested tensor size = 593Tensor<type: int64 shape: [0,1] values: >
Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
byronyi commented
I am on this.
byronyi commented
The current build does not enable RDMA even if it is set in ./configure. Please check again.
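One hedged way to double-check whether the running build registered an RDMA server protocol at all: try to construct a single-task server with the assumed protocol string ("grpc+verbs" here; the build under discussion may use a different name) and see whether construction succeeds.

```python
# Hedged check for RDMA support in the installed build; the protocol string
# "grpc+verbs" is an assumption.
import tensorflow as tf

cluster = tf.train.ClusterSpec({"local": ["localhost:0"]})
try:
    tf.train.Server(cluster, job_name="local", task_index=0,
                    protocol="grpc+verbs")  # assumed protocol name
    print("RDMA protocol appears to be registered in this build.")
except Exception as e:  # exact exception type varies across TF versions
    print("RDMA protocol not available: %s" % e)
```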