sampepose/flownet2-tf

Different error when running test: Undefined symbol: _ZN10tensorflow3PadERKN5Eigen9GpuDeviceEPKfiiiiiiPf

fperezgamonal opened this issue · 10 comments

Hello all,

After successfully compiling the code by addressing some problems through issues #76, #65, and #28, when I run
python -m src.flownet2.test --input_a data/samples/0img0.ppm --input_b data/samples/0img1.ppm --out ./, the following error is reported:

  File "/soft/easybuild/debian/8.8/Broadwell/software/Tensorflow-gpu/1.10.0-foss-2017a-Python-3.6.4/lib/python3.6/site-packages/tensorflow/python/framework/load_library.py", line 56, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: /homedtic/fperez/Documents/Papers_code/TFM/state_of_the_art/DL/FlowNet2/flownet2-tf/src/./ops/build/correlation.so: undefined symbol: _ZN10tensorflow3PadERKN5Eigen9GpuDeviceEPKfiiiiiiPf

I have tried the proposed solutions for undefined-symbol-related errors (issues #8, #41 and #87) without success. I noticed that the undefined symbol is different from the one in any other post on this repository, so I checked the tensorflow repository for similar errors and only found this issue, which suggests recompiling without GPU support and adding the "-c" flag, but I do not know how that applies to my case (and compiling only for CPU would make training and inference very slow...).
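
For reference, the mangled name itself tells you which TensorFlow function the op library expects, and you can inspect the compiled .so directly (just a diagnostic sketch, with paths as in this repo):

# Demangle the missing symbol to see which function correlation.so expects from TensorFlow
c++filt _ZN10tensorflow3PadERKN5Eigen9GpuDeviceEPKfiiiiiiPf
# -> tensorflow::Pad(Eigen::GpuDevice const&, float const*, int, int, int, int, int, int, float*)

# List the undefined, demangled symbols of the compiled op to see what else is unresolved
nm -C -u src/ops/build/correlation.so | grep -i 'tensorflow::pad'

In this repo the GPU-side Pad helper is built from src/ops/correlation/pad.cu.cc, whose body is presumably guarded by GOOGLE_CUDA, so compiling without -DGOOGLE_CUDA=1 would leave that symbol undefined.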

Makefile

My makefile looks as follows:

# Makefile

$(info    CUDA_HOME is $(CUDA_HOME))
TF_INC = `python -c "import tensorflow; print(tensorflow.sysconfig.get_include())"`
TF_LIB = `python -c "import tensorflow; print(tensorflow.sysconfig.get_lib())"`
#ifndef CUDA_HOME
#    CUDA_HOME := /usr/local/cuda
#endif
#CUDA_HOME_C=${CUDA_HOME}

CC        = gcc -O2 -pthread
CXX       = g++
GPUCC     = nvcc --expt-relaxed-constexpr
CFLAGS    = -std=c++11 -I$(TF_INC) -I"$(CUDA_HOME)/include" -DNDEBUG -D_GLIBCXX_USE_CXX11_ABI=0
GPUCFLAGS = -c
LFLAGS    = -pthread -shared -fPIC
GPULFLAGS = -x cu -Xcompiler -fPIC
CGPUFLAGS = -L$(CUDA_HOME)/lib -L$(CUDA_HOME)/lib64 -lcudart -L$(TF_LIB) -ltensorflow_framework

OUT_DIR   = src/ops/build
PREPROCESSING_SRC = "src/ops/preprocessing/preprocessing.cc" "src/ops/preprocessing/kernels/flow_augmentation.cc" "src/ops/preprocessing/kernels/augmentation_base.cc" "src/ops/preprocessing/kernels/data_augmentation.cc"
GPU_SRC_DATA_AUG        = src/ops/preprocessing/kernels/data_augmentation.cu.cc
GPU_SRC_FLOW            = src/ops/preprocessing/kernels/flow_augmentation_gpu.cu.cc
GPU_PROD_DATA_AUG       = $(OUT_DIR)/data_augmentation.o
GPU_PROD_FLOW           = $(OUT_DIR)/flow_augmentation_gpu.o
PREPROCESSING_PROD      = $(OUT_DIR)/preprocessing.so

DOWNSAMPLE_SRC = "src/ops/downsample/downsample_kernel.cc" "src/ops/downsample/downsample_op.cc"
GPU_SRC_DOWNSAMPLE  = src/ops/downsample/downsample_kernel_gpu.cu.cc
GPU_PROD_DOWNSAMPLE = $(OUT_DIR)/downsample_kernel_gpu.o
DOWNSAMPLE_PROD         = $(OUT_DIR)/downsample.so

CORRELATION_SRC = "src/ops/correlation/correlation_kernel.cc" "src/ops/correlation/correlation_grad_kernel.cc" "src/ops/correlation/correlation_op.cc"
GPU_SRC_CORRELATION  = src/ops/correlation/correlation_kernel.cu.cc
GPU_SRC_CORRELATION_GRAD  = src/ops/correlation/correlation_grad_kernel.cu.cc
GPU_SRC_PAD = src/ops/correlation/pad.cu.cc
GPU_PROD_CORRELATION = $(OUT_DIR)/correlation_kernel_gpu.o
GPU_PROD_CORRELATION_GRAD = $(OUT_DIR)/correlation_grad_kernel_gpu.o
GPU_PROD_PAD = $(OUT_DIR)/correlation_pad_gpu.o
CORRELATION_PROD        = $(OUT_DIR)/correlation.so

FLOWWARP_SRC = "src/ops/flow_warp/flow_warp_op.cc" "src/ops/flow_warp/flow_warp.cc" "src/ops/flow_warp/flow_warp_grad.cc"
GPU_SRC_FLOWWARP = "src/ops/flow_warp/flow_warp.cu.cc"
GPU_SRC_FLOWWARP_GRAD = "src/ops/flow_warp/flow_warp_grad.cu.cc"
GPU_PROD_FLOWWARP = "$(OUT_DIR)/flow_warp_gpu.o"
GPU_PROD_FLOWWARP_GRAD = "$(OUT_DIR)/flow_warp_grad_gpu.o"
FLOWWARP_PROD = "$(OUT_DIR)/flow_warp.so"

ifeq ($(OS),Windows_NT)
    detected_OS := Windows
else
    detected_OS := $(shell sh -c 'uname -s 2>/dev/null || echo not')
endif
ifeq ($(detected_OS),Darwin)  # Mac OS X
        CGPUFLAGS += -undefined dynamic_lookup
endif
ifeq ($(detected_OS),Linux)
        CFLAGS += -D_MWAITXINTRIN_H_INCLUDED -D_FORCE_INLINES -D__STRICT_ANSI__ -D_GLIBCXX_USE_CXX11_ABI=0
endif

all: preprocessing downsample correlation flowwarp

preprocessing:
        $(GPUCC) -g $(CFLAGS) $(GPUCFLAGS) $(GPU_SRC_DATA_AUG) $(GPULFLAGS) $(GPUDEF) -o $(GPU_PROD_DATA_AUG)
        $(GPUCC) -g $(CFLAGS) $(GPUCFLAGS) $(GPU_SRC_FLOW) $(GPULFLAGS) $(GPUDEF) -o $(GPU_PROD_FLOW)
        $(CXX) -g $(CFLAGS)  $(PREPROCESSING_SRC) $(GPU_PROD_DATA_AUG) $(GPU_PROD_FLOW) $(LFLAGS) $(CGPUFLAGS) -o $(PREPROCESSING_PROD)

downsample:
        $(GPUCC) -g $(CFLAGS) $(GPUCFLAGS) $(GPU_SRC_DOWNSAMPLE) $(GPULFLAGS) $(GPUDEF) -o $(GPU_PROD_DOWNSAMPLE)
        $(CXX) -g $(CFLAGS)  $(DOWNSAMPLE_SRC) $(GPU_PROD_DOWNSAMPLE) $(LFLAGS) $(CGPUFLAGS) -o $(DOWNSAMPLE_PROD)

correlation:
        $(GPUCC) -g $(CFLAGS) $(GPUCFLAGS) $(GPU_SRC_CORRELATION) $(GPULFLAGS) $(GPUDEF) -o $(GPU_PROD_CORRELATION)
        $(GPUCC) -g $(CFLAGS) $(GPUCFLAGS) $(GPU_SRC_CORRELATION_GRAD) $(GPULFLAGS) $(GPUDEF) -o $(GPU_PROD_CORRELATION_GRAD)
        $(GPUCC) -g $(CFLAGS) $(GPUCFLAGS) $(GPU_SRC_PAD) $(GPULFLAGS) $(GPUDEF) -o $(GPU_PROD_PAD)
        $(CXX) -g $(CFLAGS)  $(CORRELATION_SRC) $(GPU_PROD_CORRELATION) $(GPU_PROD_CORRELATION_GRAD) $(GPU_PROD_PAD) $(LFLAGS) $(CGPUFLAGS) -o $(CORRELATION_PROD)

flowwarp:
        $(GPUCC) -g $(CFLAGS) $(GPUCFLAGS) $(GPU_SRC_FLOWWARP) $(GPULFLAGS) $(GPUDEF) -o $(GPU_PROD_FLOWWARP)
        $(GPUCC) -g $(CFLAGS) $(GPUCFLAGS) $(GPU_SRC_FLOWWARP_GRAD) $(GPULFLAGS) $(GPUDEF) -o $(GPU_PROD_FLOWWARP_GRAD)
        $(CXX) -g $(CFLAGS)  $(FLOWWARP_SRC) $(GPU_PROD_FLOWWARP) $(GPU_PROD_FLOWWARP_GRAD) $(LFLAGS) $(CGPUFLAGS) -o $(FLOWWARP_PROD)

clean:
        rm -f $(PREPROCESSING_PROD) $(GPU_PROD_FLOW) $(GPU_PROD_DATA_AUG) $(DOWNSAMPLE_PROD) $(GPU_PROD_DOWNSAMPLE)

Environment

I am working remotely on a cluster (SLURM-based, loading modules instead of installing packages, etc.) with the following characteristics:

  • OS: Debian GNU/Linux 8 (jessie)
    And I have loaded the following versions of the required libraries (numpy, scipy, etc. are included with the Python module):
  • Tensorflow GPU: 1.10.0
  • Python 3.6.4
  • Tkinter 3.6.4
  • pypng 0.0.19
  • GCC 6.3.0-2.27

I have tried other versions of tensorflow-gpu (1.5.0 and 1.12.0) with the same results.
One thing I have noticed is that on the cluster there is no lib folder inside CUDA_HOME, only lib64.

As mentioned above, I have tried a combination of the proposed solutions without success and am now running out of ideas, although I suspect it is related to working on a cluster and loading modules (I had to remove -DGOOGLE_CUDA=1 in order to compile successfully, as suggested by the cluster's technical staff).

Additionally, if I remove -DNDEBUG -D_GLIBCXX_USE_CXX11_ABI=0 from the flags, the same error arises after a successful compilation.

Thanks for your time! Any help would be greatly appreciated. I'll keep this post updated if I try anything different.

Cheers,
Ferran.

UPDATE: since I did not find many reports of this error mentioning "GpuDevice", I am currently trying to include -DGOOGLE_CUDA=1 again, because I think the former error is related to the GPU code not being compiled. Now I get "cuda/include/cuda.h: No such file or directory" as in issue #45, but the resolution there does not fix my problem. I will keep investigating, since solutions like editing the header that produces the error are not possible for me: I am working on a cluster without write access to those files.
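
For reference, the cuda/include/cuda.h error appears because the TensorFlow 1.x headers include CUDA as "cuda/include/cuda.h". One workaround that needs no write access to the TensorFlow installation is to recreate that layout locally and point the compiler at it (just a sketch; the local_cuda directory name is an example, not part of the repo):

# Recreate the include layout cuda_device_functions.h expects, without touching the TF install
mkdir -p local_cuda/cuda
ln -s "$CUDA_HOME/include" local_cuda/cuda/include
# Then add the parent directory to the include path in the Makefile, for example:
# CFLAGS += -I"$(CURDIR)/local_cuda"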

Final update: after fighting with it for quite a few days, and with help from my university's IT staff, I got it solved. A soft link for cuda.h was the solution (keeping the Makefile as shown above, if I am not mistaken).

I will close this issue now; feel free to reopen it if you encounter a similar problem and I'll try to help you as much as possible.

Cheers.

Hello sir, what do you mean by "A soft link for cuda.h was the solution"?

How did you do it?

Hello @seni04, the technical staff told me they had fixed it by creating a soft link between the actual CUDA version installed on the machine and the "standard" path where it is normally installed.

I assume they did something like:

ln -s /usr/bin/cuda-10.0 /usr/bin/cuda
But using the actual path where you installed CUDA as the first argument.
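
On a typical standalone install that would look something like this (just a sketch with assumed paths; adjust the versioned directory to wherever CUDA actually lives on your machine):

# Assuming CUDA 10.0 was installed under /usr/local/cuda-10.0
sudo ln -s /usr/local/cuda-10.0 /usr/local/cuda
ls -l /usr/local/cuda                # should now point at the versioned install
ls /usr/local/cuda/include/cuda.h    # the header the build could not find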
I'm sorry I cannot be more precise, but I've just checked my IT tickets and found no extra details.
I hope this helps you,
PS: here is the actual (final) Makefile I used, in any case (rename it back to Makefile):
Makefile.txt

Cheers,

Ferran.

nvcc -c --expt-relaxed-constexpr -g -std=c++11 -DNDEBUG -I/usr/local/lib/python2.7/dist-packages/tensorflow/include -I"/usr/local/cuda-9.0/include" -DGOOGLE_CUDA=1 -D_MWAITXINTRIN_H_INCLUDED -D_FORCE_INLINES -D__STRICT_ANSI__ -D_GLIBCXX_USE_CXX11_ABI=0 src/ops/preprocessing/kernels/data_augmentation.cu.cc -x cu -Xcompiler -fPIC -o src/ops/build/data_augmentation.o
In file included from /usr/local/lib/python2.7/dist-packages/tensorflow/include/tensorflow/core/util/cuda_kernel_helper.h:21:0,
from src/ops/preprocessing/kernels/data_augmentation.cu.cc:7:
/usr/local/lib/python2.7/dist-packages/tensorflow/include/tensorflow/core/util/cuda_device_functions.h:32:31: fatal error: cuda/include/cuda.h: No such file or directory
compilation terminated.
Makefile:68: recipe for target 'preprocessing' failed
make: *** [preprocessing] Error 1

I am still getting this error, and I am already using the same Makefile as yours.

Hello again,

I am very sorry to see you are still facing the same issues. I totally understand your frustration, since I was completely unable to compile the ops successfully on another computer to run more experiments in parallel (even though I had the same configuration and Makefile!).

The only thing I can think of is searching for this error, since it is very common, and trying some of the proposed solutions to see if any of them work.
By the way, if you happen to solve this issue and then run into a missing library (libcupti), here is how I solved that: I added the path to the library to the LD_LIBRARY_PATH environment variable, as follows:
export LD_LIBRARY_PATH=/soft/easybuild/debian/8.8/Broadwell/software/CUDA/9.0.176/extras/CUPTI/lib64:$LD_LIBRARY_PATH
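
To double-check that the loader can actually resolve it afterwards, something like this works (a sketch using the same cluster path as above; adjust the version suffix to whatever ls shows):

# List the CUPTI libraries that the export above makes visible
ls /soft/easybuild/debian/8.8/Broadwell/software/CUDA/9.0.176/extras/CUPTI/lib64
# Ask the dynamic loader to open the library by name; no output means it was found
python -c "import ctypes; ctypes.CDLL('libcupti.so.9.0')"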

If I can find any more information on how to solve your error, I will post it here.
I wish you luck!

PS: I'll leave this open so more people can see this issue and hopefully provide a solution.
Cheers,
Ferran.

Hi.

I am facing a similar issue as well. I am trying to run a pre-trained styleGAN model (https://github.com/NVlabs/stylegan2) on my JupyterLab in a Tensorflow 1.14 GPU environment.

So, when I run the command python run_generator.py generate-images --network=gdrive:networks/stylegan2-ffhq-config-f.pkl --seeds=6600-6625 --truncation-psi=0.5 as given in the link, I get the following error:

tensorflow.python.framework.errors_impl.NotFoundError: /trainman-mount/trainman-storage-d2b580e4-067b-44d3-9be3-be48cc5f0d71/stylegan2/dnnlib/tflib/_cudacache/fused_bias_act_1ac15fee5b354fc0d3aa1e7f98502e64.so: undefined symbol: _ZN10tensorflow12OpDefBuilder6OutputESs

I have no idea what this _ZN10tensorflow12OpDefBuilder6OutputESs means, but it seems similar to the one raised in this thread. I also tried finding solutions for this error, but all of them revolve around modifying some Makefile, and there doesn't seem to be any Makefile involved in my problem since I am just running Python code.

Any help will be much appreciated :)

I am facing the same issue while trying to get this to work on my university's cluster. I was able to get it working fine on my Windows machine, and my group has been able to get it to work on an EC2 instance, so I have no idea what the issue is exactly. From what I can tell, all the correct dependencies are installed... @Vedant2311 did you come up with a solution?

In file stylegan2/dnnlib/tflib/custom_ops.py, line 127:
change from
compile_opts += ' --compiler-options \'-fPIC -D_GLIBCXX_USE_CXX11_ABI=0\''
to
compile_opts += ' --compiler-options \'-fPIC -D_GLIBCXX_USE_CXX11_ABI=1\''
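
This works because the trailing Ss in the missing symbol stands for the old (pre-C++11) std::string, i.e. the custom op was compiled with _GLIBCXX_USE_CXX11_ABI=0 while the installed TensorFlow exports the C++11-ABI version; the two must match. To see which value your TensorFlow wheel expects (a quick sketch):

# Prints the compile flags the installed TensorFlow was built with,
# including the matching -D_GLIBCXX_USE_CXX11_ABI=... value
python -c "import tensorflow as tf; print(tf.sysconfig.get_compile_flags())"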

Thanks ahmedshingaly, this solved a similar issue for me.

Also solved the issue for me. Would've been impossible to debug; thank you!