luanfujun/deep-photo-styletransfer

OSX "Segmentation fault: 11" when loading libcuda_utils.so in torch or executing the two lua files

Closed this issue · 16 comments

Hello, it's me again :-)

Unfortunately the compiled libcuda_utils.so is still not working as intended.

When I import it via torch:

th

  ______             __   |  Torch7
 /_  __/__  ________/ /   |  Scientific computing for Lua.
  / / / _ \/ __/ __/ _ \  |  Type ? for help
 /_/  \___/_/  \__/_//_/  |  https://github.com/torch
                          |  http://torch.ch

th> require 'libcuda_utils'
Segmentation fault: 11

torch abruptly quits with the error Segmentation fault: 11.

When I try to run the first Lua script, the same thing happens:

th neuralstyle_seg.lua -content_image examples/input/in1.png -style_image examples/style/tar1.png -content_seg examples/segmentation/in1.png -style_seg examples/segmentation/tar1.png -index 1 -num_iterations 1000 -save_iter 100 -print_iter 1 -gpu 0 -serial examples/tmp_results
Segmentation fault: 11

And the luajit process crashes as well.

If it is of any help, I've tried to get some info from my libcuda_utils.so via nm:

Nm displays the name list (symbol table) of each object file in the argument list. If an argument is an archive, a listing for each object file in
the archive will be produced. File can be of the form libx.a(x.o), in which case only symbols from that member of the object file are listed.

I tried the -a flag:

-a Display all symbol table entries, including those inserted for use by debuggers.

This is the output (too long to post it here): http://pastebin.com/J2PKBFvX

Could you please run nm -a [pathTo-libcuda_utils.so] on your file and compare your output to mine?
But only if that isn't too tedious for you and actually helps in this case.
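In case it helps: nm -u prints only the undefined symbols, which should be much easier to compare than the full -a dump (just a suggestion on my part):

nm -u libcuda_utils.so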

Do you have any idea how I can check what is causing the Segmentation fault?
 
I guess that OSX needs more specific compiler options... something is still going wrong when compiling libcuda_utils.so, even though I get no errors.

Man, I was so excited to finally test it with my own images. The Matlab code executed fine and the 60 Laplacian .mat files are ready to test. The only thing missing is getting the Lua code to run in torch without crashing...

PS: My first idea for testing the code would be to use, say, 60–120 still images from a 24h time-lapse sequence as style images and apply them to a similar image that I have taken. Any kind of landscape – it doesn't matter, as long as input and style are somewhat similar. Then take the transformed output images, animate them and watch how my image changes through the different lighting scenarios from the time-lapse. I won't give up till I see the result of this :-).

Ignore the message above...
Again, there was an error with the makefile.

Please also ignore what I posted in issue #2 – adding the linker flag -llua caused the problem!

The flag -llua links in the Lua libraries, and in my case it tried to use the 5.2 libs I had installed with brew.
But Torch uses 5.1 AFAIK. Only LuaJIT (-lluajit) is needed.

So I had to brew unlink lua && brew unlink lua51 to even see that the linker was picking up the libraries from the brew Cellar folder...


These are now the correct flags for LDFLAGS_NVCC:
LDFLAGS_NVCC=-L$(PREFIX)/lib -Xlinker -rpath,$(PREFIX)/lib -lluaT -lTHC -lTH -lpng -lluajit
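To double-check which Lua runtime the library actually pulls in, otool can list its dylib dependencies (just a suggestion – this is the macOS equivalent of ldd):

otool -L libcuda_utils.so

The output should list libluajit, not liblua.5.2.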


So finally I can try it out! Hopefully no more errors... otherwise I'll go mad.

First try:

th neuralstyle_seg.lua -content_image examples/input/in1.png -style_image examples/style/tar1.png -content_seg examples/segmentation/in1.png -style_seg examples/segmentation/tar1.png -index 1 -num_iterations 1000 -save_iter 100 -print_iter 1 -gpu 0 -serial examples/tmp_results
gpu, idx = 	0	1
[libprotobuf INFO google/protobuf/io/coded_stream.cc:610] Reading dangerously large protocol message.  If the message turns out to be larger than 1073741824 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:81] The total number of bytes read was 574671192
Successfully loaded models/VGG_ILSVRC_19_layers.caffemodel
conv1_1: 64 3 3 3
conv1_2: 64 64 3 3
conv2_1: 128 64 3 3
conv2_2: 128 128 3 3
conv3_1: 256 128 3 3
conv3_2: 256 256 3 3
conv3_3: 256 256 3 3
conv3_4: 256 256 3 3
conv4_1: 512 256 3 3
conv4_2: 512 512 3 3
conv4_3: 512 512 3 3
conv4_4: 512 512 3 3
conv5_1: 512 512 3 3
conv5_2: 512 512 3 3
conv5_3: 512 512 3 3
conv5_4: 512 512 3 3
fc6: 1 1 25088 4096
fc7: 1 1 4096 4096
fc8: 1 1 4096 1000
Exp serial:	examples/tmp_results
Setting up style layer  	2	:	relu1_1
Setting up style layer  	7	:	relu2_1
Setting up style layer  	12	:	relu3_1
Setting up style layer  	21	:	relu4_1
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-6848/cutorch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory

I will have to try again when my 60+ Chrome tabs are closed :-)

Apparently this is not enough VRAM:
Device 0 [PCIe 0:1:0.0]: GeForce GTX 780 (CC 3.5): 3267.4 of 6143.8 MB (i.e. 53.2%) Free


One more request:
Could you check how much VRAM your card requires for processing one image – just a rough number?
For the first image of the examples:
th neuralstyle_seg.lua -content_image examples/input/in1.png -style_image examples/style/tar1.png -content_seg examples/segmentation/in1.png -style_seg examples/segmentation/tar1.png -index 1 -num_iterations 1000 -save_iter 100 -print_iter 1 -gpu 0 -serial examples/tmp_results

Unfortunately I don't have an octo-GPU cluster. But I would gladly swap my dusty 780 + i7 2600K for your setup :-)

gen_all.py
# number of GPUs available
numGpus = 8 😍

I have to get along with the 6 GB. But I'm looking forward to getting a sweet 1080 Ti – 11 GB gives you way more flexibility. And finally real high-res images from all the neural style algorithms!

Please tell me that 6 GB is enough – I don't want to process multiple images in parallel.
Just one converted image at a time would be OK.

Thanks for looking into it, but I think you've just read above what caused the problem.

Now I have to try again with >3300 MB of free VRAM. Will post if it works!

The images in examples/ are resized to 700 px (larger dimension), which usually takes 5–7 GB of memory. You could try resizing the images to a smaller resolution to save GPU memory.
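For example, something along these lines (a rough sketch using the torch image package, not code from this repo) resizes an image so its larger dimension becomes len:

require 'image'

-- Resize img (a 3 x H x W tensor) so its larger dimension becomes len,
-- keeping the aspect ratio.
local function resize_to_max_dim(img, len)
  local h, w = img:size(2), img:size(3)
  local h2, w2
  if h > w then
    h2 = len
    w2 = math.floor(w * h2 / h)
  else
    w2 = len
    h2 = math.floor(h * w2 / w)
  end
  return image.scale(img, w2, h2, 'bilinear')
end

local img = image.load('examples/input/in1.png', 3)
image.save('examples/input/in1-400.png', resize_to_max_dim(img, 400))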

However, there is one trick to support very high-res images: use a small resolution on the GPU, then apply joint bilateral upsampling to the local affine matrices and apply those to the high-res images. I will explain how to do that in the supplemental material once the submission is finished...

Thanks for the trick with the upsampling – but I don't think I'm able to implement it myself :-). I guess I'll have to wait till you add this function.

Just out of curiosity: How does "joint bilateral upsampling on the local affine matrices" work?
In simple terms :-).

I've tried again with ~4200 MB of free VRAM, but after
Setting up content layer 23 : relu4_2

it crashes again. So to process the 700 px images you need at least an 8 GB card. Ah, that's just great – there are no OSX drivers out yet for the 10XX cards. The best card on OSX right now is the GTX 980 Ti, and it also has "only" 6 GB of VRAM.

Will try to reduce the image sizes and post if it works.

Resized the first image to 400 × 225 px and it's working (for now 😀) – it needs 3.5 GB:

th neuralstyle_seg.lua -content_image examples/input/in1-400.png -style_image examples/style/tar1-400.png -content_seg examples/segmentation/in1-400.png -style_seg examples/segmentation/tar1-400.png -index 1 -num_iterations 1000 -save_iter 100 -print_iter 1 -gpu 0 -serial examples/tmp_results
gpu, idx = 	0	1
[libprotobuf INFO google/protobuf/io/coded_stream.cc:610] Reading dangerously large protocol message.  If the message turns out to be larger than 1073741824 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:81] The total number of bytes read was 574671192
Successfully loaded models/VGG_ILSVRC_19_layers.caffemodel
conv1_1: 64 3 3 3
conv1_2: 64 64 3 3
conv2_1: 128 64 3 3
conv2_2: 128 128 3 3
conv3_1: 256 128 3 3
conv3_2: 256 256 3 3
conv3_3: 256 256 3 3
conv3_4: 256 256 3 3
conv4_1: 512 256 3 3
conv4_2: 512 512 3 3
conv4_3: 512 512 3 3
conv4_4: 512 512 3 3
conv5_1: 512 512 3 3
conv5_2: 512 512 3 3
conv5_3: 512 512 3 3
conv5_4: 512 512 3 3
fc6: 1 1 25088 4096
fc7: 1 1 4096 4096
fc8: 1 1 4096 1000
Exp serial:	examples/tmp_results
Setting up style layer  	2	:	relu1_1
Setting up style layer  	7	:	relu2_1
Setting up style layer  	12	:	relu3_1
Setting up style layer  	21	:	relu4_1
Setting up content layer	23	:	relu4_2
Setting up style layer  	30	:	relu5_1
WARNING: Skipping content loss
Iteration 1 / 1000
  Content 1 loss: 2891531.875000
  Style 1 loss: 181216.455365
  Style 2 loss: 2160761.679936
  Style 3 loss: 472989.727524
  Style 4 loss: 31317802.261623
  Style 5 loss: 2534.529370
  Total loss: 37026836.528817
<optim.lbfgs> 	creating recyclable direction/step/history buffers
Iteration 2 / 1000
  Content 1 loss: 2891529.375000
  Style 1 loss: 181216.455365
  Style 2 loss: 2160761.618892
  Style 3 loss: 472989.727524
  Style 4 loss: 31317801.369620
  Style 5 loss: 2534.529370
  Total loss: 37026833.075770
Iteration 3 / 1000
  Content 1 loss: 2759849.687500
  Style 1 loss: 181093.108931
  Style 2 loss: 2133093.635516
  Style 3 loss: 464212.530519
  Style 4 loss: 31016996.054152
  Style 5 loss: 2528.815278

Great to hear that! : )

The locally affine function maps input RGB to output RGB by multiplying a spatially varying matrix with the input RGB, pixel by pixel. The upsampling idea, in short, is to upsample those matrices using the high-res input as guidance, and then apply the high-res matrices to reconstruct the final high-res output. (One sample is attached below to illustrate the effect; image resolution: 3500 x 2340.)
input
output
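If it helps to see the affine part concretely, here is a toy torch version of just the per-pixel step (an illustration only, not the actual implementation):

require 'torch'

-- Toy illustration of the locally affine color model: each output pixel
-- is A(p) * [r g b 1]^T, where A(p) is a per-pixel 3x4 matrix.
-- img: 3 x H x W input image, A: H x W x 3 x 4 per-pixel matrices.
local function apply_local_affine(img, A)
  local H, W = img:size(2), img:size(3)
  local out = torch.Tensor(3, H, W)
  for y = 1, H do
    for x = 1, W do
      local rgb1 = torch.Tensor({ img[1][y][x], img[2][y][x], img[3][y][x], 1 })
      out[{ {}, y, x }] = A[y][x] * rgb1  -- 3x4 matrix times 4-vector
    end
  end
  return out
end

Upsampling then means enlarging A with joint bilateral upsampling, guided by the high-res input, and calling the same mapping on the high-res image.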

That looks great! The only thing that seems a little bit odd when converting daytime -> night are the shadows (illuminated surfaces and the directions of shadows).

But the results are nonetheless amazing – especially the fake infrared ones! Three years ago I modified my old Nikon D80 to be able to take infrared pictures, so I have some experience with real infrared images. And I guess if I took my old test images now (with and without the infrared filter), transformed the non-IR images with your code and then compared the results, there wouldn't be much of a difference between the generated and the real IR image.

I think I can crank up the resolution to 500 px at most. I don't know why OSX needs 2 GB of VRAM reserved for the OS...

Maybe I can somehow manage to get more than 4.2 GB free and try it again with the original images.

This is my result for the temp image:
out1_t_1000

The result best1_t_100.png:
best1_t_100

The result best1_t_200.png:
best1_t_200

And now something went wrong...
The result best1_t_300.png:
best1_t_300

Final result:
best1_t_1000

Does it even work correctly when I use the Input_Laplacian_3x3_1e-7_CSR1.mat that was generated with/for the 700 px image?

Do I need to go back into Matlab and generate the Laplacian file for my 400 px image?

Yes, you will need to re-generate the matting Laplacian matrix at 400 px resolution.

Ah, I guess I need to change the following in gen_laplacian.m:
input = reshape_img(input, 700);

Would it be enough to just change the 700 to 400?

I will batch-resize all the examples to 400 px width in the meantime.

Yes, that should be enough. One thing you need to be careful about is that the downsampled image resolution has to match exactly in both Matlab and torch... The reason is that sometimes there can be a 1 px difference in width or height (the smaller dimension):

if h > w
    h2 = len;
    w2 = floor(w * h2 / h);
else 
    w2 = len;
    h2 = floor(h * w2 / w);
end 

Or you can omit input = reshape_img(input, 700); in Matlab and simply load the torch-downsampled images to compute those *.mat files.
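To be safe, you can print the size torch sees and compare it against size(imread(...)) in Matlab, e.g.:

th> require 'image'; print(image.load('examples/input/in1-400.png', 3):size())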

OK thanks, I will keep that in mind. I have resized all the images and tried running deepmatting_seg.lua with the previous result (and the updated Laplacian matrix file), and now it seems to work as it should! Finally :-)

Iteration 1000 / 1000
  Content 1 loss: 1396670.156250
  Style 1 loss: 635.449047
  Style 2 loss: 14420.737871
  Style 3 loss: 16076.639486
  Style 4 loss: 252506.281133
  Style 5 loss: 968.837423
  Total loss: 1681278.101209
<optim.lbfgs> 	reached max number of iterations

in1

tar1

~TADAAA

best1_t_1000

The only downside is that you need a powerful GPU to get higher resolutions (at this time). I hope you find the time to implement your upscaling trick.

If I recall correctly, the neural-style code from jcjohnson has some parameters to reduce the required VRAM:
https://github.com/jcjohnson/neural-style#memory-usage


Use cuDNN: Add the flag -backend cudnn to use the cuDNN backend. This will only work in GPU mode.
Use ADAM: Add the flag -optimizer adam to use ADAM instead of L-BFGS. This should significantly reduce memory usage, but may require tuning of other parameters for good results; in particular you should play with the learning rate, content weight, style weight, and also consider using gradient normalization. This should work in both CPU and GPU modes.
Reduce image size: If the above tricks are not enough, you can reduce the size of the generated image; pass the flag -image_size 256 to generate an image at half the default size.
With the default settings, neural-style uses about 3.5GB of GPU memory on my system; switching to ADAM and cuDNN reduces the GPU memory footprint to about 1GB.


Is it possible to use cuDNN or the ADAM solver with your Lua code?

And thanks again for your replies – it's not that common that you get instant feedback on GitHub. Dankeschön ("thank you very much") :-)

Hi, yes – the current implementation is based on that code, so it should be easy to add the cuDNN/ADAM flags back. I forget when I removed those during this project...

-backend cudnn and -cudnn_autotune are working fine – I'm testing them right now. To add the ADAM option, I think I just need to copy the optimizer flags and the appropriate function(s) from jcjohnson's neural_style.lua and paste them in the right place in your file.
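For reference, the optimizer branch in jcjohnson's neural_style.lua looks roughly like this (paraphrased from memory, so treat it as a sketch – feval returns the loss and gradient for the current image img):

local optim = require 'optim'

local optim_state
if params.optimizer == 'lbfgs' then
  optim_state = { maxIter = params.num_iterations, verbose = true }
elseif params.optimizer == 'adam' then
  optim_state = { learningRate = params.learning_rate }
end

if params.optimizer == 'lbfgs' then
  -- L-BFGS runs all iterations inside a single call
  optim.lbfgs(feval, img, optim_state)
elseif params.optimizer == 'adam' then
  -- ADAM takes one step per call
  for t = 1, params.num_iterations do
    optim.adam(feval, img, optim_state)
  end
end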

But I think the ADAM solver produced lower-quality results. And maybe you also need to load another model...
https://github.com/jcjohnson/neural-style#usage

Either the other model is only required for OpenCL, or it could be that it only works with the ADAM solver.
I'll look into it.

But for now I'm more than glad that everything actually works :-)

Glad to know that, thanks : )

@subzerofun

But i think the ADAM solver produced lower quality results. And maybe you also need to load another model...

That's because the ADAM optimizer needs the -learning_rate parameter adjusted so that it is greater than 1.

In Neural-Style, the -learning_rate parameter only works with ADAM. It does not have any effect on the lbfgs optimizer, which I believe has a learning rate equivalent to 10.
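For example (the image paths here are just placeholders):

th neural_style.lua -optimizer adam -learning_rate 10 -backend cudnn -content_image content.png -style_image style.png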

Either the other model is just required for OpenCL or it could be that it only works for the ADAM solver.
Will look into it.

ADAM works with all the different -backend options.

It really is pretty good!