Segmentation fault for all examples with python3
Nqabz opened this issue · 6 comments
I tried testing all your examples but keep running into segmentation faults when using 1, 2, 3, ..., 8 GPUs. Is there a fix for this? I see you had an earlier bug related to segmentation faults. Here is my fault trace:

```
Theano-MPI started 2 workers for
1.updating Cifar10_model params through iterations and
2.exchange the params with EASGD
See output log.
cluster3.31164hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
cluster3.31163hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
cluster3.31165hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
Using cuDNN version 5005 on context None
Mapped name None to device cuda2: Tesla K80 (0000:08:00.0)
INFO (theano.gof.compilelock): Waiting for existing lock by process '31163' (I am process '31164')
INFO (theano.gof.compilelock): To manually release the lock, delete /home/.theano/compiledir_Linux-3.10-el7.x86_64-x86_64-with-centos-7.3.1611-Core-x86_64-3.4.2-64/lock_dir
Using cuDNN version 5005 on context None
Mapped name None to device cuda0: Tesla K80 (0000:04:00.0)
Using Theano backend.
input shape is: (3, 32, 32, 256)
subtract shape is: (3, 32, 32, 1)
center margin is: 0
crop size is: 32
flag_on is: <GpuArrayType<None>(float32, ())>
[cluster3:31163] *** Process received signal ***
[cluster3:31163] Signal: Segmentation fault (11)
[cluster3:31163] Signal code: (128)
[cluster3:31163] Failing at address: (nil)
[cluster3:31163] [ 0] /usr/lib64/libpthread.so.0(+0xf370)[0x7fd4390c8370]
[cluster3:31163] [ 1] /usr/local/lib/libgpuarray.so.2(gpukernel_release+0xa)[0x7fd40c8c898a]
[cluster3:31163] [ 2] /usr/local/lib/libgpuarray.so.2(GpuKernel_clear+0x11)[0x7fd40c8d2131]
[cluster3:31163] [ 3] /usr/local/lib/libgpuarray.so.2(GpuKernel_init+0xb0)[0x7fd40c8d2200]
[cluster3:31163] [ 4] /home/dlq/.theano/compiledir_Linux-3.10-el7.x86_64-x86_64-with-centos-7.3.1611-Core-x86_64-3.4.2-64/tmp7kzja1uh/m7a35c53365410a3b80c5389af5d2afa5.so(+0x190a)[0x7fd3e6bde90a]
[cluster3:31163] [ 5] /usr/local/lib/libpython3.4m.so.1.0(PyEval_EvalFrameEx+0x5bd3)[0x7fd43941c253]
[cluster3:31163] [ 6] /usr/local/lib/libpython3.4m.so.1.0(PyEval_EvalCodeEx+0x882)[0x7fd43941f0f2]
[cluster3:31163] [20] /usr/local/lib/libpython3.4m.so.1.0(PyEval_EvalCodeEx+0x882)[0x7fd43941f0f2]
[cluster3:31163] [21] /usr/local/lib/libpython3.4m.so.1.0(PyEval_EvalFrameEx+0x5d13)[0x7fd43941c393]
[cluster3:31163] [22] /usr/local/lib/libpython3.4m.so.1.0(PyEval_EvalCodeEx+0x882)[0x7fd43941f0f2]
[cluster3:31163] [23] /usr/local/lib/libpython3.4m.so.1.0(PyEval_EvalFrameEx+0x5d13)[0x7fd43941c393]
[cluster3:31163] [24] /usr/local/lib/libpython3.4m.so.1.0(PyEval_EvalCodeEx+0x882)[0x7fd43941f0f2]
[cluster3:31163] [25] /usr/local/lib/libpython3.4m.so.1.0(PyEval_EvalFrameEx+0x5d13)[0x7fd43941c393]
[cluster3:31163] [26] /usr/local/lib/libpython3.4m.so.1.0(PyEval_EvalCodeEx+0x882)[0x7fd43941f0f2]
[cluster3:31163] [27] /usr/local/lib/libpython3.4m.so.1.0(PyEval_EvalFrameEx+0x5d13)[0x7fd43941c393]
[cluster3:31163] [28] /usr/local/lib/libpython3.4m.so.1.0(PyEval_EvalCodeEx+0x882)[0x7fd43941f0f2]
[cluster3:31163] [29] /usr/local/lib/libpython3.4m.so.1.0(PyEval_EvalFrameEx+0x5d13)[0x7fd43941c393]
[gist-smapper3:31163] *** End of error message ***
Using cuDNN version 5005 on context None
Mapped name None to device cuda1: Tesla K80 (0000:05:00.0)
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 31163 on node gist-smapper3 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Rule session 31159 terminated with return code: 139.
```
It seems the Python 3 support for Theano-MPI is still experimental, but python2 should work since the project was developed on python2. I tried installing Anaconda python3.6 and tested it using the test_model.py script:
```
$ cd theanompi/models/
$ python3 test_model.py cifar10 Cifar10_model
Using cuDNN version 5110 on context None
Mapped name None to device cuda0: GeForce GTX TITAN Black (0000:03:00.0)
rank0: bad list is [], extended to 156
rank0: bad list is [], extended to 39
Cifar10_model
Layer Subtract in (3, 32, 32, 256) --> out (3, 32, 32, 256)
Layer Crop in [ 3 32 32 256] --> out (3, 28, 28, 256)
Layer Dimshuffle in [ 3 28 28 256] --> out (256, 3, 28, 28)
Layer Conv (cudnn) in [256 3 28 28] --> out (256, 64, 24, 24)
Layer Pool in [256 64 24 24] --> out (256, 64, 12, 12)
Layer Conv (cudnn) in [256 64 12 12] --> out (256, 128, 8, 8)
Layer Pool in [256 128 8 8] --> out (256, 128, 4, 4)
Layer Conv (cudnn) in [256 128 4 4] --> out (256, 64, 2, 2)
Layer Flatten in [256 64 2 2] --> out (256, 256)
Layer FC in [256 256] --> out (256, 256)
Layer Dropout0.5 in [256 256] --> out (256, 256)
Layer Softmax in [256 256] --> out (256, 10)
[64 3 5 5]
[64]
[128 64 5 5]
[128]
[ 64 128 3 3]
[64]
[256 256]
[256]
[256 10]
[10]
model size 0.336 M floats
compiling training function...
[GPU8:155335] *** Process received signal ***
[GPU8:155335] Signal: Segmentation fault (11)
[GPU8:155335] Signal code: (128)
[GPU8:155335] Failing at address: (nil)
[GPU8:155335] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10330)[0x7f9e7a4bd330]
[GPU8:155335] [ 1] /export/mlrg/hma02/anaconda2-x86_64/lib/python2.7/site-packages/pygpu/../../../libgpuarray.so.2(gpukernel_release+0xa)[0x7f9e555b554a]
[GPU8:155335] [ 2] /export/mlrg/hma02/anaconda2-x86_64/lib/python2.7/site-packages/pygpu/../../../libgpuarray.so.2(GpuKernel_clear+0x11)[0x7f9e555bf571]
[GPU8:155335] [ 3] /export/mlrg/hma02/anaconda2-x86_64/lib/python2.7/site-packages/pygpu/../../../libgpuarray.so.2(GpuKernel_init+0xe8)[0x7f9e555bf678]
[GPU8:155335] [ 4] /export/mlrg/hma02/.theano/compiledir_Linux-3.13--generic-x86_64-with-debian-jessie-sid-x86_64-2.7.13-64/tmpGRA547/52e419a14bf3ba72bf7b2d47176d6a81.so(+0x17da)[0x7f9e30d0c7da]
[GPU8:155335] [ 5] /export/mlrg/hma02/anaconda2-x86_64/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x84fd)[0x7f9e7a7c8bad]
[GPU8:155335] [ 6] /export/mlrg/hma02/anaconda2-x86_64/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x89e)[0x7f9e7a7c9c3e]
[GPU8:155335] [ 7] /export/mlrg/hma02/anaconda2-x86_64/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x8b47)[0x7f9e7a7c91f7]
[GPU8:155335] [ 8] /export/mlrg/hma02/anaconda2-x86_64/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x89e)[0x7f9e7a7c9c3e]
[GPU8:155335] [ 9] /export/mlrg/hma02/anaconda2-x86_64/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x8b47)[0x7f9e7a7c91f7]
[GPU8:155335] [10] /export/mlrg/hma02/anaconda2-x86_64/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x89e)[0x7f9e7a7c9c3e]
[GPU8:155335] [11] /export/mlrg/hma02/anaconda2-x86_64/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x8b47)[0x7f9e7a7c91f7]
[GPU8:155335] [12] /export/mlrg/hma02/anaconda2-x86_64/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x89e)[0x7f9e7a7c9c3e]
[GPU8:155335] [13] /export/mlrg/hma02/anaconda2-x86_64/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x8b47)[0x7f9e7a7c91f7]
[GPU8:155335] [14] /export/mlrg/hma02/anaconda2-x86_64/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x89e)[0x7f9e7a7c9c3e]
[GPU8:155335] [15] /export/mlrg/hma02/anaconda2-x86_64/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x8b47)[0x7f9e7a7c91f7]
[GPU8:155335] [16] /export/mlrg/hma02/anaconda2-x86_64/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x89e)[0x7f9e7a7c9c3e]
[GPU8:155335] [17] /export/mlrg/hma02/anaconda2-x86_64/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x8b47)[0x7f9e7a7c91f7]
[GPU8:155335] [18] /export/mlrg/hma02/anaconda2-x86_64/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x89e)[0x7f9e7a7c9c3e]
[GPU8:155335] [19] /export/mlrg/hma02/anaconda2-x86_64/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x8b47)[0x7f9e7a7c91f7]
[GPU8:155335] [20] /export/mlrg/hma02/anaconda2-x86_64/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x89e)[0x7f9e7a7c9c3e]
[GPU8:155335] [21] /export/mlrg/hma02/anaconda2-x86_64/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x8b47)[0x7f9e7a7c91f7]
[GPU8:155335] [22] /export/mlrg/hma02/anaconda2-x86_64/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x89e)[0x7f9e7a7c9c3e]
[GPU8:155335] [23] /export/mlrg/hma02/anaconda2-x86_64/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x8b47)[0x7f9e7a7c91f7]
[GPU8:155335] [24] /export/mlrg/hma02/anaconda2-x86_64/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x89e)[0x7f9e7a7c9c3e]
[GPU8:155335] [25] /export/mlrg/hma02/anaconda2-x86_64/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x8b47)[0x7f9e7a7c91f7]
[GPU8:155335] [26] /export/mlrg/hma02/anaconda2-x86_64/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x89e)[0x7f9e7a7c9c3e]
[GPU8:155335] [27] /export/mlrg/hma02/anaconda2-x86_64/bin/../lib/libpython2.7.so.1.0(+0x79b68)[0x7f9e7a744b68]
[GPU8:155335] [28] /export/mlrg/hma02/anaconda2-x86_64/bin/../lib/libpython2.7.so.1.0(PyObject_Call+0x53)[0x7f9e7a714e93]
[GPU8:155335] [29] /export/mlrg/hma02/anaconda2-x86_64/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x61d6)[0x7f9e7a7c6886]
[GPU8:155335] *** End of error message ***
Segmentation fault (core dumped)
$
```
I got a segfault when compiling the training function as well.
I tried on both ppc64le and x86_64 systems, following the steps below:

1. I have a working OpenMPI (version 1.8.8) and then installed Anaconda python3.
2. git clone mpi4py and build it against my OpenMPI (check `which mpirun` and set the corresponding path in the openmpi section of mpi.cfg, then run `python3 setup.py build --mpi=openmpi`, then `pip3 install -U .`)
3. `conda install pygpu`
4. `conda install theano`
5. git clone hickle, `cd hickle`, `git checkout dev` and `pip3 install -U .`
6. git clone theanompi and `pip3 install -U .`
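As a side note, a quick sanity check that step 2 worked, i.e. that mpi4py is importable and was built against the Open MPI on your PATH. This is just a sketch: `shutil.which` and `mpi4py.get_config()` are real APIs, and the try/except reports a missing build instead of crashing:

```python
# Sketch: report which mpirun is on PATH and what mpi4py was built against.
import shutil

def mpi_env_report():
    """Collect basic facts about the MPI setup; works even without mpi4py."""
    report = {"mpirun": shutil.which("mpirun")}
    try:
        import mpi4py
        # get_config() returns the compiler/library settings used at build time
        report["mpi4py_config"] = mpi4py.get_config()
    except ImportError:
        report["mpi4py_config"] = None
    return report

if __name__ == "__main__":
    for key, value in mpi_env_report().items():
        print(f"{key}: {value}")
```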
The test passes with python3 on ppc64le but not on x86_64. From the traceback, the libpython3.6 on x86_64 seems to have problems with libgpuarray; the test passes with python2 on both architectures.
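To see which shared objects a running Python process has actually mapped (and so spot a mismatched libpython/libgpuarray pair like the one in the traceback), here is a small Linux-only sketch that parses /proc/self/maps; nothing in it comes from Theano-MPI itself:

```python
# Sketch: list shared objects mapped into the current process (Linux only),
# filtered by keyword, by parsing /proc/self/maps.
def loaded_libs(keywords=("libgpuarray", "libpython")):
    libs = set()
    try:
        with open("/proc/self/maps") as f:
            for line in f:
                parts = line.split()
                # the pathname column only exists for file-backed mappings
                if len(parts) >= 6 and any(k in parts[-1] for k in keywords):
                    libs.add(parts[-1])
    except FileNotFoundError:  # not on Linux
        pass
    return sorted(libs)

if __name__ == "__main__":
    import sys
    print(sys.version)
    for path in loaded_libs():
        print(path)
```

Running this right before the segfaulting call (or in the same interpreter) shows whether the interpreter and libgpuarray come from the same install prefix.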
If you need to try out ideas with Theano-MPI, I recommend starting with python2 until the python3 support is complete.
Thanks for checking. I will try Python 2.7.
I just tried upgrading my Theano to the bleeding-edge version. You just need to change step 4 in the steps above to:
- git clone theano, `cd theano` and `pip3 install -U .`
I tested the bsp example and it's working now with python3:
```
$ python3 test_bsp.py
Theano-MPI started 2 workers for
1.updating Cifar10_model params through iterations and
2.exchange the params with BSP(cdd,nccl32)
See output log.
Using cuDNN version 5110 on context None
Mapped name None to device cuda0: GeForce GTX TITAN Black (0000:83:00.0)
Using cuDNN version 5110 on context None
Mapped name None to device cuda1: GeForce GTX TITAN (0000:04:00.0)
rank0: bad list is [], extended to 156
rank0: bad list is [38], extended to 40
Cifar10_model
Layer Subtract in (3, 32, 32, 256) --> out (3, 32, 32, 256)
Layer Crop in [ 3 32 32 256] --> out (3, 28, 28, 256)
Layer Dimshuffle in [ 3 28 28 256] --> out (256, 3, 28, 28)
Layer Conv (cudnn) in [256 3 28 28] --> out (256, 64, 24, 24)
Layer Pool in [256 64 24 24] --> out (256, 64, 12, 12)
Layer Conv (cudnn) in [256 64 12 12] --> out (256, 128, 8, 8)
Layer Pool in [256 128 8 8] --> out (256, 128, 4, 4)
Layer Conv (cudnn) in [256 128 4 4] --> out (256, 64, 2, 2)
Layer Flatten in [256 64 2 2] --> out (256, 256)
Layer FC in [256 256] --> out (256, 256)
Layer Dropout0.5 in [256 256] --> out (256, 256)
Layer Softmax in [256 256] --> out (256, 10)
[64 3 5 5]
[64]
[128 64 5 5]
[128]
[ 64 128 3 3]
[64]
[256 256]
[256]
[256 10]
[10]
model size 0.336 M floats
compiling training function...
INFO (theano.gof.compilelock): Waiting for existing lock by process '188888' (I am process '188887')
INFO (theano.gof.compilelock): To manually release the lock, delete /export/mlrg/hma02/.theano/compiledir_Linux-3.13--generic-x86_64-with-debian-jessie-sid-x86_64-3.6.1-64/lock_dir
compiling validation function...
Compile time: 39.133 s
40 2.206262 0.819336
time per 40 batches: 0.85 (train 0.38 comm 0.39 wait 0.08)
```
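For context on that last timing line, a back-of-envelope throughput estimate. The batch size of 256 is read off the (256, 3, 28, 28) layer shapes above; this is my own arithmetic, not Theano-MPI output:

```python
# Back-of-envelope: images/s per worker from "time per 40 batches: 0.85".
batch_size = 256   # from the (256, 3, 28, 28) shapes in the layer printout
batches = 40
seconds = 0.85     # train 0.38 + comm 0.39 + wait 0.08
images_per_sec = batch_size * batches / seconds
print(f"~{images_per_sec:.0f} images/s per worker")  # ~12047 images/s
```

Note that roughly as much time goes to communication (0.39 s) as to training (0.38 s) here, which is typical for BSP on only two workers with small per-batch compute.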