openai/glow

mpiexec hangs on creating pad

0xymoro opened this issue · 1 comments

Hi, quick issue with mpiexec. Without it, the program runs fine on 1 GPU (I am running Horovod inside a Docker container), but it hangs whenever it is launched through mpiexec.

I ran strace, and it hangs after this sequence of "Creating pad" messages; any hints would be appreciated!

write(1, "Creating pad 1_1_6_6\n", 21) = 21
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=23, events=POLLIN}, {fd=30, events=POLLIN}, {fd=28, events=POLLIN}, {fd=0, events=POLLIN}, {fd=32, events=POLLIN}, {fd=24, events=POLLIN}], 9, -1) = 1 ([{fd=24, revents=POLLIN}])
read(24, "Creating pad 1_1_4_4\n", 4096) = 21
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=23, events=POLLIN}, {fd=30, events=POLLIN}, {fd=28, events=POLLIN}, {fd=0, events=POLLIN}, {fd=32, events=POLLIN}, {fd=24, events=POLLIN}], 9, 0) = 0 (Timeout)
write(1, "Creating pad 1_1_4_4\n", 21) = 21
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=23, events=POLLIN}, {fd=30, events=POLLIN}, {fd=28, events=POLLIN}, {fd=0, events=POLLIN}, {fd=32, events=POLLIN}, {fd=24, events=POLLIN}], 9, -1
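
In the trace above, the traced process reads a "Creating pad" line from fd 24, echoes it to stdout, and then blocks in poll with an infinite timeout. That pattern suggests (but does not prove) that strace was attached to the mpiexec launcher, which is only forwarding output, while the actual hang is inside one of the ranks. A minimal sketch for seeing where each rank is stuck, assuming the training script is Python: the standard `faulthandler` module can dump every thread's stack after a timeout. The helper names `arm_hang_dump` and `disarm_hang_dump` are hypothetical, and the rank environment variables are the ones set by Open MPI and MPICH-style launchers; other launchers may use different names.

```python
import faulthandler
import os
import sys


def arm_hang_dump(timeout_s=120.0, stream=sys.stderr):
    """Dump all Python thread stacks every timeout_s seconds until disarmed.

    Hypothetical helper: call it once near the top of the training script.
    """
    # Assumption: OMPI_COMM_WORLD_RANK (Open MPI) or PMI_RANK (MPICH-style)
    # is set by the launcher; fall back to "?" when neither is present.
    rank = os.environ.get("OMPI_COMM_WORLD_RANK", os.environ.get("PMI_RANK", "?"))
    print(
        f"[rank {rank}] pid {os.getpid()}: stack dump armed ({timeout_s:g}s)",
        file=stream,
        flush=True,
    )
    # faulthandler writes straight to the stream's file descriptor from a
    # watchdog thread, so it still fires if the main thread is blocked in C code.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=stream, exit=False)
    return rank


def disarm_hang_dump():
    """Cancel the pending dump once the suspect section has completed."""
    faulthandler.cancel_dump_traceback_later()
```

Calling `arm_hang_dump()` before graph construction and `disarm_hang_dump()` after the first training step would print, for each rank that is still blocked, the Python line it is waiting on. The printed pid can also be passed to `strace -p` to trace the rank itself instead of the launcher.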

Have you solved this issue?