NLKNguyen/alpine-mpich

Host detection not finding second container

rcplane opened this issue · 3 comments

Hello, I have recently been trying to use this project on two Docker Engine host VMs in a private cloud. However, the get_hosts script, which runs netstat -t, does not discover any other containers, and executing mpirun hostname shows only the MPI master container.

For testing purposes, I used the Docker and Docker Compose versions recommended for the cluster setup and opened all ports between the Docker Engine hosts on the private cloud. I was able to run a Docker Swarm service with multiple basic nginx containers on each host, attached to a Docker Swarm overlay network, much like your setup. I was also able to attach a second Docker Swarm service using the nginx image to the same overlay network, and all of these containers communicated fine. In particular, from within the nginx containers I was able to ping, curl, and telnet between containers using each container's overlay network IP address, which can be found by running docker network list and then docker network inspect network_id on each Docker host VM.

When I install the same utilities in containers running my private build of alpine-mpich and run ping and telnet between them, I get no response. As I understand it, containers for services attached to a swarm overlay network should be able to communicate freely without specifying additional ports. But should docker ps on either of the two Docker host VMs show that the master or worker alpine-mpich container is using port 22/tcp?

I would appreciate any help debugging or setup advice you can provide.
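For anyone debugging the same symptom, it can help to see what netstat -t actually reports on the master. The sketch below is not the project's get_hosts script; it is a minimal Python illustration, assuming the usual six-column netstat -t layout, of pulling the remote hosts out of ESTABLISHED TCP connections. The function name and the sample output are made up for illustration.

```python
def established_peers(netstat_output: str) -> list[str]:
    """Return unique remote hosts of ESTABLISHED TCP connections, in order seen."""
    peers = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows: Proto Recv-Q Send-Q Local-Address Foreign-Address State
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue  # skips the banner and the column header
        if fields[5] != "ESTABLISHED":
            continue
        host = fields[4].rsplit(":", 1)[0]  # drop the trailing :port
        if host not in peers:
            peers.append(host)
    return peers


# Hypothetical sample, not captured from a real cluster.
SAMPLE = """\
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 master:22               worker_1.net:41234      ESTABLISHED
tcp        0      0 master:22               worker_2.net:41236      ESTABLISHED
tcp        0      0 master:51000            10.0.0.9:443            TIME_WAIT
"""

print(established_peers(SAMPLE))
```

If a function like this returns an empty list on the master, no worker has an open TCP connection to it, which points at the network between containers rather than at MPI itself.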

I will look into this.
Yes, there's no need to expose any additional ports other than the one used for SSH login to the master node.
Have you tried using the pre-built alpine-mpich directly instead? I'd like to know if it behaves normally like the screencast.

Thank you for the feedback. Upon inspecting /var/log/syslog, I discovered that I had not actually opened all of the ports Docker Swarm requires. When I tried the pre-built container, I saw logged errors about mpi-bootstrap not being found on the system path. After opening the missing ports, a local build of the alpine-mpich container ran just like the screencast. Awesome!
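For reference, Docker's swarm documentation lists 2377/tcp (cluster management), 7946/tcp and 7946/udp (node-to-node communication), and 4789/udp (overlay network traffic) as the ports that must be open between hosts. A rough way to check the TCP ones from another host is sketched below in Python. This is only an illustration under those assumptions: the function names are hypothetical, and the UDP ports cannot be verified with a plain connect, so they still need a look at the firewall rules.

```python
import socket

# TCP ports from Docker's swarm mode documentation.
# 2377 is only expected to answer on manager nodes.
SWARM_TCP_PORTS = {2377: "cluster management", 7946: "node communication"}
# 7946/udp and 4789/udp (overlay traffic) are not covered by this probe.


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_swarm_ports(host: str) -> dict[int, bool]:
    """Probe each swarm TCP port on host and map port -> reachable."""
    return {port: tcp_port_open(host, port) for port in SWARM_TCP_PORTS}


# Example (replace the placeholder with the address of the other Docker host):
# print(check_swarm_ports("<other-docker-host>"))
```

A False for 7946 on any node, or for 2377 on a manager, suggests the kind of blocked port described above.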

Very nice!! Glad that you were able to debug this issue and everything is sorted out. 👍
I actually haven't configured a private cloud myself. I only provision Docker hosts on DigitalOcean using the provided driver, and the ports seem to be opened correctly for a functional Swarm, so I haven't encountered problems in that regard. The information you provided will be helpful to me and to others who run into a similar problem in the future. Thanks 😃