testcontainers/testcontainers-python

Bug: Ryuk fails to start due to port binding (colima, timing)

sondr3 opened this issue · 17 comments

Describe the bug

I upgraded from 3.5.0 to 4.1.0 and the container itself fails to spawn because the Ryuk container setup fails. I've tried debugging the issue and it looks like it is trying to bind the port exposed on IPv6 to the port on IPv4 (the container_port variable is correct for IPv4), which are for some reason different ports.

$ docker ps
CONTAINER ID   IMAGE                       COMMAND       CREATED         STATUS         PORTS                                         NAMES
4f1bad20a38c   testcontainers/ryuk:0.5.1   "/bin/ryuk"   7 seconds ago   Up 5 seconds   0.0.0.0:33029->8080/tcp, :::32775->8080/tcp   testcontainers-ryuk-1cf580e2-54c4-496d-a1d7-17f495911219

To Reproduce

Provide a self-contained code snippet that illustrates the bug or unexpected behavior. Ideally, send a Pull Request with a test that illustrates the problem.

>       Reaper._socket.connect((container_host, container_port))
E       ConnectionRefusedError: [Errno 61] Connection refused

Runtime environment

Provide a summary of your runtime environment. Which operating system, python version, and docker version are you using? What is the version of testcontainers-python you are using? You can run the following commands to get the relevant information.

$ uname -a
Darwin jupiter.local 23.3.0 Darwin Kernel Version 23.3.0: Wed Dec 20 21:30:44 PST 2023; root:xnu-10002.81.5~7/RELEASE_ARM64_T6000 arm64
$ python --version
Python 3.11.7
$ docker info
Client: Docker Engine - Community
 Version:    25.0.4
 Context:    colima
 Debug Mode: false
 Plugins:
  compose: Docker Compose (Docker Inc.)
    Version:  2.25.0
    Path:     /Users/sondre/.docker/cli-plugins/docker-compose

Server:
 Containers: 4
  Running: 0
  Paused: 0
  Stopped: 4
 Images: 112
 Server Version: 24.0.7
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 61f9fd88f79f081d64d6fa3bb1a0dc71ec870523
 runc version: v1.1.9-0-gccaecfc
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.5.0-21-generic
 Operating System: Ubuntu 23.10
 OSType: linux
 Architecture: aarch64
 CPUs: 2
 Total Memory: 3.817GiB
 Name: colima
 ID: ac2c6903-b356-409d-9301-b040440d1efd
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

I can't reproduce this on my M1 environment, but after what I saw with IPv6 while working with compose, I have no doubt there is potential for an issue here.

Hi @sondr3!

Sorry to hear that you are having problems. I am on an M3 setup myself, but haven't encountered the same problem that you have. Am I reading it right that you are using colima with an x86 VM as your Docker runtime on an M1 / arm64 system?

Do you encounter the same problem if you run with a native arm64 backend without virtualization/colima?

We'll follow up with you closely on this one, as Ryuk is important for us to run smoothly on all architectures.

@santi, correct. However, I can't run the image natively, I need to run the MSSQL Docker image for tests at $WORK and it only has amd64 images available :(

$ colima status
INFO[0000] colima is running using macOS Virtualization.Framework 
INFO[0000] arch: aarch64                                
INFO[0000] runtime: docker                              
INFO[0000] mountType: virtiofs                          
INFO[0000] socket: unix:///Users/sondre/.colima/default/docker.sock 

Ah, try using the mcr.microsoft.com/azure-sql-edge:1.0.7 image instead. It has ARM64 support and the API is compatible with mssql (Note: I haven't tried it extensively, only used as part of testing in this repo).

This doesn't really solve your problem, but worth a try:

import sqlalchemy
from testcontainers.mssql import SqlServerContainer

with SqlServerContainer("mcr.microsoft.com/azure-sql-edge:1.0.7") as mssql:
    engine = sqlalchemy.create_engine(mssql.get_connection_url())
    with engine.begin() as connection:
        result = connection.execute(sqlalchemy.text("select @@VERSION"))

Using that image works without emulating amd64, but Ryuk sadly still fails to start. Interestingly, it mostly works when I try to debug and step through, so it may be a timing issue? Not really sure; it works maybe 1 in 5 attempts.

Update: I've run the tests a bunch of times on our Ubuntu 22.04 CI machines and it works fine there, and on my colleague's Windows machine 🙈

Having dug further into this, I strongly believe port bindings are not to blame for this problem. The 0.0.0.0:33029->8080/tcp, :::32775->8080/tcp in your docker ps output indicates that port 33029 on IPv4 and port 32775 on IPv6 are mapped to port 8080 inside your container on their respective IP interfaces. Nothing wrong with that.
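To make this concrete, here is a small illustrative sketch (not testcontainers code) that parses the PORTS column from the docker ps output above, showing that the IPv4 and IPv6 bindings are independent mappings to the same container port:

```python
import re

# PORTS column copied from the docker ps output in this issue.
ports = "0.0.0.0:33029->8080/tcp, :::32775->8080/tcp"

mappings = {}
for entry in ports.split(", "):
    # Each entry has the form "host_ip:host_port->container_port/proto".
    host_ip, host_port, container_port = re.match(
        r"(.+):(\d+)->(\d+)/tcp", entry
    ).groups()
    family = "ipv6" if host_ip == "::" else "ipv4"
    mappings[family] = (int(host_port), int(container_port))

# Different host ports per IP family, both forwarding to 8080 in the container.
print(mappings)  # {'ipv4': (33029, 8080), 'ipv6': (32775, 8080)}
```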

If this behavior appears randomly in some cases and consistently when using breakpoint(), I agree it looks more like a timing issue. The mysterious thing is that the wait strategy for the Ryuk container is identical to the wait strategy in the Java implementation, which doesn't report the same problem. At the point of the ConnectionRefusedError, are you 100% sure the Ryuk container is running at all? The only case I can think of is that RYUK_RECONNECTION_TIMEOUT is set so low that Ryuk terminates before a socket is connected. Could you try updating to the latest release (4.3.1) and setting the env variable RYUK_RECONNECTION_TIMEOUT=30s?
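For anyone trying this suggestion: the variable just needs to be in the process environment before the first container (and thus the Reaper) is started. A minimal sketch, setting it from Python rather than the shell:

```python
import os

# Must run before the first container is created, since testcontainers
# reads this when the Reaper starts. Equivalent to running the test suite as:
#   RYUK_RECONNECTION_TIMEOUT=30s pytest ...
os.environ["RYUK_RECONNECTION_TIMEOUT"] = "30s"
```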

I experienced the same problem on a Mac. Downgrading testcontainers (4.3.2 -> 3.7.1) fixed the issue.

The same happened to me; fixed by downgrading to 3.7.1.
Why does the Ryuk container not run during tests on 3.7.1? It seems testcontainers >= 4 now uses Ryuk during test execution.

RYUK_RECONNECTION_TIMEOUT=30s doesn't do anything for me on 4.3.3 but setting a breakpoint here and pausing for a split second reliably works.

Sounds like ports become available later on colima, so if we want to be compatible with colima's deviations from Docker Engine, we'd want to actually check those ports and not just wait on logs.
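A sketch of what such a check could look like. wait_for_port is a hypothetical helper, not part of the testcontainers API; it probes the mapped host port until a TCP connect succeeds, instead of relying on log output alone:

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 20.0) -> None:
    """Block until (host, port) accepts a TCP connection, or raise OSError."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Use a fresh socket per attempt; a failed connect() can leave
            # the old socket unusable on some platforms.
            with socket.create_connection((host, port), timeout=1.0):
                return
        except OSError:
            if time.monotonic() >= deadline:
                raise
            time.sleep(0.25)
```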

Does this tweak, which retries for ~20 seconds, help?

pip install git+https://github.com/testcontainers/testcontainers-python.git@issue486_explore_retry

It doesn't because an unhandled OSError gets thrown.

Simply handling the OSError doesn't help either. I get these exceptions:

[Errno 61] Connection refused
[Errno 22] Invalid argument
[Errno 22] Invalid argument
...
[Errno 22] Invalid argument

Something like this appears to resolve it but I don't know enough about this library, Python, or sockets to know if it's the correct approach.

        # Retry for up to ~25 seconds. The socket must be recreated on every
        # attempt: a failed connect() leaves it unusable, which is why retrying
        # on the same socket produced [Errno 22] Invalid argument above.
        # (Assumes `from time import sleep` at module level.)
        last_connection_exception: Optional[OSError] = None
        for _ in range(50):
            try:
                Reaper._socket = socket()
                Reaper._socket.connect((container_host, container_port))
                last_connection_exception = None
                break
            except OSError as e:
                last_connection_exception = e
                sleep(0.5)
        if last_connection_exception:
            raise last_connection_exception

@pseidel-kcf thanks for testing, I've updated my branch. From the perspective of maintaining this library, the missing insights are into colima: per the hypothesis that this is a colima timing bug (a bug in the sense that it doesn't match the behavior of Docker Engine), this approach could be the one to go with.

Thanks @alexanderankin. I didn't do a great job explaining, but I found that I needed to recreate the socket in addition to handling the exception type.

rvem commented

which are for some reason different ports.

Looks like another instance of moby/moby#42442

I had the same issue of ConnectionRefused on linux via Rancher Desktop (colima based), and using the issue486_explore_retry branch it's fixed for me.

Before I saw this thread, I investigated by putting a breakpoint() around the socket connection, and even waiting <0.5s before continuing fixed the connection refused, same as others above.
So I'm strongly leaning towards a timing issue in colima (behaviour deviating from Docker Engine), especially since the linked branch fixed it for me.

Note that the main branch otherwise fails most tests due to the missing Ryuk connection, and has ever since 4.1 introduced it! Same on our work Intel MacBooks: testcontainers-py >= 4.1 via Rancher Desktop is a no-go for us, so we had to pin to <= 3.7.

I suggest polishing this timing/retry branch and considering it for merging, if it proves a good compromise.

Alright, I'm going to merge the associated PR; this will close this issue. Please try the next release (4.4.0) when it's released in a couple of minutes,

and re-open/comment (and we'll reopen this issue) if needed.