Bug: Ryuk fails to start due to port binding (colima, timing)
sondr3 opened this issue · 17 comments
Describe the bug
I upgraded from 3.5.0 to 4.1.0 and the container itself fails to spawn because the Ryuk container setup fails. I've tried debugging the issue and it looks like it is trying to bind the port exposed on IPv6 to the port on IPv4 (the container_port variable is correct for IPv4), which are for some reason different ports.
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
4f1bad20a38c testcontainers/ryuk:0.5.1 "/bin/ryuk" 7 seconds ago Up 5 seconds 0.0.0.0:33029->8080/tcp, :::32775->8080/tcp testcontainers-ryuk-1cf580e2-54c4-496d-a1d7-17f495911219
To Reproduce
> Reaper._socket.connect((container_host, container_port))
E ConnectionRefusedError: [Errno 61] Connection refused
Runtime environment
$ uname -a
Darwin jupiter.local 23.3.0 Darwin Kernel Version 23.3.0: Wed Dec 20 21:30:44 PST 2023; root:xnu-10002.81.5~7/RELEASE_ARM64_T6000 arm64
$ python --version
Python 3.11.7
$ docker info
Client: Docker Engine - Community
Version: 25.0.4
Context: colima
Debug Mode: false
Plugins:
compose: Docker Compose (Docker Inc.)
Version: 2.25.0
Path: /Users/sondre/.docker/cli-plugins/docker-compose
Server:
Containers: 4
Running: 0
Paused: 0
Stopped: 4
Images: 112
Server Version: 24.0.7
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 61f9fd88f79f081d64d6fa3bb1a0dc71ec870523
runc version: v1.1.9-0-gccaecfc
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: builtin
cgroupns
Kernel Version: 6.5.0-21-generic
Operating System: Ubuntu 23.10
OSType: linux
Architecture: aarch64
CPUs: 2
Total Memory: 3.817GiB
Name: colima
ID: ac2c6903-b356-409d-9301-b040440d1efd
Docker Root Dir: /var/lib/docker
Debug Mode: false
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
I can't reproduce this in my M1 environment, but after what I saw of IPv6 behavior while working with compose, I have no doubt there is potential for an issue here.
Hi @sondr3!
Sorry to hear that you are having problems. I am on an M3 setup myself, but haven't encountered the same problem that you have. Am I reading it right that you are using colima with an x86 VM as your Docker runtime on an M1 / arm64 system?
Do you encounter the same problem if you run with a native arm64 backend without virtualization/colima?
We'll follow up closely with you on this one, as Ryuk is important for us to run smoothly on all architectures.
@santi, correct. However, I can't run the image natively; I need to run the MSSQL Docker image for tests at $WORK and it only has amd64 images available :(
$ colima status
INFO[0000] colima is running using macOS Virtualization.Framework
INFO[0000] arch: aarch64
INFO[0000] runtime: docker
INFO[0000] mountType: virtiofs
INFO[0000] socket: unix:///Users/sondre/.colima/default/docker.sock
Ah, try using the mcr.microsoft.com/azure-sql-edge:1.0.7 image instead. It has ARM64 support and the API is compatible with mssql (note: I haven't tried it extensively, only used it as part of testing in this repo).
This doesn't really solve your problem, but worth a try:
import sqlalchemy
from testcontainers.mssql import SqlServerContainer

with SqlServerContainer("mcr.microsoft.com/azure-sql-edge:1.0.7") as mssql:
    engine = sqlalchemy.create_engine(mssql.get_connection_url())
    with engine.begin() as connection:
        result = connection.execute(sqlalchemy.text("select @@VERSION"))
Using that image works without emulating amd64, but Ryuk still fails to start, sadly. Interestingly, it mostly works when I try to debug and step through, so it may be a timing issue? Not really sure; it works maybe 1 in 5 attempts.
Update: I've run the tests a bunch of times on our Ubuntu 22.04 CI machines and it works fine there, and on my colleague's Windows machine 🙈
Having dug further into this, I strongly believe port bindings are not to blame for this problem. The 0.0.0.0:33029->8080/tcp, :::32775->8080/tcp of your docker ps output indicates that port 33029 on IPv4 and port 32775 on IPv6 are mapped to port 8080 on the inside of your container on their respective IP interfaces. Nothing wrong with that.
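For anyone who wants to verify this on their own setup, the per-interface bindings can be read straight from the Docker API. A minimal sketch using the Docker SDK for Python (the container name is taken from the docker ps output above):

import docker

client = docker.from_env()
container = client.containers.get("testcontainers-ryuk-1cf580e2-54c4-496d-a1d7-17f495911219")

# NetworkSettings.Ports maps each exposed container port to one host
# binding per interface: IPv4 ("0.0.0.0") and IPv6 ("::").
for binding in container.attrs["NetworkSettings"]["Ports"]["8080/tcp"]:
    print(binding["HostIp"], "->", binding["HostPort"])
# Expected for the docker ps above:
# 0.0.0.0 -> 33029
# :: -> 32775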
If this behavior appears randomly in some cases and consistently when using breakpoint(), I agree it looks more like a timing issue. The mysterious thing is that the wait strategy for the Ryuk container is identical to the wait strategy in the Java implementation, which doesn't report the same problem. At the point of the ConnectionRefusedError, are you 100% sure the Ryuk container is running at all? The only case I can think of is that RYUK_RECONNECTION_TIMEOUT is set so low that Ryuk terminates before a socket is connected. Could you try updating to the latest release (4.3.1) and setting an env variable as RYUK_RECONNECTION_TIMEOUT=30s?
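If it helps, one way to set that variable from Python is before testcontainers is imported, since the configuration may be read at import time (a minimal sketch; the mssql usage is just for illustration):

import os

# Go-style duration string; must be in place before Ryuk starts.
os.environ["RYUK_RECONNECTION_TIMEOUT"] = "30s"

from testcontainers.mssql import SqlServerContainer  # imported after setting the env var

with SqlServerContainer("mcr.microsoft.com/azure-sql-edge:1.0.7") as mssql:
    print(mssql.get_connection_url())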
I experienced the same problem on a Mac. Downgrading testcontainers (4.3.2 -> 3.7.1) fixed the issue.
The same happened to me. Fixed by downgrading to 3.7.1.
Why is the Ryuk container not running during tests in the 3.7.1 version? It seems testcontainers >= 4 now uses Ryuk during test execution.
RYUK_RECONNECTION_TIMEOUT=30s doesn't do anything for me on 4.3.3, but setting a breakpoint here and pausing for a split second reliably works.
Sounds like ports become available later on colima, so we'd want to actually check those ports and not just wait on logs if we want to be compatible with colima's differences from Docker.
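A port-based readiness check along those lines might look something like this rough sketch (illustrative only, not what the branch actually does; wait_for_port is a hypothetical helper):

import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 20.0) -> None:
    # Poll until a TCP connect succeeds instead of trusting that the port
    # is reachable as soon as the container logs its startup line.
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection builds a fresh socket on every attempt.
            with socket.create_connection((host, port), timeout=1.0):
                return
        except OSError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"{host}:{port} not reachable after {timeout}s")
            time.sleep(0.25)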
Does this tweak to retry for ~20 seconds help?
pip install git+https://github.com/testcontainers/testcontainers-python.git@issue486_explore_retry
It doesn't because an unhandled OSError gets thrown.
Simply handling the OSError doesn't help either. I get these exceptions:
[Errno 61] Connection refused
[Errno 22] Invalid argument
[Errno 22] Invalid argument
...
[Errno 22] Invalid argument
Something like this appears to resolve it but I don't know enough about this library, Python, or sockets to know if it's the correct approach.
# Patch sketch inside testcontainers' Reaper setup; Reaper, container_host,
# and container_port come from the surrounding library code.
from socket import socket
from time import sleep
from typing import Optional

last_connection_exception: Optional[OSError] = None
for _ in range(50):
    try:
        # Recreate the socket on every attempt; see discussion below.
        Reaper._socket = socket()
        Reaper._socket.connect((container_host, container_port))
        last_connection_exception = None
        break
    except OSError as e:
        last_connection_exception = e
        sleep(0.5)
if last_connection_exception:
    raise last_connection_exception
@pseidel-kcf thanks for testing, I've updated my branch. I think, from the perspective of maintaining this library, the missing insights are into colima. Per the hypothesis that this is a colima timing bug (a bug in the sense that it doesn't match the behavior of Docker Engine), this approach could be the one to go with.
Thanks @alexanderankin. I didn't do a great job explaining, but I found that I needed to recreate the socket in addition to handling the exception type.
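For what it's worth, that matches BSD/macOS socket semantics as I understand them: once connect() fails, the socket is left in an undefined state, and further connect() calls on it raise [Errno 22] Invalid argument, which is exactly the exception sequence shown above. A small sketch of the difference (port 9 is assumed to be closed on localhost):

from socket import socket

# Reusing one socket: the first attempt fails with "Connection refused",
# subsequent attempts on macOS fail with "Invalid argument".
s = socket()
for _ in range(3):
    try:
        s.connect(("127.0.0.1", 9))
    except OSError as e:
        print(e)

# A fresh socket per attempt keeps every retry meaningful.
for _ in range(3):
    try:
        socket().connect(("127.0.0.1", 9))
    except OSError as e:
        print(e)  # "Connection refused" each time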
> which are for some reason different ports.
Looks like another instance of moby/moby#42442
I had the same issue of ConnectionRefused on Linux via Rancher Desktop (colima based), and using the issue486_explore_retry branch it's fixed for me.
Before I saw this thread, I investigated using a new breakpoint() around the socket connection, and even waiting <0.5s before continuing fixed the connection refused, same as others above.
So I'm strongly leaning towards a timing issue in colima (behaviour deviating from Docker Engine), especially since the linked branch fixed it for me.
Note that the main branch otherwise fails most tests due to the missing Ryuk connection, and has ever since 4.1 introduced it! Same on work's Intel MacBooks: testcontainers-py >=4.1 via Rancher Desktop is a no-go for us; we had to pin to <=3.7.
I suggest polishing this timing/retry branch and considering a merge if it proves a good compromise.
Alright, I'm going to merge the associated PR; this will close this issue. Please try the next release (4.4.0) when it's released in a couple of minutes, and comment on this issue (and we'll reopen it) if needed.