Feature: Retries due to periodic failure of underlying `docker` commands (ex. `rm`)?
t3hmrman opened this issue · 3 comments
As always thanks for the awesome library, it's been incredibly useful for testing.
I've been doing some stress-testing on my test suite (i.e. running the tests continuously until one fails) lately and found that sometimes the `Cli` client actually fails to perform some lower-level docker CLI commands.
The first failure I encountered was while creating a container, but unfortunately I didn't have `--nocapture` on, so I couldn't capture the output. After repeating the process I reproduced the failure:
.......... TERMINATING [>120.000s] project::mod components::mod::inner::test_name_ci_serial
thread '<unnamed>' panicked at /path/to/.cargo/registry/src/index.crates.io-6f17d22bba15001f/testcontainers-0.15.0/src/clients/cli.rs:354:9:
Failed to remove docker container
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'components::mod::inner::test_name_ci_serial' panicked at /path/to/.cargo/registry/src/index.crates.io-6f17d22bba15001f/testcontainers-0.15.0/src/clients/cli.rs:354:9:
Failed to remove docker container
test components::mod::inner::test_name_ci_serial ... FAILED
failures:
failures:
components::mod::inner::test_name_ci_serial
test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 10 filtered out; finished in 120.47s
TIMEOUT [ 120.474s] project::mod components::mod::inner::test_name_ci_serial
Canceling due to test failure
------------
Summary [ 188.463s] 11 tests run: 10 passed, 1 timed out, 0 skipped
TIMEOUT [ 120.474s] project::mod components::mod::inner::test_name_ci_serial
error: test run failed
error: Recipe `test-int` failed on line 124 with exit code 100
I've anonymized the details of the project and test suite, but it should be clear that the failure happened inside (though was not the fault of) testcontainers.
Looking at the output of my docker systemd service, I see a failure to write stderr (emphasis via spacing added below):
Nov 24 12:14:30 host dockerd[13873]: time="2023-11-24T12:14:30.456846562+09:00" level=info msg="ignoring event" container=444c245cd67691e5ad93decc764347c145299ffc33636300ffeff89348cdbca3 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
...(many identical info messages for different containers)...
Nov 24 12:16:31 host dockerd[13873]: time="2023-11-24T12:16:31.062710838+09:00" level=error msg="Error running exec 53c30d4cedccd888d169f833fca4db8fab88d5520aadc2df6edc4affde44174b in container: exec attach failed: error attaching stderr stream: write unix /run/docker.sock->@: write: broken pipe"
Nov 24 12:16:31 host dockerd[13873]: time="2023-11-24T12:16:31.094747073+09:00" level=info msg="ignoring event" container=622b59590b33913f5ecbc7e93d01bad6a2fe4b610cda63f9c12a42efa79d9a18 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
...(many identical info messages for different containers)...
After this error came up I restarted the test suite and it worked just fine -- the lower-level failure seems to be transient.
Does it make sense to add error detection and/or a dumb retry policy at this level in the underlying client? I'm not sure if there's a better way to handle this, and unfortunately I hadn't increased docker's log level, so the daemon wasn't more specific about why it failed (as it has been for others).
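To illustrate what I mean by a "dumb" retry policy, here's a minimal sketch. The `retry` helper, attempt count, and delay are all hypothetical values I made up for this example -- nothing here is part of the testcontainers API; it just shows the shape of wrapping a fallible docker CLI invocation:

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical helper: retry a fallible operation up to `max_attempts`
/// times, sleeping `delay` between attempts. Not part of testcontainers.
fn retry<T, E, F>(max_attempts: u32, delay: Duration, mut op: F) -> Result<T, E>
where
    F: FnMut() -> Result<T, E>,
{
    let mut attempt = 1;
    loop {
        match op() {
            Ok(v) => return Ok(v),
            // Out of attempts: surface the last error to the caller.
            Err(e) if attempt >= max_attempts => return Err(e),
            // Transient failure (e.g. a broken-pipe from dockerd): back off and retry.
            Err(_) => {
                attempt += 1;
                sleep(delay);
            }
        }
    }
}

fn main() {
    // Simulate a `docker rm` that fails transiently twice before succeeding.
    let mut calls = 0;
    let result = retry(3, Duration::from_millis(10), || {
        calls += 1;
        if calls < 3 { Err("write: broken pipe") } else { Ok(calls) }
    });
    assert_eq!(result, Ok(3));
}
```

In the real client the closure would wrap the `std::process::Command` invocation of the docker binary, and ideally only retry on errors that look transient rather than on everything.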
Are you running your tests concurrently? I wouldn't be surprised if there are race conditions within the docker CLI; I ran into some myself. You can try using the experimental HTTP client, which talks to the docker daemon directly.
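For anyone else landing here: opting into the experimental client should be roughly a Cargo feature change. The exact feature name (`experimental`) is an assumption on my part and may differ by version, so verify it against the crate docs for the release you're on:

```toml
# Cargo.toml -- feature name assumed; check the testcontainers docs for your version.
[dependencies]
testcontainers = { version = "0.15", features = ["experimental"] }
```

If I recall correctly, the experimental client is exposed as an async `clients::Http` alongside the blocking `clients::Cli`, but again, confirm against the documentation rather than taking this comment's word for it.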
Ah, so if this is a known issue then is the way to resolve this just to add some documentation to recommend switching to the experimental HTTP client method for now?
I didn't say that it is a known issue, but you can try the experimental client to narrow down which component causes the problem.