kudulab/dojo

Infrequent Dojo crash when using docker-compose driver

tomzo opened this issue · 3 comments

tomzo commented

The Dojo process crashes from time to time (around 1 in 100 runs) when run with the docker-compose driver. This causes some of the containers created by docker-compose to stay running on the CI agent, because after the crash there is nothing left to clean them up.

It seems that there are 2 things wrong here:

  1. docker-compose ps is called before _app_1 has been created. I suppose the ps is part of the background monitoring process that checks whether the containers are running, but perhaps it kicks in too early...
  2. An exit status of 1 from docker-compose causes the Dojo process to panic. That should just never happen; see the sketch after this list.
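
To illustrate point 2, here is a minimal Go sketch (hypothetical names, not the actual Dojo source) of how a non-zero exit from docker-compose ps could be surfaced as an error value for the caller to handle, rather than a panic inside the monitoring goroutine:

package main

import (
	"errors"
	"fmt"
	"os/exec"
	"strings"
)

// composePs runs `docker-compose -f <files...> -p <project> ps -q` and returns
// the container IDs it prints. A non-zero exit status becomes an error value
// that the caller can inspect, instead of a panic.
func composePs(files []string, project string) ([]string, error) {
	args := []string{}
	for _, f := range files {
		args = append(args, "-f", f)
	}
	args = append(args, "-p", project, "ps", "-q")

	out, err := exec.Command("docker-compose", args...).Output()
	if err != nil {
		var exitErr *exec.ExitError
		if errors.As(err, &exitErr) {
			// e.g. "No such container: <id>" when ps races with `docker-compose run`
			return nil, fmt.Errorf("docker-compose ps failed: %v, stderr: %s", err, exitErr.Stderr)
		}
		return nil, fmt.Errorf("docker-compose ps failed: %w", err)
	}
	return strings.Fields(string(out)), nil
}

func main() {
	// Illustrative file and project names only.
	ids, err := composePs([]string{"docker-compose-dtest.yml", "docker-compose-dtest.yml.dojo"}, "dojo-example")
	if err != nil {
		fmt.Println("ps not ready yet:", err)
		return
	}
	fmt.Println("containers:", ids)
}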

Logs from CI:

2020/12/01 15:15:53 [ 1]  INFO: (main.main) Dojo version 0.7.0
2020/12/01 15:15:53 [ 4]  INFO: (main.DockerComposeDriver.HandleRun) docker-compose run command will be:
 docker-compose -f docker-compose-dtest.yml -f docker-compose-dtest.yml.dojo -p dojo-******** run --rm -T default "./tasks _test_docker"
Creating network "dojo-***********_default" with the default driver
Pulling app (*************.amazonaws.com/**********)...
2bedf8f: Pulling from *****/app
Creating ***********_db_1 ... 
Creating ***********_db_1 ... done
Creating ****************************_app_1 ... 
panic: Unexpected exit status:
Command: docker-compose -f docker-compose-dtest.yml -f docker-compose-dtest.yml.dojo -p dojo-********* ps
  Exit status: 1
  StdOut: <empty string>
  StdErr: No such container: ded74698ff6c7539c16d506ac0d05a8ccc1884e8da4a3030f50c8b68d2de63a2


goroutine 20 [running]:
main.DockerComposeDriver.getDCContainersNames(0x535e80, 0xc00006e0c0, 0x5367e0, 0xc000062300, 0xc000062300, 0xc0000601e0, 0x510825, 0x3, 0x7ffd4fef2a4a, 0xe, ...)
	/dojo/work/src/dojo/docker_compose_driver.go:601 +0x7f3
main.DockerComposeDriver.waitForContainersToBeRunning(0x535e80, 0xc00006e0c0, 0x5367e0, 0xc000062300, 0xc000062300, 0xc0000601e0, 0x510825, 0x3, 0x7ffd4fef2a4a, 0xe, ...)
	/dojo/work/src/dojo/docker_compose_driver.go:237 +0x170
main.DockerComposeDriver.watchContainers(0x535e80, 0xc00006e0c0, 0x5367e0, 0xc000062300, 0xc000062300, 0xc0000601e0, 0x510825, 0x3, 0x7ffd4fef2a4a, 0xe, ...)
	/dojo/work/src/dojo/docker_compose_driver.go:270 +0x1d6
created by main.DockerComposeDriver.HandleRun
	/dojo/work/src/dojo/docker_compose_driver.go:390 +0x5ec
Creating  ****************************_app_1  ... done
xmik commented

A workaround was released in Dojo 0.10.3; however, I couldn't reproduce this error.

xmik commented

Reproduced on CircleCI here: https://app.circleci.com/pipelines/github/kudulab/dojo/51/workflows/43161f70-6d0f-40c0-9a63-59e17e21b965/jobs/175 using commit 5a344fe

Log messages:

DEBUG: (main.DockerComposeDriver.HandleRun) Exit status from run command: 0
2024/02/04 07:20:12 [ 5] DEBUG: (main.DockerComposeDriver.HandleRun) Collecting information from non default containers
2024/02/04 07:20:12 [ 8] ERROR: (main.DockerComposeDriver.getDCContainersNames) Unexpected exit status:
Command: docker-compose -f ./test/test-files/itest-dc.yaml -f ./test/test-files/itest-dc.yaml.dojo -p testdojorunid ps --format json --all
Exit status: 1
StdOut: <empty string>
StdErr: Error response from daemon: No such container: 731492b22407b5d22db460ac5daee3a2e46e24286dfd4f6916b09457018eb66b
2024/02/04 07:20:12 [ 8] DEBUG: (main.DockerComposeDriver.waitForContainersToBeRunning) Containers not yet created: testdojorunid
2024/02/04 07:20:12 [ 5] DEBUG: (main.DockerComposeDriver.stop) Stopping containers
2024/02/04 07:20:12 [ 5]  INFO: (main.DockerComposeDriver.stop) Stopping containers with command:
docker-compose -f ./test/test-files/itest-dc.yaml -f ./test/test-files/itest-dc.yaml.dojo -p testdojorunid stop
Container testdojorunid-abc-1  Stopping
Container testdojorunid-abc-1  Stopped
2024/02/04 07:20:12 [ 5] DEBUG: (main.DockerComposeDriver.stop) Exit status from stop command: 0
xmik commented

This is not fixed in Dojo 0.12.0. It still happens rarely, and the workaround implemented in Dojo 0.10.3 is still in place: instead of panicking, Dojo prints an error log message. However, this leads to flaky tests.
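
One possible way to make this less flaky than log-and-continue would be to retry the ps call when the failure matches the known race. The sketch below is hypothetical, builds on the composePs helper from the sketch in the first comment, and additionally needs the fmt, strings, and time imports; none of these names come from the Dojo codebase.

// composePsWithRetry retries `docker-compose ps` a few times when the failure
// looks like the known "No such container" race, instead of logging once and
// giving up.
func composePsWithRetry(files []string, project string, attempts int) ([]string, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		ids, err := composePs(files, project)
		if err == nil {
			return ids, nil
		}
		if !strings.Contains(err.Error(), "No such container") {
			return nil, err // unknown failure: do not mask it by retrying
		}
		lastErr = err
		time.Sleep(500 * time.Millisecond) // give docker-compose time to create the container
	}
	return nil, fmt.Errorf("docker-compose ps still failing after %d attempts: %w", attempts, lastErr)
}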