firecow/gitlab-ci-local

Leaks networks on every run

Minimal .gitlab-ci.yml illustrating the issue

---
stages:
  - package

docker_build:
  stage: package
  image: docker:latest
  services:
    - docker:dind
  script:
    - echo "blablabla"

Expected behavior
After the job finishes, the network created so that the service and job containers can talk to each other should be removed.

Host information
macOS
gitlab-ci-local 4.55.0

Containerd binary
docker

Additional context
https://github.com/firecow/gitlab-ci-local/blob/master/src/job.ts#L543: the network created here does not seem to be tracked for cleanup.

I cannot reproduce this.

Also, the code you reference shows that a serviceNetworkId is stored and then used in the cleanup function.
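The pattern described here (store the network id when the network is created, remove it during cleanup) only prevents leaks if cleanup runs on every path, including when the job script throws. A minimal TypeScript sketch of that pattern, assuming a try/finally structure: JobSketch, DockerCli, and run are hypothetical names, not gitlab-ci-local's actual classes, and the docker CLI is injected as a function so the logic can be exercised without Docker.

```typescript
// Hypothetical sketch (not gitlab-ci-local's code): track the per-job network
// id as soon as it exists and remove it in a finally block, so it is cleaned
// up whether the job script succeeds or throws.
type DockerCli = (args: string[]) => string;

class JobSketch {
  private readonly docker: DockerCli;
  private readonly jobId: number;
  private serviceNetworkId: string | undefined;

  constructor(docker: DockerCli, jobId: number) {
    this.docker = docker;
    this.jobId = jobId;
  }

  run(script: () => void): void {
    try {
      // `docker network create` prints the new network's id on stdout.
      this.serviceNetworkId = this.docker([
        "network",
        "create",
        `gitlab-ci-local-${this.jobId}`,
      ]).trim();
      script();
    } finally {
      this.cleanup();
    }
  }

  private cleanup(): void {
    if (this.serviceNetworkId === undefined) {
      return; // network creation never succeeded, nothing to remove
    }
    const id = this.serviceNetworkId;
    this.serviceNetworkId = undefined; // avoid double removal
    this.docker(["network", "rm", id]);
  }
}
```

If the maintainer's reading is right and the real code already follows this shape, a leak would point to cleanup being cut short somewhere before the network removal rather than to the id not being tracked.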

Hi @firecow,

This is my current "docker network ls" output (minus some entries redacted for my company):

➜   docker network ls
NETWORK ID     NAME                     DRIVER    SCOPE
51fbb22039db   bridge                   bridge    local
9fe84a65a347   docker_gwbridge          bridge    local
6df0aef1c8a5   gitlab-ci-local-130397   bridge    local
6e78377fe61e   gitlab-ci-local-200711   bridge    local
29b6022248e3   gitlab-ci-local-201744   bridge    local
176f9fc46cc9   gitlab-ci-local-235698   bridge    local
dd8619c29826   gitlab-ci-local-284263   bridge    local
d9cc612fdb5a   gitlab-ci-local-351190   bridge    local
884c9c02eee9   gitlab-ci-local-371592   bridge    local
d147c413d3f5   gitlab-ci-local-375682   bridge    local
1f7e90481cfc   gitlab-ci-local-501394   bridge    local
cdbf32f7f9e6   gitlab-ci-local-535650   bridge    local
1b4057b7b5f9   gitlab-ci-local-558862   bridge    local
b6b57e9795c8   gitlab-ci-local-574073   bridge    local
7e34c53c5bff   gitlab-ci-local-579972   bridge    local
ccd262ce6df9   gitlab-ci-local-654062   bridge    local
02cb192c820a   gitlab-ci-local-668695   bridge    local
e866a4a3540a   gitlab-ci-local-714030   bridge    local
23964309e2f7   gitlab-ci-local-738116   bridge    local
54d988391d24   gitlab-ci-local-768931   bridge    local
f8da1545297a   host                     host      local
y18vrgxhbg68   ingress                  overlay   swarm

In theory I can see the network being cleaned up, but in practice something is probably going wrong. I would expect to see a message somewhere based on https://github.com/firecow/gitlab-ci-local/blob/master/src/job.ts#L593. Alternatively, maybe the container asserts caused the network cleanup to be skipped? I suspect the latter, since those asserts are inside the catch. I'm not familiar with JavaScript's "assert" and its best practices, but I don't understand how the thrown error wouldn't be an "Error" instance...
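On the last point: Node's assert.AssertionError is a subclass of Error, so an assert failing inside the catch would still be an Error instance. The more likely problem is control flow: anything thrown from inside a catch block aborts every statement after it, including a network removal. The generic TypeScript sketch below (the Step type and both cleanup functions are hypothetical, not gitlab-ci-local's code) contrasts cleanup that stops at the first failure with cleanup that attempts every step and rethrows afterwards.

```typescript
import { AssertionError } from "node:assert";

// Hypothetical cleanup step; names are illustrative only.
type Step = { name: string; run: () => void };

// Fragile: the first step that throws (e.g. an assert on leftover containers)
// aborts every later step, so the network removal never runs.
function cleanupFragile(steps: Step[], log: string[]): void {
  for (const step of steps) {
    step.run();
    log.push(step.name);
  }
}

// Robust: every step is attempted; the first failure is rethrown at the end,
// so the error is still reported but the network is still removed.
function cleanupRobust(steps: Step[], log: string[]): void {
  let failed = false;
  let firstError: unknown = undefined;
  for (const step of steps) {
    try {
      step.run();
      log.push(step.name);
    } catch (e) {
      if (!failed) {
        failed = true;
        firstError = e;
      }
    }
  }
  if (failed) {
    throw firstError;
  }
}
```

In both versions an AssertionError passes an instanceof Error check; only the robust one still reaches the network step.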

These were likely leaked by the test suite; you can replicate it by running npm run test and waiting a few seconds.

I have never run the gitlab-ci-local test suite. I run gitlab-ci-local from the Homebrew installation.

These were probably leaked by my job that runs Docker-in-Docker and failed to complete. The failure probably triggered other failures (containers not being removed, and the rest of the cleanup being skipped?).

Hmm, I see, not sure then. I don't really run Docker-in-Docker pipelines.

Hopefully it's something that can be reproduced.

I just hit the problem with a job that simply runs inside a container, with no failure that I know of...

For anyone having the same issue, here is what I ran:

for network in $(docker network ls --filter name=gitlab-ci-local --format '{{.Name}}'); do echo "$network"; docker network rm "$network"; done
gitlab-ci-local-9409
gitlab-ci-local-9409
gitlab-ci-local-95666
gitlab-ci-local-95666
gitlab-ci-local-130397
gitlab-ci-local-130397
gitlab-ci-local-200711
gitlab-ci-local-200711
gitlab-ci-local-201744
gitlab-ci-local-201744
gitlab-ci-local-235698
gitlab-ci-local-235698
gitlab-ci-local-284263
gitlab-ci-local-284263
gitlab-ci-local-351190
gitlab-ci-local-351190
gitlab-ci-local-371592
gitlab-ci-local-371592
gitlab-ci-local-375682
gitlab-ci-local-375682
gitlab-ci-local-451685
gitlab-ci-local-451685
gitlab-ci-local-501394
gitlab-ci-local-501394
gitlab-ci-local-509319
gitlab-ci-local-509319
gitlab-ci-local-535650
gitlab-ci-local-535650
gitlab-ci-local-536928
gitlab-ci-local-536928
gitlab-ci-local-558862
gitlab-ci-local-558862
gitlab-ci-local-562280
gitlab-ci-local-562280
gitlab-ci-local-574073
gitlab-ci-local-574073
gitlab-ci-local-579972
gitlab-ci-local-579972
gitlab-ci-local-654062
gitlab-ci-local-654062
gitlab-ci-local-668695
gitlab-ci-local-668695
gitlab-ci-local-700167
gitlab-ci-local-700167
gitlab-ci-local-714030
gitlab-ci-local-714030
gitlab-ci-local-738116
gitlab-ci-local-738116
gitlab-ci-local-768931
gitlab-ci-local-768931
gitlab-ci-local-788165
gitlab-ci-local-788165
gitlab-ci-local-859507
gitlab-ci-local-859507
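The same cleanup can be scripted in TypeScript, the language gitlab-ci-local is written in. This is a hedged sketch: leakedNetworks and removeLeakedNetworks are hypothetical helpers, and matching on the "gitlab-ci-local-" prefix assumes that is the naming scheme the tool uses (it is the only one visible in the listings above). Unlike the substring match of "--filter name=", a prefix check will not touch an unrelated network that merely contains "gitlab".

```typescript
import { execFileSync } from "node:child_process";

// Pick the network names that start with the prefix seen in the listings above.
function leakedNetworks(nameList: string, prefix = "gitlab-ci-local-"): string[] {
  return nameList
    .split("\n")
    .map((line) => line.trim())
    .filter((name) => name.startsWith(prefix));
}

// Remove every matching network. Docker refuses to remove a network that
// still has containers attached, so one in use by a running job is skipped.
function removeLeakedNetworks(): string[] {
  const names = execFileSync("docker", ["network", "ls", "--format", "{{.Name}}"], {
    encoding: "utf8",
  });
  const removed: string[] = [];
  for (const name of leakedNetworks(names)) {
    try {
      execFileSync("docker", ["network", "rm", name], { stdio: "ignore" });
      removed.push(name);
    } catch {
      // Still in use or already gone; leave it and move on.
    }
  }
  return removed;
}

// Only touch Docker when explicitly asked to.
if (process.argv.includes("--run")) {
  console.log(removeLeakedNetworks());
}
```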

I got it again this morning :) I'm pretty sure the networks accumulate when a job fails...

I'm currently debugging one of our jobs that starts a mockserver service container, starts our web component, and then runs tests against it. One of my tests failed, which failed the job, but there was nothing special about it...

I ran that job at least 50 times yesterday...

I got it again yesterday, and today as well. I will try to notice which job leaves a network behind... The problem is that my workflow depends on gitlab-ci-local to run anything 😅 (we're sold on the idea that everything must be runnable both in CI and locally).

However, of the jobs I've been running, one just starts a mockserver service and the other is a shell job (unrelated to Docker)...

At this point, I'm fairly sure the network isn't cleaned up when a job that has a service fails... I don't know how to dig up more information for this ticket. If you have an idea, please let me know.

OK, I can now say for sure that the leak also happens on successful runs...

The job that leaks uses a service with an alias... I still haven't been able to determine what causes the leak. Is the service container not shutting down fast enough?
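One plausible mechanism behind that question, offered as an assumption rather than a diagnosis: Docker refuses to remove a network while a container is still attached to it, so if the removal runs before the service container has fully gone away, it fails and the network stays behind. A common mitigation is to retry the removal briefly. The sketch below is hypothetical (removeNetworkWithRetry, the attempt count, and the delay are illustrative choices), and the actual removal call is passed in as remove.

```typescript
// Removal callback, e.g. a wrapper around `docker network rm <name>`.
type RemoveFn = (network: string) => void | Promise<void>;

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: retry removal a few times in case containers are
// still detaching. Resolves to true on success, false after the last attempt fails.
async function removeNetworkWithRetry(
  network: string,
  remove: RemoveFn,
  attempts = 5,
  delayMs = 500,
): Promise<boolean> {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      await remove(network);
      return true;
    } catch {
      if (attempt < attempts) {
        await sleep(delayMs); // give attached containers time to go away
      }
    }
  }
  return false;
}
```

Returning false instead of throwing lets the caller log the leftover network and continue with the rest of its cleanup.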

I get a similar error after a varying number of runs:

Error: Command failed with exit code 1: docker network create gitlab-ci-local-261361
Error response from daemon: all predefined address pools have been fully subnetted
    at makeError (/opt/homebrew/Cellar/gitlab-ci-local/4.56.0/libexec/lib/node_modules/gitlab-ci-local/node_modules/execa/lib/error.js:60:11)
    at handlePromise (/opt/homebrew/Cellar/gitlab-ci-local/4.56.0/libexec/lib/node_modules/gitlab-ci-local/node_modules/execa/index.js:118:26)
    at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
    at async Job.createDockerNetwork (file:///opt/homebrew/Cellar/gitlab-ci-local/4.56.0/libexec/lib/node_modules/gitlab-ci-local/src/job.js:1053:39)
    at async Job.start (file:///opt/homebrew/Cellar/gitlab-ci-local/4.56.0/libexec/lib/node_modules/gitlab-ci-local/src/job.js:442:13)
    at async /opt/homebrew/Cellar/gitlab-ci-local/4.56.0/libexec/lib/node_modules/gitlab-ci-local/node_modules/p-map/index.js:57:22

I tried enlarging Docker's address pool, but that's only a band-aid. The real fix would be to remove the networks when Docker removes the containers.
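The error above means Docker has no free subnet left in its address pools, and each leaked network holds one subnet until it is removed. A pool entry (the "base" and "size" fields of the daemon's default-address-pools setting) yields 2^(size − base prefix length) networks. The defaults listed below are the local-scope pools found in Docker's libnetwork ipamutils as far as I know; treat them as an assumption and check your Docker version. They add up to only 31 networks, so a few dozen leaks are enough to hit the error.

```typescript
// One address-pool entry: a base CIDR and the prefix length of each network carved from it.
type Pool = { base: string; size: number };

// Number of networks a pool yields: 2^(size - basePrefix).
function subnetsIn(pool: Pool): number {
  const basePrefix = Number(pool.base.split("/")[1]);
  if (!Number.isInteger(basePrefix) || pool.size < basePrefix) {
    return 0; // malformed entry, or networks larger than the pool itself
  }
  return 2 ** (pool.size - basePrefix);
}

function totalNetworks(pools: Pool[]): number {
  return pools.reduce((sum, pool) => sum + subnetsIn(pool), 0);
}

// Assumed local-scope defaults (may differ between Docker versions).
const dockerDefaults: Pool[] = [
  { base: "172.17.0.0/16", size: 16 },
  { base: "172.18.0.0/16", size: 16 },
  { base: "172.19.0.0/16", size: 16 },
  { base: "172.20.0.0/14", size: 16 },
  { base: "172.24.0.0/14", size: 16 },
  { base: "172.28.0.0/14", size: 16 },
  { base: "192.168.0.0/16", size: 20 },
];

console.log(totalNetworks(dockerDefaults)); // 1+1+1+4+4+4+16 = 31
console.log(subnetsIn({ base: "10.10.0.0/16", size: 24 })); // 256
```

An enlarged pool such as base 10.10.0.0/16 with size 24 raises the ceiling to 256 networks, which buys time but, as noted, does not stop the leak itself.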

I run docker network prune to remove all unused networks and just start over.