ai-dock/stable-diffusion-webui

Fails to start on Paperspace Gradient with GPU machine

MrNeon opened this issue · 8 comments

MrNeon commented

Since a couple of days ago, the container fails to start when running on a GPU instance in Paperspace Gradient. For some reason it works fine on a CPU-only instance, but Paperspace has refused to explain why or to explore the issue.

It seems to be related to the service portal and port 1111:

{"what":"agent","message":"Loading settings configuration...","level":"info","timestamp":"2023-11-29T09:26:39.278Z"}
{"port":8889,"level":"info","message":"Started on","timestamp":"2023-11-29T09:26:39.279Z"}
{"message":"Finished initializing","level":"info","timestamp":"2023-11-29T09:26:39.280Z"}
/opt/ai-dock/bin/init.sh: line 335: =: command not found
Environment unsuitable for rclone mount...
rclone remains available via CLI
Looking for config.sh...
Not found
2023-11-29 09:26:40,703 INFO Included extra file "/etc/supervisor/supervisord/conf.d/caddy.conf" during parsing
2023-11-29 09:26:40,703 INFO Included extra file "/etc/supervisor/supervisord/conf.d/cloudflared.conf" during parsing
2023-11-29 09:26:40,703 INFO Included extra file "/etc/supervisor/supervisord/conf.d/logtail.conf" during parsing
2023-11-29 09:26:40,703 INFO Included extra file "/etc/supervisor/supervisord/conf.d/quicktunnel.conf" during parsing
2023-11-29 09:26:40,703 INFO Included extra file "/etc/supervisor/supervisord/conf.d/rclone_mount.conf" during parsing
2023-11-29 09:26:40,703 INFO Included extra file "/etc/supervisor/supervisord/conf.d/serverless.conf" during parsing
2023-11-29 09:26:40,703 INFO Included extra file "/etc/supervisor/supervisord/conf.d/serviceportal.conf" during parsing
2023-11-29 09:26:40,703 INFO Included extra file "/etc/supervisor/supervisord/conf.d/sshd.conf" during parsing
2023-11-29 09:26:40,704 INFO Set uid to user 0 succeeded
2023-11-29 09:26:40,711 INFO RPC interface 'supervisor' initialized
2023-11-29 09:26:40,711 CRIT Server 'inet_http_server' running without any HTTP authentication checking
2023-11-29 09:26:40,712 INFO RPC interface 'supervisor' initialized
2023-11-29 09:26:40,712 INFO supervisord started with pid 92
2023-11-29 09:26:41,715 INFO spawned: 'logtail' with pid 98
2023-11-29 09:26:41,718 INFO spawned: 'serverless' with pid 99
2023-11-29 09:26:41,720 INFO spawned: 'serviceportal' with pid 100
2023-11-29 09:26:41,722 INFO spawned: 'sshd' with pid 102
2023-11-29 09:26:41,723 INFO spawned: 'caddy' with pid 103
Starting logtail service...
2023-11-29 09:26:43,723 INFO success: serverless entered RUNNING state, process has stayed up for > than 2 seconds (startsecs)
Gathering logs...==> /var/log/config.log <==

==> /var/log/sync.log <==
Skipping workspace sync: Mamba environments remain in /opt

==> /var/log/preflight.log <==
Looking for preflight.sh...
Empty preflight.sh...

==> /var/log/debug.log <==

==> /var/log/provisioning.log <==
Looking for provisioning.sh...
Not found

==> /var/log/supervisor/caddy.log <==
{"level":"info","ts":1701250003.848939,"msg":"using provided configuration","config_file":"/opt/caddy/etc/Caddyfile","config_adapter":""}
{"level":"warn","ts":1701250003.8504975,"logger":"http.auto_https","msg":"automatic HTTPS is completely disabled for server","server_name":"srv0"}
{"level":"warn","ts":1701250003.8505342,"logger":"http.auto_https","msg":"automatic HTTPS is completely disabled for server","server_name":"srv1"}
Error: loading initial config: loading new config: http app module: start: listening on :11111: listen tcp :11111: bind: address already in use

==> /var/log/supervisor/serverless.log <==
Refusing to start serverless worker without $SERVERLESS=true

==> /var/log/supervisor/serviceportal.log <==
Starting Service Portal...
INFO:     Started server process [100]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:11111 (Press CTRL+C to quit)

==> /var/log/supervisor/sshd.log <==
/root/.ssh/authorized_keys is not a public key file.
Skipping SSH server: No public key

==> /var/log/supervisor/supervisor.log <==
2023-11-29 09:26:40,703 INFO Included extra file "/etc/supervisor/supervisord/conf.d/caddy.conf" during parsing
2023-11-29 09:26:40,703 INFO Included extra file "/etc/supervisor/supervisord/conf.d/cloudflared.conf" during parsing
2023-11-29 09:26:40,703 INFO Included extra file "/etc/supervisor/supervisord/conf.d/logtail.conf" during parsing
2023-11-29 09:26:40,703 INFO Included extra file "/etc/supervisor/supervisord/conf.d/quicktunnel.conf" during parsing
2023-11-29 09:26:40,703 INFO Included extra file "/etc/supervisor/supervisord/conf.d/rclone_mount.conf" during parsing
2023-11-29 09:26:40,703 INFO Included extra file "/etc/supervisor/supervisord/conf.d/serverless.conf" during parsing
2023-11-29 09:26:40,703 INFO Included extra file "/etc/supervisor/supervisord/conf.d/serviceportal.conf" during parsing
2023-11-29 09:26:40,703 INFO Included extra file "/etc/supervisor/supervisord/conf.d/sshd.conf" during parsing
2023-11-29 09:26:40,704 INFO Set uid to user 0 succeeded
2023-11-29 09:26:40,711 INFO RPC interface 'supervisor' initialized
2023-11-29 09:26:40,711 CRIT Server 'inet_http_server' running without any HTTP authentication checking
2023-11-29 09:26:40,712 INFO RPC interface 'supervisor' initialized
2023-11-29 09:26:40,712 INFO supervisord started with pid 92
2023-11-29 09:26:41,715 INFO spawned: 'logtail' with pid 98
2023-11-29 09:26:41,718 INFO spawned: 'serverless' with pid 99
2023-11-29 09:26:41,720 INFO spawned: 'serviceportal' with pid 100
2023-11-29 09:26:41,722 INFO spawned: 'sshd' with pid 102
2023-11-29 09:26:41,723 INFO spawned: 'caddy' with pid 103
2023-11-29 09:26:43,723 INFO success: serverless entered RUNNING state, process has stayed up for > than 2 seconds (startsecs)
2023-11-29 09:26:44,727 INFO exited: caddy (exit status 1; not expected)
2023-11-29 09:26:44,727 INFO exited: caddy (exit status 1; not expected)
2023-11-29 09:26:45,728 INFO spawned: 'caddy' with pid 141
2023-11-29 09:26:45,728 INFO spawned: 'caddy' with pid 141
2023-11-29 09:26:46,728 INFO success: logtail entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
2023-11-29 09:26:46,728 INFO success: serviceportal entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
2023-11-29 09:26:46,728 INFO success: sshd entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
2023-11-29 09:26:46,728 INFO success: logtail entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
2023-11-29 09:26:46,728 INFO success: serviceportal entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
2023-11-29 09:26:46,728 INFO success: sshd entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
2023-11-29 09:26:47,861 INFO exited: caddy (exit status 1; not expected)

==> /var/log/supervisor/caddy.log <==
{"level":"info","ts":1701250007.8570542,"msg":"using provided configuration","config_file":"/opt/caddy/etc/Caddyfile","config_adapter":""}
{"level":"warn","ts":1701250007.8587267,"logger":"http.auto_https","msg":"automatic HTTPS is completely disabled for server","server_name":"srv0"}
{"level":"warn","ts":1701250007.8587873,"logger":"http.auto_https","msg":"automatic HTTPS is completely disabled for server","server_name":"srv1"}
Error: loading initial config: loading new config: http app module: start: listening on :11111: listen tcp :11111: bind: address already in use

==> /var/log/supervisor/supervisor.log <==
2023-11-29 09:26:47,861 INFO exited: caddy (exit status 1; not expected)
2023-11-29 09:26:50,732 INFO spawned: 'caddy' with pid 174
2023-11-29 09:26:51,723 INFO exited: serverless (exit status 0; expected)
2023-11-29 09:26:50,732 INFO spawned: 'caddy' with pid 174
2023-11-29 09:26:51,723 INFO exited: serverless (exit status 0; expected)
2023-11-29 09:26:52,732 INFO exited: sshd (exit status 0; expected)

==> /var/log/supervisor/caddy.log <==
{"level":"info","ts":1701250012.8670478,"msg":"using provided configuration","config_file":"/opt/caddy/etc/Caddyfile","config_adapter":""}
{"level":"warn","ts":1701250012.8683994,"logger":"http.auto_https","msg":"automatic HTTPS is completely disabled for server","server_name":"srv0"}
{"level":"warn","ts":1701250012.8684745,"logger":"http.auto_https","msg":"automatic HTTPS is completely disabled for server","server_name":"srv1"}
Error: loading initial config: loading new config: http app module: start: listening on :11111: listen tcp :11111: bind: address already in use

==> /var/log/supervisor/supervisor.log <==
2023-11-29 09:26:52,732 INFO exited: sshd (exit status 0; expected)
2023-11-29 09:26:53,729 INFO exited: caddy (exit status 1; not expected)
2023-11-29 09:26:53,729 INFO exited: caddy (exit status 1; not expected)
2023-11-29 09:26:56,732 INFO spawned: 'caddy' with pid 204
2023-11-29 09:26:56,732 INFO spawned: 'caddy' with pid 204

==> /var/log/supervisor/caddy.log <==
{"level":"info","ts":1701250018.8984635,"msg":"using provided configuration","config_file":"/opt/caddy/etc/Caddyfile","config_adapter":""}
{"level":"warn","ts":1701250018.8995972,"logger":"http.auto_https","msg":"automatic HTTPS is completely disabled for server","server_name":"srv0"}
{"level":"warn","ts":1701250018.8996172,"logger":"http.auto_https","msg":"automatic HTTPS is completely disabled for server","server_name":"srv1"}
Error: loading initial config: loading new config: http app module: start: listening on :11111: listen tcp :11111: bind: address already in use
2023-11-29 09:26:59,730 INFO exited: caddy (exit status 1; not expected)

==> /var/log/supervisor/supervisor.log <==
2023-11-29 09:26:59,730 INFO exited: caddy (exit status 1; not expected)
2023-11-29 09:27:00,730 INFO gave up: caddy entered FATAL state, too many start retries too quickly
2023-11-29 09:27:00,730 INFO gave up: caddy entered FATAL state, too many start retries too quickly
{"message":"Received stop signal SIGTERM, shutting down...","level":"info","timestamp":"2023-11-29T09:28:09.025Z"}
{"message":"All mounts released. Exiting...","level":"info","timestamp":"2023-11-29T09:28:09.030Z"}
[INFO  tini (1)] Spawned child process './integrations-sidecar' with pid '7'
[INFO  tini (1)] Main child exited normally (with status '0')

I'm not familiar with the workings of the container, but would it be possible to disable the service portal using an environment variable or the provisioning script?

It looks like Caddy is trying to bind to port 11111, which is the port the service portal takes. Caddy is supposed to be bound to 1111 for the service portal. Do you have any environment variables being passed to the container?

You may be able to pass SERVICEPORTAL_PORT_HOST=1111 (or any other port number) to force it to start.
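
For example, when launching the container directly it can be passed as a normal environment variable. A minimal sketch assuming a docker run launch (the tag and flags here are illustrative, and managed platforms will have their own way of setting env vars):

# Sketch: force the service portal's host port at launch.
# Tag and flags are illustrative, not a prescribed invocation.
docker run --gpus all \
  -e SERVICEPORTAL_PORT_HOST=1111 \
  ghcr.io/ai-dock/stable-diffusion-webui:latest-cuda-jupyter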

I have no answer as to why this would happen only on GPU instances.

MrNeon commented

The command I've used without issue for several weeks is:

init.sh WORKSPACE=/storage WORKSPACE_SYNC=false WEB_ENABLE_AUTH=false

I tried SERVICEPORTAL_PORT_HOST=1111, but the same issue happens with port 11111.
However, when I tried SERVICEPORTAL_PORT_HOST=9797, the error message complained about port 18888 being in use instead.

Unfortunately Paperspace makes it a pain in the ass to test things out, since you can't change the command and other settings of a notebook after creating it, and right now the notebook is stuck setting up an image...

That's interesting, because the complaint about 18888 is the local Jupyter that gets proxied at 8888 (internally each service listens on the proxied port with a leading 1, so 1111 → 11111 and 8888 → 18888), which suggests the service portal did start this time and something further along is failing.
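
If you can grab a shell in the container before it's torn down, checking what's already listening on those ports would confirm it. A minimal sketch with generic Linux tooling, assuming ss or lsof is present in the image:

# Show listening TCP sockets and the owning process (iproute2's ss):
ss -tlnp | grep -E ':(11111|18888)\s'
# Alternative if lsof is installed instead:
lsof -nP -iTCP:18888 -sTCP:LISTEN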

Unfortunately I forgot to make the Jupyter and WebUI ports dynamic, so they can't be changed through environment variables.

I'll rebuild the images to support that, and then it may start working again. But none of this should be necessary, as it's working as expected on other platforms.
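
The usual shell pattern for that is just a defaulted lookup in the init script. A sketch of the idea (WEBUI_PORT_HOST and JUPYTER_PORT_HOST are hypothetical names following the SERVICEPORTAL_PORT_HOST convention, not the repo's actual code):

# Hypothetical: make the app ports overridable via env, with the
# current hard-coded values as defaults.
WEBUI_PORT_HOST="${WEBUI_PORT_HOST:-7860}"
JUPYTER_PORT_HOST="${JUPYTER_PORT_HOST:-8888}"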

I stopped using Paperspace because of their awkwardness, and I usually recommend vast and RunPod. I'll still try to fix it, though.

MrNeon commented

Thanks for your work.

I have to admit I created the issue here because it seemed the most visible place, but I've actually been using the jupyter-pytorch image (specifically 2.1.0-py3.10-cuda-12.1.0-base-22.04) for several weeks.

I tried stable-diffusion-webui:latest-cuda-jupyter just now and it started with zero issues... maybe that helps narrow down the problem with jupyter-pytorch. I also tried base-image, and it errors out in the same way due to port 11111.
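
For reference, outside Paperspace the same comparison would look roughly like this (registry paths are assumed from the repo names, and the base-image tag is a guess; the others are the tags mentioned above):

# Repro sketch: start each image layer in turn and watch the startup logs.
docker run --rm -it --gpus all ghcr.io/ai-dock/base-image:latest-cuda
docker run --rm -it --gpus all ghcr.io/ai-dock/jupyter-pytorch:2.1.0-py3.10-cuda-12.1.0-base-22.04
docker run --rm -it --gpus all ghcr.io/ai-dock/stable-diffusion-webui:latest-cuda-jupyter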

I have rebuilt the images from base-image up to jupyter-pytorch and have successfully started the image you are using on Paperspace.

I've had issues with the Paperspace UI crashing, but everything seems to work through the tunnel links.

MrNeon commented

Sorry, got distracted by Advent of Code and real life and couldn't test them.
Started a machine with jupyter-pytorch:2.1.1-py3.10-cuda-12.1.0-base-22.04 and no issues!

Fabulous!

Thanks for letting me know 🙂