ai-dock/comfyui

runpod serverless permission error

hongminpark opened this issue · 11 comments

Hi. I followed the guide and added some models, but I am getting these errors.
It worked previously, so I am not sure whether this is related to runpod or something else.

2024-02-09T13:31:26.016742103Z ==> /var/log/supervisor/storagemonitor.log <==
2024-02-09T13:31:26.016748663Z error: could not lock config file /home/user/.gitconfig: Permission denied
2024-02-09T13:31:26.016753303Z Starting storage monitor..
2024-02-09T13:31:26.016757913Z ln: failed to create symbolic link '/runpod-volume/storage/README': Permission denied
2024-02-09T13:31:26.016763463Z mkdir: cannot create directory ‘/runpod-volume/storage/stable_diffusion’: Permission denied
2024-02-09T13:31:26.016777443Z ln: failed to create symbolic link '/runpod-volume/storage/stable_diffusion/models/controlnet/control_v11p_sd15_openpose.pth': No such file or directory
2024-02-09T13:31:26.016781692Z mkdir: cannot create directory ‘/runpod-volume/storage/stable_diffusion’: Permission denied
2024-02-09T13:31:26.016785932Z ln: failed to create symbolic link '/runpod-volume/storage/stable_diffusion/models/embeddings/bad_prompt_version2-neg.pt': No such file or directory
2024-02-09T13:31:26.016790592Z mkdir: cannot create directory ‘/runpod-volume/storage/stable_diffusion’: Permission denied
2024-02-09T13:31:26.016795122Z ln: failed to create symbolic link '/runpod-volume/storage/stable_diffusion/models/embeddings/easynegative.safetensors': No such file or directory
2024-02-09T13:31:26.016798732Z mkdir: cannot create directory ‘/runpod-volume/storage/stable_diffusion’: Permission denied
2024-02-09T13:31:26.016802422Z ln: failed to create symbolic link '/runpod-volume/storage/stable_diffusion/models/embeddings/ng_deepnegative_v1_75t.pt': No such file or directory
2024-02-09T13:31:26.016805872Z mkdir: cannot create directory ‘/runpod-volume/storage/stable_diffusion’: Permission denied
2024-02-09T13:31:26.016811022Z ln: failed to create symbolic link '/runpod-volume/storage/stable_diffusion/models/ultralytics/bbox/face_yolov8m.pt': No such file or directory
2024-02-09T13:31:26.016816342Z mkdir: cannot create directory ‘/runpod-volume/storage/stable_diffusion’: Permission denied
2024-02-09T13:31:26.016821142Z ln: failed to create symbolic link '/runpod-volume/storage/stable_diffusion/models/ultralytics/segm/sam_vit_b_01ec64.pth': No such file or directory
2024-02-09T13:31:26.016825712Z mkdir: cannot create directory ‘/runpod-volume/storage/stable_diffusion’: Permission denied
2024-02-09T13:31:26.016830732Z ln: failed to create symbolic link '/runpod-volume/storage/stable_diffusion/models/vae/vae-ft-mse-840000-ema-pruned.safetensors': No such file or directory
2024-02-09T13:31:26.016841892Z Setting up watches.  Beware: since -r was given, this may take a while!
2024-02-09T13:31:26.016846572Z Watches established.

I've noticed some unusual behaviour with the runpod network volumes recently. I'm not sure what, if anything, they have changed, but I'll investigate and try to work around it.

I'm running into the same issue as well. The changes I've made are adding some models/nodes to layer1/init.sh, as well as these lines:

    build_extra_get_models \
        "/opt/storage/stable_diffusion/models/ipadapter" \
        "${IPADAPTER_MODELS[@]}"
    build_extra_get_models \
        "/opt/storage/stable_diffusion/models/clip_vision" \
        "${CLIP_VISION_MODELS[@]}"

Adding to mappings.sh

storage_map["stable_diffusion/models/ipadapter"]="/opt/ComfyUI/models/ipadapter"
storage_map["stable_diffusion/models/clip"]="/opt/ComfyUI/models/clip"
storage_map["stable_diffusion/models/clip_vision"]="/opt/ComfyUI/models/clip_vision"

Changed the tags/image to my own ghcr repo

version: "3.8"
services:
  supervisor:
    build:
      context: ./build
      args:
        IMAGE_BASE: ${IMAGE_BASE:-ghcr.io/ai-dock/jupyter-pytorch:2.1.1-py3.10-cuda-11.8.0-base-22.04}
      tags:
        - "ghcr.io/bkunbargi/comfybrev:jupyter-pytorch:2.1.1-py3.10-cuda-11.8.0-base-22.04"
        
    image: ghcr.io/bkunbargi/comfybrev:jupyter-pytorch:2.1.1-py3.10-cuda-11.8.0-base-22.04

Update: I added a network volume and the issue changed to

[01qu4hfco957dp]
[info]
ln: failed to create symbolic link '/opt/ComfyUI/models/checkpoints/v1-5-pruned-emaonly.ckpt': Permission denied
2024-02-09 15:41:30.091
[01qu4hfco957dp]
[info]
ln: failed to create symbolic link '/opt/ComfyUI/models/controlnet/control_canny-fp16.safetensors': Permission denied
2024-02-09 15:41:30.091
[01qu4hfco957dp]
[info]
ln: failed to create symbolic link '/opt/ComfyUI/models/controlnet/diff_control_sd15_depth_fp16.safetensors': Permission denied
2024-02-09 15:41:30.091
[01qu4hfco957dp]
[info]
mkdir: cannot create directory ‘/opt/ComfyUI/models/ipadapter’: Permission denied

Ok, apologies for this. The issue comes from a change in the underlying images that ComfyUI inherits from, so builds are currently broken until I push the changes up to ComfyUI. The fix still needs some testing.

The cause is that the storage monitor now runs as a system user (user), so it doesn't have permission to write to the root-owned workspace (runpod-volume). An interim fix may be to override the storagemonitor supervisor file so that it still runs as root, or to change the permissions in the workspace. You should be able to achieve this by running your images in GPU cloud with the volume attached.

Should be resolved by Monday but hopefully earlier.
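For anyone who wants to try the supervisor override at build time, here is a minimal sketch rather than a tested fix. It assumes the storagemonitor program is defined in its own supervisor config file containing a single [program:...] section; the helper name set_supervisor_user is hypothetical, and the real location of the config file inside the image is not stated in this thread, so it has to be passed in.

```shell
#!/bin/bash
# Sketch only: set the `user=` option of a supervisor program config.
# Assumptions (not confirmed in this thread):
#   - the file holds a single [program:...] section
#   - set_supervisor_user is a hypothetical helper, not part of ai-dock
set_supervisor_user() {
    local file="$1"
    local user="$2"

    # Nothing to do if the config file is not present
    [ -f "$file" ] || return 0

    if grep -q '^user=' "$file"; then
        # Replace the existing user= line (GNU sed in-place edit)
        sed -i "s/^user=.*/user=${user}/" "$file"
    else
        # No user= line yet: append one to the (single) program section
        printf 'user=%s\n' "$user" >> "$file"
    fi
}
```

If the file is changed in a running container instead of during the build, supervisor has to pick it up again, for example with `supervisorctl reread` followed by `supervisorctl update`.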

I see, thank you for the investigation. I have a question: is the storage monitor included in this image (ai-dock/comfyui) or in runpod? And is it the same as the inotifywait mentioned in the README? I would like to know what these two are.

Also, I am deploying on runpod serverless, so I can't edit in the cloud. Do I need to edit init.sh?

Both are bundled in my image. They are useful here but will be used more extensively in an upcoming image I have planned.

If you have no volume mounted you can just chown -R 1000:1000 /runpod-volume. Add it in preflight.sh on layer0. This is a temporary fix - I wouldn't normally suggest it but it may get you running for now.

That's as much help as I can offer currently as I need to test this myself. I'll update when I know more
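A slightly more defensive version of that chown, as a sketch for preflight.sh. It assumes the script runs as root on a GNU/Linux image (it relies on `stat -c`), and that 1000:1000 is the uid:gid of the unprivileged user as in the command above. The helper name fix_volume_owner is hypothetical.

```shell
#!/bin/bash
# Sketch only: hand a volume directory over to the unprivileged user.
# fix_volume_owner is a hypothetical helper, not part of ai-dock.
fix_volume_owner() {
    local dir="$1"
    local owner="${2:-1000:1000}"   # uid:gid, default taken from the comment above

    # Nothing to do if the volume path does not exist
    [ -d "$dir" ] || return 0

    # Skip the recursive chown when the top level already has the wanted owner
    if [ "$(stat -c '%u:%g' "$dir")" = "$owner" ]; then
        return 0
    fi

    chown -R "$owner" "$dir"
}
```

It would be called as `fix_volume_owner /runpod-volume`. Checking only the top-level directory keeps startup fast on large volumes, at the cost of missing files deeper in the tree that have a different owner.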

I'll give it a try.
Thank you so much, I really appreciate your work🙏

@hongminpark did it work for you?

Nope, so I tried moving the custom files directly into the ComfyUI directory in init.sh:

function build_extra_start() {
    build_extra_get_nodes
    # ...

    # Move the downloaded models directly into the ComfyUI model directories
    mv /opt/storage/stable_diffusion/models/ckpt/* /opt/ComfyUI/models/checkpoints
    mv /opt/storage/stable_diffusion/models/controlnet/* /opt/ComfyUI/models/controlnet
    mv /opt/storage/stable_diffusion/models/embeddings/* /opt/ComfyUI/models/embeddings
}

The model files are now included correctly, but I'm facing a new problem: the runpod serverless handler is not responding to my API requests. The job only shows 'queued' and ComfyUI never receives a request.
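One note on the mv lines in the init.sh snippet above: a plain `mv dir/* dest` fails when the source directory is empty, because the glob is passed through literally. A guarded variant could look like the sketch below. The helper name move_models is hypothetical, and this only tidies up the workaround; it does not address the permission problem itself.

```shell
#!/bin/bash
# Sketch only: move every entry from one model directory into another.
# move_models is a hypothetical helper, not part of ai-dock.
move_models() {
    local src="$1"
    local dest="$2"
    local f

    mkdir -p "$dest"
    for f in "$src"/*; do
        # An empty or missing source leaves the pattern unexpanded; skip it
        [ -e "$f" ] || continue
        mv -- "$f" "$dest"/
    done
}
```

Inside build_extra_start it would be used as, for example, `move_models /opt/storage/stable_diffusion/models/controlnet /opt/ComfyUI/models/controlnet`.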

The new issue is related. It's all about permissions but fortunately I expect this to be resolved when the update is pushed.

Source tree has been updated and builds should now work as expected. Base bumped to PyTorch 2.2.0 on nvidia-runtime base image.

Fixed. Issues relating to permissions are solved by building against the latest PyTorch/Jupyter PyTorch images.