Atinoda/text-generation-webui-docker

Running without a GPU

Sharpz7 opened this issue · 9 comments

Hey,

I was wanting to check if it is possible to run this container without a GPU?

Thanks,

You sure can, and there are some instructions in #9 that should help you set it up - basically, just comment out all the GPU parts in the docker-compose.yml (or don't include --gpus all if you're running without compose).

You'll need to be a patient man though - it's slow as molasses without a GPU!

This didn't seem to work in my environment, and it errors that it can't find a GPU when you load a model. I will try some things and get back to you.

I managed to get it working by using this guide: https://github.com/oobabooga/text-generation-webui/blob/main/docs/Low-VRAM-guide.md

And making this change:

command: ["python", "/app/server.py", "--auto-devices"]

```yaml
version: "3"
services:
  text-generation-webui-docker:
    image: atinoda/text-generation-webui:default # Specify variant as the :tag
    container_name: text-generation-webui
    environment:
      - EXTRA_LAUNCH_ARGS="--listen --verbose" # Custom launch args (e.g., --model MODEL_NAME)
#      - BUILD_EXTENSIONS_LIVE="silero_tts whisper_stt" # Install named extensions during every container launch. THIS WILL SIGNIFICANTLY SLOW LAUNCH TIME.
    ports:
      - 7860:7860  # Default web port
#      - 5000:5000  # Default API port
#      - 5005:5005  # Default streaming port
#      - 5001:5001  # Default OpenAI API extension port
    volumes:
      - ./config/loras:/app/loras
      - ./config/models:/app/models
      - ./config/presets:/app/presets
      - ./config/prompts:/app/prompts
      - ./config/softprompts:/app/softprompts
      - ./config/training:/app/training
#      - ./config/extensions:/app/extensions  # Persist all extensions
#      - ./config/extensions/silero_tts:/app/extensions/silero_tts  # Persist a single extension
    logging:
      driver: json-file
      options:
        max-file: "3"   # number of files or file count
        max-size: '10m'
    command: ["python", "/app/server.py", "--auto-devices"]
    # deploy:
    #     resources:
    #       reservations:
    #         devices:
    #           - driver: nvidia
    #             device_ids: ['0']
    #             capabilities: [gpu]
```

Thanks for sharing your fix and confirming that it works with CPU only on your system. Enjoy your LLM-ing, and make sure your CPU cooler is tuned up!

PS. You can append --auto-devices to the EXTRA_LAUNCH_ARGS environment variable, instead of editing the CMD.
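For anyone following along, here is a minimal sketch of what that could look like in the compose file. It reuses the service name, image tag, and quoting style from the file posted above; only the `--auto-devices` flag is appended, and the `command:` override is left out. Treat it as an illustration rather than a tested configuration.

```yaml
services:
  text-generation-webui-docker:
    image: atinoda/text-generation-webui:default
    environment:
      # Same launch args as in the file above, with --auto-devices appended
      - EXTRA_LAUNCH_ARGS="--listen --verbose --auto-devices"
    ports:
      - 7860:7860  # Default web port
    # No `command:` override here; the flag is passed via the environment instead.
    # The `deploy:` GPU reservation block is omitted for CPU-only use.
```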

I also realised I was being silly - you can configure it from the settings:

https://drive.google.com/uc?id=1UEjDNVtbBh4oAdb4k_WJHPYpdpSXI2Kj

Thanks for the quick response. Looking forward to doing my LLM testing with this UI :))

If you would be interested in having a helm chart in this repo as well, I'd be happy to contribute

Hi @Atinoda, does "running without a GPU" also assume using the provided Dockerfile? IMHO the CUDA base image used there cannot be scheduled on a machine without a GPU?

Hi @globavi - since this discussion, a llama-cpu image has become available (see #16). It still uses the CUDA base image but it should work fine (I was able to run it on an Intel laptop that has only an iGPU). Can you please try it out and let me know if you run into any problems?
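As a rough sketch of how that might be selected with compose: the compose file earlier in this thread notes that the variant is specified as the `:tag`, so the example below assumes the CPU variant is published under a `llama-cpu` tag. That tag name is an assumption based on the image name mentioned here; check #16 or the registry for the exact tag.

```yaml
services:
  text-generation-webui-docker:
    # Assumption: the CPU-only variant is published under the `llama-cpu` tag
    image: atinoda/text-generation-webui:llama-cpu
    container_name: text-generation-webui
    environment:
      - EXTRA_LAUNCH_ARGS="--listen --verbose"
    ports:
      - 7860:7860  # Default web port
    volumes:
      - ./config/models:/app/models
    # No `deploy:` GPU reservation block, since no GPU is requested
```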

Hi @Atinoda,

I could start the app with the new image (I adapted a few things, as I use Azure infrastructure rather than docker compose), but after downloading a GGML model, the load_model process reports:

2023-08-22 08:19:23 INFO:Loading TheBloke_Llama-2-7B-Chat-GGML...
CUDA error 35 at ggml-cuda.cu:4883: CUDA driver version is insufficient for CUDA runtime version
/arrow/cpp/src/arrow/filesystem/s3fs.cc:2598: arrow::fs::FinalizeS3 was not called even though S3 was initialized. This could lead to a segmentation fault at exit
Stream closed EOF for customer-dev/claims-sle-textgen-ui-bash-684c9488c6-g4rxk (textgen-webui)

Hey,

I was wondering if iGPU inference is a thing?
I'm not sure if there would be any gains over CPU, but I'm curious.
I can't find a way to make it work.