DDMAL/Rodan

production server with multiple instances


Updated on July 10, 2024

  • rodan2.simssa.ca
    is the production server with a single vGPU instance. Anything except PACO training can run. GPU jobs are fast; all other jobs are slow. There is also a popup login window issue on the homepage, so if you need to create a new account, please do so and open the activation link on your phone.
  • rodan.simssa.ca
    is the server with two instances (more information below). Initially nothing could run on this server; anything except PACO training should now be able to run. There is no popup login window on the homepage, and account creation works normally. It should function the same as rodan2.simssa.ca, but everything is significantly faster.

I tested today with two small vGPU instances that it is possible to do something like "distributed computing" with Docker Swarm, placing containers on different instances. Since we do not know when we will get a free spot on Arbutus, I suggest we deploy production Rodan on two smaller instances instead of a single larger one.

  1. (the original plan) use g1-16gb-c8-40gb, which has 8 vCPUs and 40GB RAM with a 16GB-RAM vGPU.
  2. (another option) use g1-8gb-c4-22gb and p16-16gb, for example, so that all containers except GPU-celery run with 16 vCPUs and 16GB RAM, while the GPU-celery container has 4 vCPUs and 22GB RAM to share with iipsrv. Although this vGPU has 8GB instead of 16GB RAM, it still seems to perform better than the current staging GPU.
  3. (current stack) use g1-8gb-c4-22gb for everything. GPU jobs are fast but non-GPU jobs are really slow because we don't have enough vCPUs.

Option 1 needs 8 vCPUs and 40GB RAM, while option 2 needs 20 vCPUs and 38GB RAM. Since we are mainly short of RAM after the extension, option 2 actually gives us more vCPUs to improve the performance of the non-GPU containers. We could also use the remaining 2GB RAM for a separate data-storage instance.

Based on my experiments, we only need to deploy (and later update) the stack on the instance with the manager node and Docker swarm will handle the rest as long as we correctly join the network and label the worker node.
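
For reference, a minimal sketch of that Swarm setup (the label name and the bracketed placeholders are illustrative; the production.yml below ends up relying on node.role constraints rather than custom labels):

# on the manager (non-GPU) instance
docker swarm init --advertise-addr <manager ip>
docker swarm join-token worker          # prints the join command for workers

# on the vGPU worker instance, using the token printed above
docker swarm join --token <worker token> <manager ip>:2377

# back on the manager: verify the node joined and label it if needed
docker node ls
docker node update --label-add gpu=true <worker hostname>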

The only trouble I have encountered so far is that the worker instance running the GPU container needs to access data stored on the non-GPU instance. But I believe this is possible. One common practice is to use NFS, which I will try this week and report back.

Update on Jun 27, 2024
We already have one g1-16gb-c8-40gb vGPU instance, but with the wrong OS (after many trials, Ubuntu 22.04 turns out to have broken DNS resolution in Docker Swarm, so we always get Redis timeout errors). Since it will be extremely difficult to launch a new vGPU instance of this flavor, and it was pure luck that we managed to launch this one, I was hoping to make use of this server. However, it cannot work as a worker node, nor can it be rebuilt with the correct OS:

openstack server rebuild --image 484b2b0c-a9ba-4d8e-b966-5735b5a6f8dc test_rebuild
Attempting to rebuild a volume-backed server using --os-compute-api-version 2.92 or earlier, which will only succeed if the image is identical to the one initially used. This will be an error in a future release.
Image 484b2b0c-a9ba-4d8e-b966-5735b5a6f8dc is unacceptable: Unable to rebuild with a different image for a volume-backed server. (HTTP 400) (Request-ID: req-742195ef-5643-4d96-9d9b-6a1c2977fc23)

As a result, we will just delete this instance because it turns out to be almost useless for us.

  • Current plan:
    In case we need a second server for classifying etc. while staging is training, I will revert the current production so that rodan2 can still be used, then redo everything from scratch on new instances and port it somewhere else.

  • Notes for picking the manager instance (without GPU):
    We want more vCPUs so that all the non-GPU jobs run smoothly. According to Arbutus, instance flavors are named as follows:

"c" designates "compute", "p" designates "persistent", and "g" designates "vGPU".
"c" flavor is targeted towards CPU intensive tasks, while "p" flavor is geared towards web servers, data base servers and instances that have a lower CPU or bursty CPU usage profile in general.

However, "c" instances are expensive in RAM. If we go with one g1-8gb-c4-22gb for GPU worker instance, we have around 40GB left. We can only afford c8-30gb-288 among "c" flavors. However, we can still get p16 with RAM options for 16, 24, and 32 GB. I will try with p16-32gb first because we do not want to waste extra resources to prevent Compute Canada from down grading us for the next year.

We now have the rodan2 production server back for all tasks except PACO training with GPU. The distributed production setup with one "p" instance and one vGPU instance still needs more testing before NFS can be deployed; there is currently a "broken pipe" error for rodan-main on the "p" instance.

Update on July 3, 2024
This is with p16-24gb as the manager host on Debian 12. I have tested on a few instances and it looks like we need Docker <= 26 to be able to set up the swarm network successfully. The latest Docker 27 leads to errors here and there.
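A sketch of how Docker can be held at 26 on Debian 12, assuming it was installed from the official Docker apt repository (take the exact version string from the madison output):

# list the versions available from the Docker apt repo
apt-cache madison docker-ce
# install a specific 26.x build (substitute the version string shown by madison)
sudo apt-get install docker-ce=<26.x version> docker-ce-cli=<26.x version> containerd.io
# prevent apt from upgrading to 27
sudo apt-mark hold docker-ce docker-ce-cli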
To be able to point the URL at the manager instance's IP and run the GPU container on the vGPU instance, we need the following production.yml.

version: "3.4"

services:

  nginx:
    image: "ddmal/nginx:v3.0.0"
    deploy:
      replicas: 1
      resources:
        reservations:
          cpus: "0.5"
          memory: 0.5G
        limits:
          cpus: "0.5"
          memory: 0.5G
      restart_policy:
        condition: any
        delay: 5s
        window: 30s
      placement:
        constraints:
          - node.role == manager
    healthcheck:
      test: ["CMD", "/usr/sbin/service", "nginx", "status"]
      interval: "30s"
      timeout: "10s"
      retries: 10
      start_period: "5m"
    command: /run/start
    environment:
      TZ: America/Toronto
      SERVER_HOST: rodan.simssa.ca
      TLS: 1
    ports:
      - "80:80"
      - "443:443"
      - "5671:5671"
      - "9002:9002"
    volumes:
      - "resources:/rodan/data"

  rodan-main:
    image: "ddmal/rodan-main:v3.0.0"
    deploy:
      replicas: 1
      resources:
        reservations:
          cpus: "1"
          memory: 3G
        limits:
          cpus: "1"
          memory: 3G
      restart_policy:
        condition: any
        delay: 5s
        window: 30s
      placement:
        constraints:
          - node.role == manager
    healthcheck:
      test: ["CMD-SHELL", "/usr/bin/curl -H 'User-Agent: docker-healthcheck' http://localhost:8000/api/?format=json || exit 1"]
      interval: "30s"
      timeout: "30s"
      retries: 5
      start_period: "2m"
    command: /run/start
    environment:
      TZ: America/Toronto
      SERVER_HOST: rodan.simssa.ca
      CELERY_JOB_QUEUE: None
    env_file:
      - ./scripts/production.env
    volumes:
      - "resources:/rodan/data"

  rodan-client:
    image: "ddmal/rodan-client:nightly"
    deploy:
      placement:
        constraints:
          - node.role == manager
    volumes:
        - "./rodan-client/config/configuration.json:/client/configuration.json"

  iipsrv:
    image: "ddmal/iipsrv:nightly"
    volumes:
      - "resources:/rodan/data"

  celery:
    image: "ddmal/rodan-main:v3.0.0"
    deploy:
      replicas: 1
      resources:
        reservations:
          cpus: "0.8"
          memory: 2G
        limits:
          cpus: "0.8"
          memory: 2G
      restart_policy:
        condition: any
        delay: 5s
        window: 30s
      placement:
        constraints:
          - node.role == manager
    healthcheck:
      test: ["CMD", "celery", "inspect", "ping", "-A", "rodan", "--workdir", "/code/Rodan", "-d", "celery@celery", "-t", "30"]
      interval: "30s"
      timeout: "30s"
      start_period: "1m"
      retries: 5
    command: /run/start-celery
    environment:
      TZ: America/Toronto
      SERVER_HOST: rodan.simssa.ca
      CELERY_JOB_QUEUE: celery
    env_file:
      - ./scripts/production.env
    volumes:
      - "resources:/rodan/data"

  py3-celery:
    image: "ddmal/rodan-python3-celery:v3.0.0"
    deploy:
      replicas: 1
      resources:
        reservations:
          cpus: "3"
          memory: 3G
        limits:
          cpus: "3"
          memory: 3G
      restart_policy:
        condition: any
        delay: 5s
        window: 30s
      placement:
        constraints:
          - node.role == manager
    healthcheck:
      test: ["CMD", "celery", "inspect", "ping", "-A", "rodan", "--workdir", "/code/Rodan", "-d", "celery@Python3", "-t", "30"]
      interval: "30s"
      timeout: "30s"
      retries: 5
    command: /run/start-celery
    environment:
      TZ: America/Toronto
      SERVER_HOST: rodan.simssa.ca
      CELERY_JOB_QUEUE: Python3
    env_file:
      - ./scripts/production.env
    volumes:
      - "resources:/rodan/data"

  gpu-celery:
    image: "ddmal/rodan-gpu-celery:v3.0.0"
    deploy:
      replicas: 1
      resources:
        reservations:
          cpus: "3"
          memory: 18G
        limits:
          cpus: "3"
          memory: 18G
      placement:
        constraints:
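          # gpu-celery is the only service constrained to the worker (vGPU) node;
          # every other service is pinned to the manager via node.role == manager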
          - node.role == worker
      restart_policy:
        condition: any
        delay: 5s
        window: 30s
    healthcheck:
      test: ["CMD", "celery", "inspect", "ping", "-A", "rodan", "--workdir", "/code/Rodan", "-d", "celery@GPU", "-t", "30"]
      interval: "30s"
      timeout: "30s"
      retries: 5
    command: /run/start-celery
    environment:
      TZ: America/Toronto
      SERVER_HOST: rodan.simssa.ca
      CELERY_JOB_QUEUE: GPU
    env_file:
      - ./scripts/production.env
    volumes:
      - "resources:/rodan/data"

  redis:
    image: "redis:alpine"
    deploy:
      replicas: 1
      resources:
        reservations:
          cpus: "1"
          memory: 2G
        limits:
          cpus: "1"
          memory: 2G
      restart_policy:
        condition: any
        delay: 5s
        window: 30s
      placement:
        constraints:
          - node.role == manager
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5
    environment:
      TZ: America/Toronto

  postgres:
    image: "ddmal/postgres-plpython:v3.0.0"
    deploy:
      replicas: 1
      endpoint_mode: dnsrr
      resources:
        reservations:
          cpus: "2"
          memory: 2G
        limits:
          cpus: "2"
          memory: 2G
      restart_policy:
        condition: any
        delay: 5s
        window: 30s
      placement:
        constraints:
          - node.role == manager
    healthcheck:
      test: ["CMD-SHELL", "pg_isready", "-U", "postgres"]
      interval: 10s
      timeout: 5s
      retries: 5
    environment:
      TZ: America/Toronto
    volumes:
      - "pg_data:/var/lib/postgresql/data"
      - "pg_backup:/backups"
    env_file:
      - ./scripts/production.env

  rabbitmq:
    image: "rabbitmq:alpine"
    deploy:
      replicas: 1
      resources:
        reservations:
          cpus: "1"
          memory: 4G
        limits:
          cpus: "1"
          memory: 4G
      restart_policy:
        condition: any
        delay: 5s
        window: 30s
      placement:
        constraints:
          - node.role == manager
    healthcheck:
      test: ["CMD", "rabbitmq-diagnostics", "-q", "ping"]
      interval: "30s"
      timeout: "3s"
      retries: 3
    environment:
      TZ: America/Toronto
    env_file:
      - ./scripts/production.env

volumes:
  resources:
  pg_backup:
  pg_data:

Essentially, only gpu-celery runs on the worker instance. We cannot specify a deploy constraint for iipsrv, so it might be deployed on the worker instance as well. All other services should be on the manager node, which has sufficient vCPUs and RAM.
This server is up at rodan.simssa.ca. The login popup window on the homepage seems to be gone as well, given the larger CPU and memory allocations for each container.
Do not use Docker version >= 27. Only use Docker <= 26.
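
For completeness, a minimal sketch of deploying the stack from the manager and checking where each service landed (the stack name rodan is illustrative):

# deploy or update the stack on the manager instance
docker stack deploy -c production.yml rodan

# confirm that only gpu-celery (and possibly iipsrv) runs on the worker
docker stack ps rodan
docker service ps rodan_gpu-celery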

Remaining issues:

  • set up NFS (or other data sharing methods) for all jobs to run
  • make sure we can do training with GPU

The remaining issue is in #1181.

To set up NFS, I followed this guide and did everything except ufw (firewall), which we might set up later if needed.
I directly exported /var/lib/docker/volumes/ from the manager, mounted it at the same path on the worker, and restarted the GPU container.
This is what's inside /etc/exports on the manager instance:

/var/lib/docker/volumes [worker node public ip](rw,sync,no_subtree_check,no_root_squash)

(Need sudo systemctl restart nfs-kernel-server after editing.)
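
On the worker side, a minimal sketch of mounting the exported directory (paths match the export above; the manager IP is a placeholder, and running sudo exportfs -ra on the manager is an alternative to restarting the NFS server after editing /etc/exports):

# on the worker (vGPU) instance
sudo apt-get install nfs-common
sudo mount -t nfs <manager ip>:/var/lib/docker/volumes /var/lib/docker/volumes

# optional /etc/fstab entry to make the mount persistent across reboots
# <manager ip>:/var/lib/docker/volumes  /var/lib/docker/volumes  nfs  defaults  0  0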

Good practice for detaching and deleting a worker instance (all commands run on the worker instance):

  1. Leave the Docker swarm network:
docker swarm leave --force
  2. Unmount the NFS directory:
sudo umount /var/lib/docker/volumes/
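
On the manager side, a possible follow-up (not part of the original notes) is to clear the departed node's record from the swarm:

# drain the node's tasks before it leaves, then remove its record once it shows as Down
docker node update --availability drain <worker hostname>
docker node rm <worker hostname>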