production server with multiple instances
Updated on July 10, 2024
- `rodan2.simssa.ca` is the production server with a single vGPU instance. Anything except PACO training can run. GPU jobs are fast; all other jobs are slow. There is also a popup login window issue on the homepage, so if you need to create a new account, please do so and visit the activation link on your phone.
- `rodan.simssa.ca` is the server with two instances (more information below). ~~Currently, nothing can be done on this server.~~ There is no popup login window on the homepage, and account creation is normal. Anything except PACO training should be able to run on this server now. It should function the same as `rodan2.simssa.ca`, but everything is significantly faster.
I tested today with two small vGPU instances that it is possible to do something like "distributed computing" with Docker Swarm, putting containers on different instances. Since we do not know when we will get a free spot on Arbutus, I suggest we deploy production Rodan on two smaller instances instead of a single larger one.
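For reference, this is roughly how the two-node swarm is formed (a hedged sketch; IPs, tokens, and hostnames are placeholders):

```bash
# Minimal sketch of forming the two-instance swarm (addresses/tokens are placeholders).
docker swarm init --advertise-addr <manager-private-ip>    # on the manager (non-GPU) instance
docker swarm join-token worker                              # prints the join command for workers
# on the vGPU instance, paste the printed command:
docker swarm join --token <worker-token> <manager-private-ip>:2377
docker node ls                                              # back on the manager: both nodes should be Ready
```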
- (the original plan) use `g1-16gb-c8-40gb`, which has 8 vCPUs and 40GB RAM with a 16GB vGPU.
- (another option) use `g1-8gb-c4-22gb` and `p16-16gb`, for example, so that all containers except `GPU-celery` run with 16 vCPUs and 16GB RAM, and the `GPU-celery` container has 4 vCPUs and 22GB RAM to share with `iipsrv`. Although this vGPU has 8GB instead of 16GB RAM, it still seems to perform better than the current staging GPU.
- (current stack) use `g1-8gb-c4-22gb` for everything. GPU jobs are fast, but non-GPU jobs are really slow because we don't have enough vCPUs.
Option 1 needs 8 vCPUs and 40GB RAM, while option 2 needs 20 vCPUs (4 + 16) and 38GB RAM (22 + 16). Since we are mainly running short of RAM after the extension, option 2 gives us more vCPUs for roughly the same RAM, which should improve the performance of the non-GPU containers. We can also use the remaining 2GB RAM for a separate data storage instance.
Based on my experiments, we only need to deploy (and later update) the stack on the instance with the manager node, and Docker Swarm will handle the rest as long as we correctly join the network and label the worker node.
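Concretely, deployment from the manager looks roughly like this (hedged sketch; the stack name `rodan` and the label key are placeholders, and the `production.yml` below only uses `node.role` constraints):

```bash
# Hedged sketch, all run on the manager instance.
docker node update --label-add gpu=true <worker-node-hostname>   # optional: label the vGPU worker
docker stack deploy -c production.yml rodan                      # deploy (or later update) the stack
docker stack services rodan                                      # check that every service has its replicas running
```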
The only trouble I have encountered so far is that the worker instance running the GPU container needs to access data stored on the non-GPU instance. But I believe this is possible. One common practice is using NFS, which I will try this week and report back.
Update on Jun 27, 2024
We already have one `g1-16gb-c8-40gb` vGPU instance, but with the wrong OS (after many trials, Ubuntu 22.04 has broken DNS resolution in Docker Swarm, so we always get a Redis timeout error). Since it will be extremely difficult to launch a new vGPU instance of this flavor, and it was pure luck that we managed to launch this one, I was hoping to make use of this server. However, it cannot work as a worker node, nor can it be rebuilt with the correct OS:
```
openstack server rebuild --image 484b2b0c-a9ba-4d8e-b966-5735b5a6f8dc test_rebuild
Attempting to rebuild a volume-backed server using --os-compute-api-version 2.92 or earlier, which will only succeed if the image is identical to the one initially used. This will be an error in a future release.
Image 484b2b0c-a9ba-4d8e-b966-5735b5a6f8dc is unacceptable: Unable to rebuild with a different image for a volume-backed server. (HTTP 400) (Request-ID: req-742195ef-5643-4d96-9d9b-6a1c2977fc23)
```
As a result, we will just delete this instance because it turns out to be almost useless for us.
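Deleting it is a single OpenStack call (hedged sketch; the name stands in for whatever the instance is actually called):

```bash
# Hedged sketch: remove the unusable vGPU instance.
openstack server delete <instance-name-or-id>
```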
- Current plan:
  In case we need a second server for classifying etc. while staging is training, I will revert the current production so that rodan2 can be used, and do everything from scratch with new instances and port it to somewhere else.
- Notes for picking the manager instance (without GPU):
  We want more vCPUs so that all other non-GPU jobs run smoothly. According to Arbutus, instance flavors are named as follows: "c" designates "compute", "p" designates "persistent", and "g" designates "vGPU". The "c" flavor is targeted towards CPU-intensive tasks, while the "p" flavor is geared towards web servers, database servers, and instances that have a lower or bursty CPU usage profile in general.
  However, "c" instances are expensive in terms of RAM. If we go with one `g1-8gb-c4-22gb` for the GPU worker instance, we have around 40GB of RAM left, so we can only afford `c8-30gb-288` among the "c" flavors. With "p" flavors we can still get `p16` with 16, 24, or 32 GB of RAM (the options can be double-checked from the CLI, as sketched below). I will try `p16-32gb` first, because we do not want to waste extra resources and have Compute Canada downgrade us for the next year.
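The available flavors and their vCPU/RAM sizes can be listed from the OpenStack CLI (hedged sketch; flavor names follow the Arbutus naming above):

```bash
# Hedged sketch: list flavors and inspect one candidate in detail.
openstack flavor list
openstack flavor show p16-32gb
```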
We now have rodan2 prod back for all tasks except PACO training with GPU. The distributed prod with one "p" instance and one vGPU instance still needs more testing before NFS can be deployed. There is currently a "broken pipe" error for `rodan-main` with the "p" instance.
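To investigate that, the Swarm task state and service logs on the manager are the first things to check (hedged sketch; `rodan` is the assumed stack name):

```bash
# Hedged sketch: inspect task state and recent logs for rodan-main.
docker service ps rodan_rodan-main --no-trunc
docker service logs --tail 200 rodan_rodan-main
```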
Update on July 3, 2024
This is with `p16-24gb` as the manager host on Debian 12. I have tested on a few instances, and it looks like we need Docker <= 26 to be able to set up the Swarm network successfully; the latest Docker 27 leads to errors here and there.
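On Debian 12 this means pinning docker-ce to a 26.x build when installing from Docker's apt repository; a hedged sketch (the exact version string has to come from `apt-cache madison`):

```bash
# Hedged sketch: find a 26.x build and hold it so apt does not upgrade to 27.
apt-cache madison docker-ce | grep 26
sudo apt-get install -y docker-ce=<26.x-version-string> docker-ce-cli=<26.x-version-string> containerd.io
sudo apt-mark hold docker-ce docker-ce-cli
```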
To be able to point the URL at the manager instance's IP and run the GPU container on the vGPU instance, we need to use the following `production.yml`.
```yaml
version: "3.4"
services:
  nginx:
    image: "ddmal/nginx:v3.0.0"
    deploy:
      replicas: 1
      resources:
        reservations:
          cpus: "0.5"
          memory: 0.5G
        limits:
          cpus: "0.5"
          memory: 0.5G
      restart_policy:
        condition: any
        delay: 5s
        window: 30s
      placement:
        constraints:
          - node.role == manager
    healthcheck:
      test: ["CMD", "/usr/sbin/service", "nginx", "status"]
      interval: "30s"
      timeout: "10s"
      retries: 10
      start_period: "5m"
    command: /run/start
    environment:
      TZ: America/Toronto
      SERVER_HOST: rodan.simssa.ca
      TLS: 1
    ports:
      - "80:80"
      - "443:443"
      - "5671:5671"
      - "9002:9002"
    volumes:
      - "resources:/rodan/data"
  rodan-main:
    image: "ddmal/rodan-main:v3.0.0"
    deploy:
      replicas: 1
      resources:
        reservations:
          cpus: "1"
          memory: 3G
        limits:
          cpus: "1"
          memory: 3G
      restart_policy:
        condition: any
        delay: 5s
        window: 30s
      placement:
        constraints:
          - node.role == manager
    healthcheck:
      test: ["CMD-SHELL", "/usr/bin/curl -H 'User-Agent: docker-healthcheck' http://localhost:8000/api/?format=json || exit 1"]
      interval: "30s"
      timeout: "30s"
      retries: 5
      start_period: "2m"
    command: /run/start
    environment:
      TZ: America/Toronto
      SERVER_HOST: rodan.simssa.ca
      CELERY_JOB_QUEUE: None
    env_file:
      - ./scripts/production.env
    volumes:
      - "resources:/rodan/data"
  rodan-client:
    image: "ddmal/rodan-client:nightly"
    deploy:
      placement:
        constraints:
          - node.role == manager
    volumes:
      - "./rodan-client/config/configuration.json:/client/configuration.json"
  iipsrv:
    image: "ddmal/iipsrv:nightly"
    volumes:
      - "resources:/rodan/data"
  celery:
    image: "ddmal/rodan-main:v3.0.0"
    deploy:
      replicas: 1
      resources:
        reservations:
          cpus: "0.8"
          memory: 2G
        limits:
          cpus: "0.8"
          memory: 2G
      restart_policy:
        condition: any
        delay: 5s
        window: 30s
      placement:
        constraints:
          - node.role == manager
    healthcheck:
      test: ["CMD", "celery", "inspect", "ping", "-A", "rodan", "--workdir", "/code/Rodan", "-d", "celery@celery", "-t", "30"]
      interval: "30s"
      timeout: "30s"
      start_period: "1m"
      retries: 5
    command: /run/start-celery
    environment:
      TZ: America/Toronto
      SERVER_HOST: rodan.simssa.ca
      CELERY_JOB_QUEUE: celery
    env_file:
      - ./scripts/production.env
    volumes:
      - "resources:/rodan/data"
  py3-celery:
    image: "ddmal/rodan-python3-celery:v3.0.0"
    deploy:
      replicas: 1
      resources:
        reservations:
          cpus: "3"
          memory: 3G
        limits:
          cpus: "3"
          memory: 3G
      restart_policy:
        condition: any
        delay: 5s
        window: 30s
      placement:
        constraints:
          - node.role == manager
    healthcheck:
      test: ["CMD", "celery", "inspect", "ping", "-A", "rodan", "--workdir", "/code/Rodan", "-d", "celery@Python3", "-t", "30"]
      interval: "30s"
      timeout: "30s"
      retries: 5
    command: /run/start-celery
    environment:
      TZ: America/Toronto
      SERVER_HOST: rodan.simssa.ca
      CELERY_JOB_QUEUE: Python3
    env_file:
      - ./scripts/production.env
    volumes:
      - "resources:/rodan/data"
  gpu-celery:
    image: "ddmal/rodan-gpu-celery:v3.0.0"
    deploy:
      replicas: 1
      resources:
        reservations:
          cpus: "3"
          memory: 18G
        limits:
          cpus: "3"
          memory: 18G
      placement:
        constraints:
          - node.role == worker
      restart_policy:
        condition: any
        delay: 5s
        window: 30s
    healthcheck:
      test: ["CMD", "celery", "inspect", "ping", "-A", "rodan", "--workdir", "/code/Rodan", "-d", "celery@GPU", "-t", "30"]
      interval: "30s"
      timeout: "30s"
      retries: 5
    command: /run/start-celery
    environment:
      TZ: America/Toronto
      SERVER_HOST: rodan.simssa.ca
      CELERY_JOB_QUEUE: GPU
    env_file:
      - ./scripts/production.env
    volumes:
      - "resources:/rodan/data"
  redis:
    image: "redis:alpine"
    deploy:
      replicas: 1
      resources:
        reservations:
          cpus: "1"
          memory: 2G
        limits:
          cpus: "1"
          memory: 2G
      restart_policy:
        condition: any
        delay: 5s
        window: 30s
      placement:
        constraints:
          - node.role == manager
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5
    environment:
      TZ: America/Toronto
  postgres:
    image: "ddmal/postgres-plpython:v3.0.0"
    deploy:
      replicas: 1
      endpoint_mode: dnsrr
      resources:
        reservations:
          cpus: "2"
          memory: 2G
        limits:
          cpus: "2"
          memory: 2G
      restart_policy:
        condition: any
        delay: 5s
        window: 30s
      placement:
        constraints:
          - node.role == manager
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5
    environment:
      TZ: America/Toronto
    volumes:
      - "pg_data:/var/lib/postgresql/data"
      - "pg_backup:/backups"
    env_file:
      - ./scripts/production.env
  rabbitmq:
    image: "rabbitmq:alpine"
    deploy:
      replicas: 1
      resources:
        reservations:
          cpus: "1"
          memory: 4G
        limits:
          cpus: "1"
          memory: 4G
      restart_policy:
        condition: any
        delay: 5s
        window: 30s
      placement:
        constraints:
          - node.role == manager
    healthcheck:
      test: ["CMD", "rabbitmq-diagnostics", "-q", "ping"]
      interval: "30s"
      timeout: "3s"
      retries: 3
    environment:
      TZ: America/Toronto
    env_file:
      - ./scripts/production.env
volumes:
  resources:
  pg_backup:
  pg_data:
```
Essentially, only `gpu-celery` runs on the worker instance. We cannot specify a deploy constraint for `iipsrv`, so it might be deployed on the worker instance as well. All other services should be on the manager node, which has sufficient vCPUs and RAM.
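Placement can be verified after deployment (stack name `rodan` assumed):

```bash
# Hedged sketch: confirm which node each task landed on.
docker stack ps rodan                 # every task with the node it is running on
docker service ps rodan_gpu-celery    # should list the worker (vGPU) node
```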
This server is up on rodan.simssa.ca. The login popup window on the homepage seems to be gone as well, presumably thanks to the larger CPU and memory allocation for each container.
Do not use Docker version >= 27. Only use Docker <= 26.
Remaining issues:
- set up NFS (or other data sharing methods) for all jobs to run
- make sure we can do training with GPU
The remaining issue is in #1181.
To set up NFS, I followed this guide and did everything except the ufw (firewall) setup, which we might add later if needed.
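In practice that amounts to installing the NFS server package on the manager and the client package on the worker (hedged sketch of the usual Debian packages):

```bash
# Hedged sketch of the package setup from the guide.
sudo apt-get install -y nfs-kernel-server   # manager instance: exports the data
sudo apt-get install -y nfs-common          # worker instance: mounts the export
```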
I directly mounted `/var/lib/docker/volumes/` on both instances and restarted the GPU container.
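On the worker, the mount itself is roughly as follows (manager IP is a placeholder; an `/etc/fstab` entry would make it survive reboots):

```bash
# Hedged sketch: mount the manager's Docker volumes directory over NFS on the worker.
sudo mount -t nfs <manager-ip>:/var/lib/docker/volumes /var/lib/docker/volumes
# optional /etc/fstab line for persistence:
# <manager-ip>:/var/lib/docker/volumes  /var/lib/docker/volumes  nfs  defaults  0  0
```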
This is what's inside `/etc/exports` on the manager instance:

```
/var/lib/docker/volumes [worker node public ip](rw,sync,no_subtree_check,no_root_squash)
```

(Run `sudo systemctl restart nfs-kernel-server` after editing.)
Good practice for detaching and deleting a worker instance (all run on the worker instance):
- leave the Docker Swarm network: `docker swarm leave --force`
- unmount the NFS directory: `sudo umount /var/lib/docker/volumes/`
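After the worker has left, the stale node entry can be removed from the manager side (hedged sketch; the hostname is a placeholder):

```bash
# Hedged sketch, run on the manager once the worker has left the swarm.
docker node ls                        # the departed node should show as Down
docker node rm <worker-node-hostname>
```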