allegroai/clearml-agent

Understanding ClearML K8s integration

elgalu opened this issue · 2 comments

From what I understood so far:

  1. The "docker mode" via K8s siblings lets users change the default container image depending on their needs, and keeps ClearML from creating a venv because the packages could already be in the container image? Or is that only the case if the agent is started with --standalone-mode?

  2. Let's say we have servers with 8 GPUs each. It seems a user cannot specify, at the experiment level, that they only need 2 GPUs for a specific training run, so this can only be controlled at the queue level? To support all possibilities, would we need to create 8 dynamic GPU queues for the user to choose from?

  3. How do the ClearML auto-scaling features relate to the K8s option? I guess they can't be used together because K8s auto-scales by default?

Hi @elgalu,

I assume you're referring to the open-source agent's support for k8s (the Scale and Enterprise offerings include more options).

The "docker mode" via K8s siblings lets users change the default container image depending on their needs, and keeps ClearML from creating a venv because the packages could already be in the container image? Or is that only the case if the agent is started with --standalone-mode?

In k8s glue mode, the agent is not started in the usual way but through the k8s example script, so there is no docker switch or standalone switch. The agent uses the docker image specified on the task, or falls back to the default docker image (which can be set in the agent configuration).
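To illustrate the fallback described above, here is a minimal sketch of the selection rule only. It is not the agent's actual code; the function name `resolve_container_image` and the image names are hypothetical placeholders.

```python
from typing import Optional


def resolve_container_image(task_image: Optional[str], default_image: str) -> str:
    """Return the task's image when one is set, otherwise the agent default.

    Hypothetical helper: it mirrors the rule described in the reply,
    not the real clearml-agent implementation.
    """
    if task_image and task_image.strip():
        return task_image.strip()
    return default_image


if __name__ == "__main__":
    # Placeholder image names, used purely for illustration.
    print(resolve_container_image("my-registry/train:latest", "python:3.10"))
    print(resolve_container_image(None, "python:3.10"))
```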

Let's say we have servers with 8 GPUs each. It seems a user cannot specify, at the experiment level, that they only need 2 GPUs for a specific training run, so this can only be controlled at the queue level? To support all possibilities, would we need to create 8 dynamic GPU queues for the user to choose from?

That's mostly right. Each queue can be served by its own agent running in k8s glue mode, and that agent supplies the appropriate k8s template for jobs pulled from the queue.
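As a rough sketch of the queue-per-GPU-count idea, the snippet below generates one pod template per queue, each requesting a different number of GPUs. Several things here are assumptions rather than anything confirmed in this thread: the queue names (`gpu_x1`, `gpu_x2`, ...) are made up, the template is assumed to be a plain Kubernetes Pod spec, and `nvidia.com/gpu` is the resource name exposed by the NVIDIA device plugin. The output is JSON, which YAML parsers generally accept.

```python
import json


def gpu_pod_template(gpu_count: int) -> dict:
    """Build a minimal Pod spec fragment that limits the container to `gpu_count` GPUs."""
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "spec": {
            "containers": [
                # Assumes the NVIDIA device plugin's resource name.
                {"resources": {"limits": {"nvidia.com/gpu": gpu_count}}}
            ]
        },
    }


def queue_templates(gpu_counts):
    """Map a hypothetical queue name to its pod template, one per GPU count."""
    return {f"gpu_x{n}": gpu_pod_template(n) for n in gpu_counts}


if __name__ == "__main__":
    for queue, template in queue_templates([1, 2, 4, 8]).items():
        print(f"# queue: {queue}")
        print(json.dumps(template, indent=2))
```

Each generated template would then be handed to the glue agent serving the matching queue, so a user picks the GPU count by picking the queue.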

How do the ClearML auto-scaling features relate to the K8s option? I guess they can't be used together because K8s auto-scales by default?

You are correct - the k8s glue option is basically an auto-scaling solution, no need for the other scale options :)

Hi @jkhenning, regarding 1: what additional options does the Enterprise offering include that are relevant to it? I got lost there, thanks.