DistributedSystemsGroup/zoe

Launch container on a GPU server

Closed this issue · 3 comments

Hello,

I am trying to launch a Jupyter Zapp on a GPU server using labels. I am using the docker-engine backend with this entry in the configuration file:

[gpu01]
docker_address: xx.yy.zz.tt.gg:2375
external_address: xx.yy.zz.tt.gg
use_tls: yes
labels: gpu

In the Zapp JSON we have:

.....
"services": [
{
"labels": [
"gpu"
],
"image": xxxxxxx
......

I have set scheduler-class = ZoeElasticScheduler

The node gpu01 is seen online:

2017-12-29 13:32:37,101 INFO synchro_gpu01->zoe_master.backends.docker.threads: Node gpu01 is now online

The Zapp does not start, and this INFO message is logged:

2017-12-29 13:33:37,227 INFO scheduler->zoe_master.scheduler.simulated_platform: Cannot fit essential service 7 anywhere, bailing out

One important point: I had to modify line 78 of zoe_master/backends/docker/threads.py to
self.host_stats[host_config.name].labels = set(info['Labels'])
instead of
self.host_stats[host_config.name].labels += set(info['Labels'])
because I was getting this error message:

TypeError: unsupported operand type(s) for +=: 'set' and 'set'
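
For reference, Python sets do not implement + or +=; a union is written with |, |=, union() or update(). The following is a minimal standalone sketch of that behaviour. The label values are made up for illustration and are not taken from Zoe or from the Docker engine.

```python
# Illustrative values only: "gpu" stands in for a label from the config file,
# the others for labels that a Docker engine might report.
config_labels = {"gpu"}
engine_labels = ["ssd", "rack1"]

# Sets do not implement + or +=, so this raises TypeError
# and leaves config_labels unchanged.
try:
    config_labels += set(engine_labels)
    plus_equals_error = None
except TypeError as exc:
    plus_equals_error = str(exc)

# union() returns a new set; the original set is not modified.
merged_copy = config_labels.union(engine_labels)

# |= (equivalently update()) merges into the existing set in place.
in_place = set(config_labels)
in_place |= set(engine_labels)

print(plus_equals_error)
print(sorted(merged_copy), sorted(config_labels), sorted(in_place))
```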

The Zapp starts fine if I remove the labels entry from the JSON.

Can you give me some help? I'm stuck. Thanks.

Best regards,
Thomas

Hi,
I assumed that sets could be added to perform a union, but I see that they cannot. Line 78 needs to be changed like this:

self.host_stats[host_config.name].labels.update(set(info['Labels']))

By using = you are throwing away the labels in the config file and keeping only the ones defined by the docker engine.
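
The effect on placement can be sketched as follows. This is not Zoe's code: can_host is a hypothetical helper, and the rule that a service fits only on a host carrying all of its labels is an assumption made for illustration. The engine's label list is simulated with a plain Python list and is assumed not to contain "gpu".

```python
def can_host(host_labels, service_labels):
    """Hypothetical placement rule: every service label must be present on the host."""
    return set(service_labels) <= set(host_labels)


service_labels = ["gpu"]   # requested in the Zapp JSON
config_labels = {"gpu"}    # from the [gpu01] entry in the config file
engine_labels = []         # assumed: the engine itself reports no "gpu" label

# Plain assignment: the config-file labels are replaced by the engine's labels.
replaced = set(engine_labels)

# In-place merge: labels from both sources are kept.
merged = set(config_labels)
merged.update(engine_labels)

print(can_host(replaced, service_labels))  # False: the "gpu" label was lost
print(can_host(merged, service_labels))    # True: the "gpu" label is still there
```

Under these assumptions, losing the config-file labels is enough to make a service that requests "gpu" unplaceable, which is consistent with the "Cannot fit essential service" message above.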

I will merge a fix soon.

Thanks!

Hi,
Thanks, it is working now.

Fixed in commit ef2a2b7