galaxyproject/cloudman

Decouple toggling master as exec host from autoscaling


We are exploring the option of running clusters on demand with a large master node (for better EBS/NFS throughput). Since these large master nodes have a few spare CPU cores available, we would like to allow the master to run jobs.

However, it appears to be impossible to switch the master to run jobs when autoscaling is enabled (message in the Admin console when clicking on "Switch master to run jobs": "Master is not an execution host"; message in the Cluster info log: "The master instance is set to not execute jobs. To manually change this, use the CloudMan Admin panel.").

Would it be possible to decouple toggling master as exec host from autoscaling?

Can you try it now? I've made the change you suggested. Disabling the master was a deliberate design decision, so I haven't committed the change yet; I'd like to get some feedback from you about the experience first.

OK, thanks. Will I get the update by specifying the cloudman-dev bucket when launching a cluster?

Looks like that didn't work. I launched a saved cluster using the cloudman-dev bucket option and enabled autoscaling, and this was displayed in the cluster info log:

16:08:38 - The master instance is set to *not* execute jobs. To manually change this, use the CloudMan Admin panel.
16:08:38 - AS service prerequisites OK; starting the service.

It won't work with a saved cluster, only with new clusters. You could download cm.tar.gz from the cloudman-dev bucket and manually place it into your saved cluster's bucket, then restart the cluster and the changes should take effect. Also, by default, the master will still be changed to not execute jobs once autoscaling is enabled, but clicking that toggle link on the Admin page will re-enable it to run jobs. I figured most people would prefer that option, so it's the default.
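
Roughly something like this for the manual copy, using boto3 (an untested sketch; the target bucket name is a placeholder for your cluster's cm-... bucket, which you can see in the CloudMan log):

import boto3

s3 = boto3.client('s3')
# Fetch the latest CloudMan source from the dev bucket.
s3.download_file('cloudman-dev', 'cm.tar.gz', '/tmp/cm.tar.gz')
# Place it into the saved cluster's bucket (replace with your cluster's cm-... bucket name).
s3.upload_file('/tmp/cm.tar.gz', 'cm-your-cluster-bucket', 'cm.tar.gz')

Then restart the cluster.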

BTW, when you choose the 16.10 image, the cloudman-dev bucket is the default, so you don't need to specify it.

I've created a new cluster using the cloudman-dev bucket, but I'm now unable to add admin users via the CloudMan Admin Console. The email address is accepted and Galaxy is restarted, but no admin rights are granted to the user, and the galaxy.ini file doesn't contain an admin_users setting. The cluster info log contains:

22:24:13 - Completed the initial cluster startup process. Configuring a previously existing cluster of type Galaxy.
22:24:29 - Adding volume vol-0d0763a423497b8fb (galaxy FS)...
22:24:48 - Supervisor service prerequisites OK; starting the service.
22:24:48 - Migration service prerequisites OK; starting the service.
22:24:54 - Postgres service prerequisites OK; starting the service.
22:24:55 - ProFTPd service prerequisites OK; starting the service.
22:24:56 - Slurmctld service prerequisites OK; starting the service.
22:24:58 - Galaxy service prerequisites OK; starting the service.
22:25:07 - Slurmd service prerequisites OK; starting the service.
22:25:35 - Galaxy service state changed from 'Starting' to 'Running'
22:25:36 - All cluster services started; the cluster is ready for use.
22:27:02 - Received following list of admin users: 'refinery@stemcellcommons.org'
22:27:02 - Galaxy admins added: [u'refinery@stemcellcommons.org']; restarting Galaxy
22:27:02 - Restarting Galaxy service
22:27:02 - Removing 'Galaxy' service
22:27:02 - Shutting down Galaxy...
22:27:03 - Removing 'Galaxy' service
22:27:03 - Shutting down Galaxy...
22:27:25 - Galaxy service state changed from 'Starting' to 'Running'

cloudman.log:

2016-11-01 22:27:02,234 INFO            root:602  Received following list of admin users: 'refinery@stemcellcommons.org'
2016-11-01 22:27:02,234 INFO            root:616  Galaxy admins added: [u'refinery@stemcellcommons.org']; restarting Galaxy
2016-11-01 22:27:02,234 INFO          galaxy:74   Restarting Galaxy service
2016-11-01 22:27:02,235 INFO          galaxy:64   Removing 'Galaxy' service
2016-11-01 22:27:02,235 DEBUG       __init__:408  Setting service Galaxy as not `activated`
2016-11-01 22:27:02,235 DEBUG         galaxy:69   Resetting Galaxy remaining_start_attempts to 2.
2016-11-01 22:27:02,245 DEBUG         galaxy:90   Using Galaxy from '/mnt/galaxy/galaxy-app'
2016-11-01 22:27:02,265 INFO          galaxy:166  Shutting down Galaxy...
2016-11-01 22:27:03,381 DEBUG         master:2770 Monitor stopping service 'Galaxy' in state 'Shutting down'
2016-11-01 22:27:03,381 INFO          galaxy:64   Removing 'Galaxy' service
2016-11-01 22:27:03,381 DEBUG       __init__:408  Setting service Galaxy as not `activated`
2016-11-01 22:27:03,381 DEBUG         galaxy:69   Resetting Galaxy remaining_start_attempts to 2.
2016-11-01 22:27:03,392 DEBUG         galaxy:90   Using Galaxy from '/mnt/galaxy/galaxy-app'
2016-11-01 22:27:03,411 INFO          galaxy:166  Shutting down Galaxy...
2016-11-01 22:27:03,431 DEBUG       __init__:53   'galaxy' daemon is NOT running any more (expected pid: '3228').
2016-11-01 22:27:03,431 DEBUG         galaxy:273  Galaxy UI does not seem to be accessible.
2016-11-01 22:27:03,451 DEBUG       __init__:53   'galaxy' daemon is NOT running any more (expected pid: '3228').
2016-11-01 22:27:03,451 DEBUG         galaxy:273  Galaxy UI does not seem to be accessible.
2016-11-01 22:27:03,451 DEBUG         galaxy:173  Galaxy not running; setting service state to SHUT_DOWN.
2016-11-01 22:27:03,461 DEBUG         master:2652 Storing cluster configuration to cluster's bucket
2016-11-01 22:27:03,659 DEBUG           misc:625  Saved file 'persistent_data-current.yaml' of size 763B as 'persistent_data.yaml' to bucket 'cm-7299080057c55c0e6b974c2d532180bd'
2016-11-01 22:27:03,659 DEBUG         master:2668 Saving current instance boot script (/opt/cloudman/boot/cm_boot.py) to cluster bucket 'cm-7299080057c55c0e6b974c2d532180bd' as 'cm_boot.py'
2016-11-01 22:27:03,709 DEBUG           misc:625  Saved file '/opt/cloudman/boot/cm_boot.py' of size 24019B as 'cm_boot.py' to bucket 'cm-7299080057c55c0e6b974c2d532180bd'
2016-11-01 22:27:03,709 DEBUG         master:2675 Saving CloudMan source (/mnt/cm/cm.tar.gz) to cluster bucket 'cm-7299080057c55c0e6b974c2d532180bd' as 'cm.tar.gz'
2016-11-01 22:27:03,936 DEBUG           misc:850  '/bin/su - galaxy -c "export GALAXY_HOME='/mnt/galaxy/galaxy-app'; export TMPDIR='/mnt/galaxy/tmp'; export TEMP='/mnt/galaxy/tmp'; source $GALAXY_HOME/.venv/bin/activate; sh $GALAXY_HOME/run.sh --pid-file=main.pid --log-file=main.log --stop-daemon"' command OK
2016-11-01 22:27:03,947 DEBUG         galaxy:273  Galaxy UI does not seem to be accessible.
2016-11-01 22:27:03,947 DEBUG         galaxy:173  Galaxy not running; setting service state to SHUT_DOWN.
2016-11-01 22:27:03,969 DEBUG         galaxy:273  Galaxy UI does not seem to be accessible.
2016-11-01 22:27:03,969 DEBUG         galaxy:60   Service Galaxy self-activated
2016-11-01 22:27:03,980 DEBUG         galaxy:90   Using Galaxy from '/mnt/galaxy/galaxy-app'
2016-11-01 22:27:04,000 DEBUG     decorators:83   Delay trigger not met (delta: 0; delay: 10. skipping method cm.services.apps.galaxy->GalaxyService.status
2016-11-01 22:27:04,010 DEBUG         galaxy:273  Galaxy UI does not seem to be accessible.
2016-11-01 22:27:04,010 DEBUG         galaxy:146  Starting Galaxy...
2016-11-01 22:27:04,017 DEBUG    galaxy_conf:34   Attemping to chown to galaxy for /mnt/galaxy/tmp
2016-11-01 22:27:04,047 DEBUG    galaxy_conf:195  Rewriting Galaxy's main config file: /mnt/galaxy/galaxy-app/config/galaxy.ini
2016-11-01 22:27:04,048 DEBUG    galaxy_conf:34   Attemping to chown to galaxy for /mnt/galaxy/galaxy-app/config/galaxy.ini
2016-11-01 22:27:04,062 DEBUG           misc:625  Saved file '/mnt/cm/cm.tar.gz' of size 1114739B as 'cm.tar.gz' to bucket 'cm-7299080057c55c0e6b974c2d532180bd'
2016-11-01 22:27:04,062 DEBUG         master:2698 Saving '/mnt/cm/cm-16.10-dev-as.clusterName' file to cluster bucket 'cm-7299080057c55c0e6b974c2d532180bd' as 'cm-16.10-dev-as.clusterName'
2016-11-01 22:27:04,093 DEBUG           misc:625  Saved file '/mnt/cm/cm-16.10-dev-as.clusterName' of size 0B as 'cm-16.10-dev-as.clusterName' to bucket 'cm-7299080057c55c0e6b974c2d532180bd'
2016-11-01 22:27:05,745 DEBUG           misc:850  '/bin/su - galaxy -c "export GALAXY_HOME='/mnt/galaxy/galaxy-app'; export TMPDIR='/mnt/galaxy/tmp'; export TEMP='/mnt/galaxy/tmp'; source $GALAXY_HOME/.venv/bin/activate; sh $GALAXY_HOME/run.sh --pid-file=main.pid --log-file=main.log --daemon"' command OK
2016-11-01 22:27:05,805 DEBUG            cmd:584  Popen(['git', 'cat-file', '--batch-check'], cwd=/mnt/galaxy/galaxy-app, universal_newlines=False, shell=None)
2016-11-01 22:27:05,812 DEBUG            cmd:584  Popen(['git', 'cat-file', '--batch'], cwd=/mnt/galaxy/galaxy-app, universal_newlines=False, shell=None)
2016-11-01 22:27:06,906 DEBUG            cmd:584  Popen(['git', 'cat-file', '--batch-check'], cwd=/mnt/galaxy/galaxy-app, universal_newlines=False, shell=None)
2016-11-01 22:27:06,913 DEBUG            cmd:584  Popen(['git', 'cat-file', '--batch'], cwd=/mnt/galaxy/galaxy-app, universal_newlines=False, shell=None)
2016-11-01 22:27:09,431 DEBUG          nginx:263  Nginx service detected a change in proxied services; reconfiguring the nginx config (active proxied: []; active: []).
2016-11-01 22:27:09,431 DEBUG          nginx:144  Updating Nginx config at /etc/nginx/nginx.conf
2016-11-01 22:27:09,437 DEBUG           misc:878  Executed command '/usr/sbin/nginx -v' and got output: 'nginx version: nginx/1.4.6 (Ubuntu)'
2016-11-01 22:27:09,438 DEBUG          nginx:164  Using Nginx v1.4+ template
2016-11-01 22:27:09,438 DEBUG          nginx:128  Wrote Nginx config file /etc/nginx/nginx.conf
2016-11-01 22:27:09,438 DEBUG          nginx:128  Wrote Nginx config file /etc/nginx/sites-enabled/default.server
2016-11-01 22:27:09,439 DEBUG          nginx:128  Wrote Nginx config file /etc/nginx/sites-enabled/default.locations
2016-11-01 22:27:09,439 DEBUG          nginx:128  Wrote Nginx config file /etc/nginx/sites-enabled/galaxy.locations
2016-11-01 22:27:09,483 DEBUG           misc:850  '/usr/sbin/nginx -c /etc/nginx/nginx.conf -s reload' command OK
2016-11-01 22:27:09,531 DEBUG     decorators:83   Delay trigger not met (delta: 5; delay: 10. skipping method cm.services.apps.galaxy->GalaxyService.status
2016-11-01 22:27:09,532 DEBUG         master:2868 S&S: AS..Unstarted; ClouderaManager..Unstarted; Cloudgene..Unstarted; Galaxy..Starting; GalaxyReports..Unstarted; Migration..Completed; Nginx..OK; NodeJSProxy..Unstarted; PSS..Completed; Postgres..OK; ProFTPd..OK; Pulsar..Unstarted; Slurmctld..OK; Slurmd..OK; Supervisor..OK; galaxy FS..OK; galaxyIndices FS..OK; transient_nfs FS..OK;
2016-11-01 22:27:15,079 DEBUG            cmd:584  Popen(['git', 'cat-file', '--batch-check'], cwd=/mnt/galaxy/galaxy-app, universal_newlines=False, shell=None)
2016-11-01 22:27:15,087 DEBUG            cmd:584  Popen(['git', 'cat-file', '--batch'], cwd=/mnt/galaxy/galaxy-app, universal_newlines=False, shell=None)
2016-11-01 22:27:23,068 DEBUG            cmd:584  Popen(['git', 'cat-file', '--batch-check'], cwd=/mnt/galaxy/galaxy-app, universal_newlines=False, shell=None)
2016-11-01 22:27:23,075 DEBUG            cmd:584  Popen(['git', 'cat-file', '--batch'], cwd=/mnt/galaxy/galaxy-app, universal_newlines=False, shell=None)
2016-11-01 22:27:25,394 INFO          galaxy:245  Galaxy service state changed from 'Starting' to 'Running'

Yep, there was/is a bug there when admins are set via user data, and in the process of trying to fix it I've only added more bugs... At this point it's a WIP. Thanks for trying this out and reporting the issues.

I believe this is fixed now. Latest code is in the cloudman-dev bucket.

OK, thanks. BTW, it was me who originally requested preventing the master from running jobs by default when autoscaling is enabled (#40) :) However, if others found it useful, I don't mind leaving it as is. It looks like BioBlend allows setting the master as an exec host, so it's just a matter of one extra API call.
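
Something along these lines with BioBlend (a sketch only; the exact method name for toggling the master may differ in the bioblend.cloudman API you have installed):

from bioblend.cloudman import CloudManInstance

cm = CloudManInstance("http://master-ip", "cluster-password")
cm.enable_autoscaling(minimum_nodes=0, maximum_nodes=2)
# Re-enable the master as an execution host after it gets turned off;
# the method name here is illustrative - check the bioblend.cloudman docs for your version.
cm.set_master_as_exec_host(True)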

I've successfully tested adding admin users and setting the master as an exec host via the Admin console using the Dev 11/01 flavor. However, I couldn't start a cluster using the Dev flavor; the CloudMan log contains the following error:

2016-11-03 20:32:43,231 ERROR     supervisor:155  Fault starting supervisord prog: <Fault 50: 'SPAWN_ERROR: galaxy_nodejs_proxy'>

Use the default Dev 11/01 flavor. The other ones are defunct at this point.

We'll come full circle then on the auto-scaling piece and revert all of it :)

Thank you! So, I've started a new cluster (Dev 11/01 flavor), enabled AS, and started a couple of jobs to get a worker online, but when the worker came online the master was set to not run jobs:

15:33:22 - AS service prerequisites OK; starting the service.
15:37:48 - The master instance is set to *not* execute jobs. To manually change this, use the CloudMan Admin panel.
15:37:48 - Adding 1 on-demand instance(s)

and then:

15:58:45 - Instance 'i-0d91d569ab6c4c11b' removed from the internal instance list.
15:58:45 - The master instance is set to execute jobs. To manually change this, use the CloudMan Admin panel.

I would like autoscaling and setting the master as an exec host to be completely independent of each other.

This is getting more involved because the scenario you describe has little to do with auto-scaling alone. What's triggering the removal of the master in your scenario is the worker being added, which is the same process regardless of whether it is done manually or via auto-scaling. So changing that behavior would have a big downstream effect that's probably undesirable for the majority of users.

One potential option would be to go back to needing to explicitly set the master to be an exec host and have that setting be persistent (it currently resets each time a worker is added). I remember there being some annoying implementation detail that made this overly complicated to implement, but if that sounds like a sensible compromise, I'll look into it again.
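
Roughly, the idea would be something like this (a sketch only; the attribute names are illustrative and not the actual CloudMan code):

def on_worker_added(app):
    # Hypothetical hook run when a worker instance joins the cluster:
    # only take the master out of the exec pool automatically if the
    # admin has not explicitly pinned the setting.
    if not getattr(app.manager, 'keep_master_exec_host', False):
        app.manager.master_exec_host = False  # attribute name assumed for illustration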

OK, thank you. I think making the master exec host setting persistent would help with better CPU utilization on larger master nodes and make the configuration more flexible. However, if it is difficult to implement, it might not be worth pursuing.

I've poked around with this change and I believe it behaves as we discussed. Let me know if that's not the case for you; the code is available from the cloudman-dev bucket.

Thanks. I've created a new cluster (Dev 11/01 flavor), enabled AS, and ran a workflow, but AS didn't spin up a worker node (#60 (comment)). This time it was with the stock job_conf.xml.
Galaxy log:

galaxy.jobs.runners.drmaa DEBUG 2016-11-08 21:44:38,266 (9) native specification is: --nodes=1 --ntasks=4
galaxy.jobs.runners.drmaa INFO 2016-11-08 21:44:38,267 (9) queued as 10
galaxy.jobs DEBUG 2016-11-08 21:44:38,267 (9) Persisting job destination (destination id: slurm_cluster_cpu4)
galaxy.jobs.runners.drmaa DEBUG 2016-11-08 21:44:38,863 (9/10) state change: job is queued and active

CM log:

2016-11-08 21:44:28,631 DEBUG      autoscale:151  Checking if cluster too SMALL: minute:44,idle:0,total workers:0,avail workers:0,min:0,max:2
2016-11-08 21:44:28,680 DEBUG      autoscale:179  Checking if slow job turnover: queued jobs: 0, avg runtime: 22
2016-11-08 21:44:30,017 DEBUG         master:2865 S&S: AS..OK; ClouderaManager..Unstarted; Cloudgene..Unstarted; Galaxy..OK; GalaxyReports..Shut down; Migration..Completed; Nginx..OK; NodeJSProxy..OK; PSS..Completed; Postgres..OK; ProFTPd..OK; Pulsar..Unstarted; Slurmctld..OK; Slurmd..OK; Supervisor..OK; galaxy FS..OK; galaxyIndices FS..OK; transient_nfs FS..OK;
2016-11-08 21:45:00,055 DEBUG      autoscale:151  Checking if cluster too SMALL: minute:45,idle:0,total workers:0,avail workers:0,min:0,max:2
2016-11-08 21:45:00,083 DEBUG      autoscale:179  Checking if slow job turnover: queued jobs: 0, avg runtime: 0
2016-11-08 21:45:01,341 DEBUG         master:2865 S&S: AS..OK; ClouderaManager..Unstarted; Cloudgene..Unstarted; Galaxy..OK; GalaxyReports..Shut down; Migration..Completed; Nginx..OK; NodeJSProxy..OK; PSS..Completed; Postgres..OK; ProFTPd..OK; Pulsar..Unstarted; Slurmctld..OK; Slurmd..OK; Supervisor..OK; galaxy FS..OK; galaxyIndices FS..OK; transient_nfs FS..OK;
2016-11-08 21:45:31,375 DEBUG      autoscale:151  Checking if cluster too SMALL: minute:45,idle:0,total workers:0,avail workers:0,min:0,max:2
2016-11-08 21:45:31,398 DEBUG      autoscale:179  Checking if slow job turnover: queued jobs: 0, avg runtime: 0
2016-11-08 21:45:32,139 DEBUG         master:2865 S&S: AS..OK; ClouderaManager..Unstarted; Cloudgene..Unstarted; Galaxy..OK; GalaxyReports..Shut down; Migration..Completed; Nginx..OK; NodeJSProxy..OK; PSS..Completed; Postgres..OK; ProFTPd..OK; Pulsar..Unstarted; Slurmctld..OK; Slurmd..OK; Supervisor..OK; galaxy FS..OK; galaxyIndices FS..OK; transient_nfs FS..OK;

SLURM:

galaxy@ip-172-31-10-123:~$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                10      main g9_trimm   galaxy PD       0:00      1 (ReqNodeNotAvail)

Also, adding a worker node manually configures the master to not execute jobs:

22:37:22 - Removing 'Autoscale' service
22:38:25 - The master instance is set to *not* execute jobs. To manually change this, use the CloudMan Admin panel.
22:38:25 - Adding 1 on-demand instance(s)

Correct, that's the expected default behavior. This is what I was saying yesterday: by default, the master will be set to not execute jobs as soon as a worker is added. However, once that option is manually changed, the change will persist.

Thanks, I see now. It does work as you described. However, a new cluster starts with the master set to run jobs, so to make it sticky I have to unset it and then set it again. Perhaps "auto" and "manual" modes could be added for this setting? In auto mode it would be toggled by adding/removing workers, and in manual mode it would be set via the API or the Admin console.

The new release candidate looks great and I can't wait to put it in production, but I was wondering if it would be possible to add a UI element (a check box?) to the Admin console to set self.app.manager.keep_master_exec_host = True, to avoid unsetting and setting the master to run jobs every time a new cluster is started.

How about doing it through the user data supplied when launching an instance? For example:
keep_master_exec_host: True
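
For instance, if launching directly via the EC2 API rather than the launch form, the option just goes into the instance user data alongside the usual CloudMan keys (a sketch with boto3; the AMI ID, instance type, key pair, and security group are placeholders, and credentials/access keys are omitted):

import boto3

ec2 = boto3.client('ec2')
# CloudMan reads its options from the instance user data as YAML.
user_data = """\
cluster_name: my-cluster
password: cluster-password
keep_master_exec_host: True
"""
ec2.run_instances(ImageId='ami-placeholder', InstanceType='m4.2xlarge',
                  MinCount=1, MaxCount=1, KeyName='my-key-pair',
                  SecurityGroups=['CloudMan'], UserData=user_data)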

Thanks, I didn't realize I could manipulate the config like that. However, that didn't work:

16:45:41 - The master instance is set to *not* execute jobs. To manually change this, use the CloudMan Admin panel.
16:45:41 - Adding 1 on-demand instance(s)