grnet/ganetimgr

Instance creation failures not handled

Closed this issue · 4 comments

If you approve a new instance application, but the instance creation fails, then:

  • the web interface returns a 500 error (*)
  • the application moves into the "completed applications" section (although its status is still "submitted")
  • if you click on the application you can see its parameters, but not edit them
  • if you click on the hostname within the application, you get a 404 error (presumably because ganetimgr thinks the instance exists, but it doesn't)

To reproduce: any event which causes instance creation to fail will do, e.g. not having enough RAM on the cluster. However, an easy-to-reproduce case is where the os-type does not exist, e.g. if you select the "noop" os type but this is not installed.

Desired behaviour: ideally, leave the application queued, report the error somewhere (e.g. in the "comments to the user" box), and allow the parameters to be edited and resubmitted. Otherwise, treat it as a rejection and append the error to the "comments to the user".

(*) ganetimgr.log shows:

2014-03-14 00:44:34 [15454] [DEBUG] POST /application/3/review
ERROR:django.request:Internal Server Error: /application/3/review
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/django/core/handlers/base.py", line 111, in get_response
    response = callback(request, *callback_args, **callback_kwargs)
  File "/usr/lib/python2.7/dist-packages/django/contrib/auth/decorators.py", line 20, in _wrapped_view
    return view_func(request, *args, **kwargs)
  File "/srv/www/ganetimgr/apply/views.py", line 151, in review_application
    application.submit()
  File "/srv/www/ganetimgr/apply/models.py", line 187, in submit
    b = beanstalkc.Connection()
  File "/srv/www/ganetimgr/util/beanstalkc.py", line 49, in __init__
    self.connect()
  File "/srv/www/ganetimgr/util/beanstalkc.py", line 63, in connect
    raise SocketError(e)
SocketError: [Errno 111] Connection refused
2014-03-14 00:44:40 [15454] [DEBUG] Closing connection.
DEBUG:gunicorn.error:Closing connection.
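
For illustration, here is a minimal sketch of the "leave it queued and report the error" behaviour described above. It is not the actual ganetimgr code: the status values, the `admin_comments` field, the `review_and_submit` helper and the stub `Application` class are all assumptions, and `SocketError` is a stand-in for the one raised in util/beanstalkc.py.

```python
# Placeholder status values; the real constants live in ganetimgr.
STATUS_PENDING = 100
STATUS_APPROVED = 200


class SocketError(Exception):
    """Stand-in for the SocketError raised by util/beanstalkc.py."""


class Application(object):
    """Stub application; 'admin_comments' is a hypothetical field name."""

    def __init__(self):
        self.status = STATUS_APPROVED
        self.admin_comments = ""

    def submit(self):
        # Simulate beanstalkd not running, as in the traceback above.
        raise SocketError("[Errno 111] Connection refused")

    def save(self):
        pass  # a Django model would persist the row here


def review_and_submit(application):
    """Return (ok, message) instead of letting the error become a 500."""
    try:
        application.submit()
    except SocketError as exc:
        # Put the application back in the queue and record why it failed.
        application.status = STATUS_PENDING
        application.admin_comments = (
            "Instance creation could not be queued: %s" % exc)
        application.save()
        return False, application.admin_comments
    return True, "submitted"


app = Application()
ok, message = review_and_submit(app)
print(ok, message)
```

The view could then show `message` to the reviewer rather than returning an error page.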

Creation failure is certainly not equivalent to rejection.
But this is certainly an issue we also face and we need to remedy...

In the previous example this was due to beanstalkd not running. In another test setup I get the same issue if the OS creation script is unable to run for some reason (see below).

The error message is visible via "gnt-job info", so presumably it is also available via RAPI, but the application just remains in the "submitted" state in the ganetimgr GUI.

root@s1:~# gnt-job list
...
5481 error   INSTANCE_CREATE(host90.ws.nsrc.org)
root@s1:~# gnt-job show 5481
Job ID: 5481
  Status: error
  Received:         2014-05-18 21:46:37.795745
  Processing start: 2014-05-18 21:46:37.880809 (delta 0.085064s)
  Processing end:   2014-05-18 21:46:38.848081 (delta 0.967272s)
  Total processing time: 1.052336 seconds
  Opcodes:
    OP_INSTANCE_CREATE
      Status: error
      Processing start: 2014-05-18 21:46:37.880809
      Execution start:  2014-05-18 21:46:37.965838
      Processing end:   2014-05-18 21:46:38.848045
      Input fields:
        beparams: {'minmem': 512, 'vcpus': 1, 'maxmem': 512}
        comment: None
        conflicts_check: True
        debug_level: 0
        depends: None
        disk_template: plain
        disks: {'size': 5000}
        dry_run: False
        force_variant: False
        hvparams: {}
        hypervisor: kvm
        iallocator: hail
        identify_defaults: False
        ignore_ipolicy: False
        instance_name: host90.ws.nsrc.org
        ip_check: False
        mode: create
        name_check: False
        nics: {'link': 'br-svc', 'mode': 'bridged'}
        opportunistic_locking: False
        os_type: snf-image+default
        osparams: {'img_id': 'debian-wheezy', 'img_ssh_key_url': 'http://example.com/application/1/XZd4sNTqPT/ssh_keys', 'img_format': 'tarball'}
        pnode: s2.ws.nsrc.org
        pnode_uuid: d892aabd-37b5-42d9-9690-0d78abc1e528
        priority: 0
        reason: ['gnt:library:rlib2:instances', '', 1400427997715827968],['gnt:opcode:op_instance_create', 'job=5481;index=0', 1400427997795738880]
        source_shutdown_timeout: 120
        start: False
        tags: ganetimgr:user:nsrc,ganetimgr:application:1
        wait_for_sync: False
      Result:
        OpExecError
        [OS Parameters validation failed on node s2.ws.nsrc.org: The following parameters are not supported by the OS snf-image: img_ssh_key_url]
      Execution log:
        1:2014-05-18 21:46:38.459928:message  - INFO: Selected nodes for instance host90.ws.nsrc.org via iallocator hail: s2.ws.nsrc.org
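
As a rough sketch of how that error could be pulled out of a job record fetched over RAPI: the function below works on a plain dict. The dict shape (`status`, `opstatus`, `opresult`, with a failed opcode's result being `[error_class, [message, ...]]`) is my assumption about what RAPI returns and should be checked against the RAPI documentation; `job_error_message` is a hypothetical helper name.

```python
def job_error_message(job):
    """Return a readable error string for a failed job dict, else None."""
    if job.get("status") != "error":
        return None
    pairs = zip(job.get("opstatus", []), job.get("opresult", []))
    for op_status, op_result in pairs:
        if op_status == "error":
            # Assumed shape: [error_class_name, [message, ...]]
            err_type, err_args = op_result[0], op_result[1]
            detail = err_args[0] if err_args else ""
            return "%s: %s" % (err_type, detail)
    return "unknown error"


# Example record built by hand from the gnt-job output above.
job = {
    "id": 5481,
    "status": "error",
    "opstatus": ["error"],
    "opresult": [[
        "OpExecError",
        ["OS Parameters validation failed on node s2.ws.nsrc.org: "
         "The following parameters are not supported by the OS "
         "snf-image: img_ssh_key_url"],
    ]],
}
print(job_error_message(job))
```

The resulting string is the sort of thing that could be copied into the application's comments.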

I have investigated further and it seems that all the infrastructure is in place for this: the job_id is stored in the applications table, and the watcher has code to monitor the creation status. Something in the watcher appears to prevent applications from advancing from STATUS_SUBMITTED to STATUS_FAILED when the creation fails, and I need to investigate further. (Maybe an exception in mail_admins simply means it never reaches application.save()?)
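
If the mail_admins theory is right, one possible fix is to save the new status first and guard the notification. The sketch below is only that, a sketch: `handle_job_status`, `FakeApplication`, `STATUS_SUCCESS` and the numeric status values are hypothetical, and `notify` stands in for whatever the watcher uses to mail the admins. The "error" and "success" strings are Ganeti job statuses.

```python
import logging

# Placeholder status values; STATUS_SUCCESS is a hypothetical name.
STATUS_SUBMITTED = 210
STATUS_FAILED = 220
STATUS_SUCCESS = 300


class FakeApplication(object):
    """Stub standing in for the Django model."""

    def __init__(self):
        self.status = STATUS_SUBMITTED
        self.saved = False

    def save(self):
        self.saved = True


def handle_job_status(application, job_status, notify):
    """Advance a submitted application; return True if it changed."""
    if application.status != STATUS_SUBMITTED:
        return False
    if job_status == "error":
        application.status = STATUS_FAILED
    elif job_status == "success":
        application.status = STATUS_SUCCESS
    else:
        return False  # job still queued or running
    # Persist the new status before notifying anyone.
    application.save()
    try:
        notify(application)
    except Exception:
        # A mail failure must not undo or block the status change.
        logging.exception("Notification failed; status already saved")
    return True


def broken_mailer(application):
    raise RuntimeError("SMTP server unreachable")


app = FakeApplication()
handle_job_status(app, "error", broken_mailer)
print(app.status == STATUS_FAILED, app.saved)
```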

As for the web interface, it currently divides applications into two classes:

    pending = applications.filter(status=STATUS_PENDING)
    completed = applications.exclude(status=STATUS_PENDING)

I think that it would be helpful if applications in STATUS_FAILED or STATUS_APPROVED could be resubmitted - i.e. to retry the creation of the VM once the underlying problem has been resolved - possibly with some editing of the parameters first. Essentially they could appear just like new requests, perhaps relabelling the "Accept" button to "Resubmit", so I think it's an easy change.
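
A sketch of what that split could look like, written against plain Python objects so it runs on its own. The status values, the `ACTIONABLE` tuple, `split_applications` and `button_label` are all made up for illustration.

```python
from collections import namedtuple

# Placeholder status values; the real constants live in ganetimgr.
STATUS_PENDING = 100
STATUS_APPROVED = 200
STATUS_SUBMITTED = 210
STATUS_FAILED = 220

# Statuses that should be shown as open requests.
ACTIONABLE = (STATUS_PENDING, STATUS_APPROVED, STATUS_FAILED)

App = namedtuple("App", ["hostname", "status"])


def split_applications(applications):
    """Return (actionable, completed) lists of applications."""
    actionable = [a for a in applications if a.status in ACTIONABLE]
    completed = [a for a in applications if a.status not in ACTIONABLE]
    return actionable, completed


def button_label(application):
    """New requests get 'Accept'; retries get 'Resubmit'."""
    if application.status == STATUS_PENDING:
        return "Accept"
    return "Resubmit"


apps = [
    App("new.example.org", STATUS_PENDING),
    App("broken.example.org", STATUS_FAILED),
    App("running.example.org", STATUS_SUBMITTED),
]
actionable, completed = split_applications(apps)
print([(a.hostname, button_label(a)) for a in actionable])
```

With the Django queryset the same split would presumably be `applications.filter(status__in=ACTIONABLE)` and `applications.exclude(status__in=ACTIONABLE)`.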

It looks like only minor other changes are required, e.g.

    def submit(self):
        if self.status != STATUS_APPROVED:
            raise ApplicationError("Invalid application status %d" %
                                   self.status)

would also have to allow applications in state STATUS_FAILED.
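
Roughly like this, as a self-contained sketch. The stub class, the `SUBMITTABLE_STATUSES` name and the numeric status values are assumptions, and the real job submission is left out.

```python
# Placeholder status values; the real constants live in ganetimgr.
STATUS_PENDING = 100
STATUS_APPROVED = 200
STATUS_SUBMITTED = 210
STATUS_FAILED = 220

# Hypothetical name: statuses from which submit() may be called.
SUBMITTABLE_STATUSES = (STATUS_APPROVED, STATUS_FAILED)


class ApplicationError(Exception):
    pass


class Application(object):
    """Stub standing in for the model in apply/models.py."""

    def __init__(self, status):
        self.status = status

    def submit(self):
        if self.status not in SUBMITTABLE_STATUSES:
            raise ApplicationError("Invalid application status %d" %
                                   self.status)
        # The real method would queue the creation job here.
        self.status = STATUS_SUBMITTED


retry = Application(STATUS_FAILED)
retry.submit()
print(retry.status == STATUS_SUBMITTED)
```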

(Note: it is possible to have an application in state APPROVED but not SUBMITTED if there is an error in the Python code which builds the request for submission in apply/models.py)

(Aside: some VM creation failures can leave the VM in place, especially errors during OS installation, so it may be necessary to remove the instance first)

Pull request merged (6bfdf21)

Thanks again, Brian