Instance creation failures not handled
If you approve a new instance application, but the instance creation fails, then:
- the web interface returns a 500 error (*)
- the application moves into the "completed applications" section (although its status is still "submitted")
- if you click on the application you can see its parameters, but not edit them
- if you click on the hostname within the application, you get a 404 error (presumably because ganetimgr thinks the instance exists, but it doesn't)
To reproduce: any type of event which causes instance creation to fail will do, e.g. not having enough RAM on the cluster. However an easy-to-reproduce case is where the os-type does not exist, e.g. if you select the "noop" os type but this is not installed.
Desired behaviour: ideally leave the application queued, report the error somewhere (e.g. in the "comments to the user" box), and allow the parameters to be edited and resubmitted. Otherwise treat it as a rejection and append the error to the "comments to the user" box.
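A rough sketch of the first option, for illustration only. It does not use the real ganetimgr models: `FakeApplication`, `approve_and_submit`, the `admin_comments` attribute and the string status values are all hypothetical stand-ins. The idea is simply to catch whatever the submission step raises, put the application back in the queue, and record the error where the reviewer can see it.

```python
# Hypothetical stand-ins; the real ganetimgr model, field names and status
# values may differ.
STATUS_PENDING = "pending"
STATUS_APPROVED = "approved"
STATUS_SUBMITTED = "submitted"


class FakeApplication:
    """Minimal in-memory stand-in for an instance application."""

    def __init__(self, submit_error=None):
        self.status = STATUS_PENDING
        self.admin_comments = ""
        self._submit_error = submit_error

    def submit(self):
        # Simulates job submission, which may fail (e.g. the queue is down).
        if self._submit_error is not None:
            raise self._submit_error
        self.status = STATUS_SUBMITTED


def approve_and_submit(application):
    """Approve and submit; on failure keep the application queued.

    Returns True if the job was submitted, False if it was put back in the
    pending state with the error appended to the comments.
    """
    application.status = STATUS_APPROVED
    try:
        application.submit()
    except Exception as exc:  # broad on purpose: any failure should be reported
        application.status = STATUS_PENDING
        application.admin_comments += "Instance creation failed: %s\n" % exc
        return False
    return True
```

A view built along these lines could re-render the review form with the recorded error instead of returning a 500.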
(*) ganetimgr.log shows:
2014-03-14 00:44:34 [15454] [DEBUG] POST /application/3/review
ERROR:django.request:Internal Server Error: /application/3/review
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/django/core/handlers/base.py", line 111, in get_response
    response = callback(request, *callback_args, **callback_kwargs)
  File "/usr/lib/python2.7/dist-packages/django/contrib/auth/decorators.py", line 20, in _wrapped_view
    return view_func(request, *args, **kwargs)
  File "/srv/www/ganetimgr/apply/views.py", line 151, in review_application
    application.submit()
  File "/srv/www/ganetimgr/apply/models.py", line 187, in submit
    b = beanstalkc.Connection()
  File "/srv/www/ganetimgr/util/beanstalkc.py", line 49, in __init__
    self.connect()
  File "/srv/www/ganetimgr/util/beanstalkc.py", line 63, in connect
    raise SocketError(e)
SocketError: [Errno 111] Connection refused
2014-03-14 00:44:40 [15454] [DEBUG] Closing connection.
DEBUG:gunicorn.error:Closing connection.
Creation failure is certainly not equivalent to rejection.
But this is certainly an issue we also face and we need to remedy...
In the previous example this was due to beanstalkd not running. In another test setup I get the same issue if the OS creation script is unable to run for some reason (see below).
The error message is visible via "gnt-job info" so presumably is available via RAPI, but the job just remains in "submitted" state in the ganetimgr GUI.
root@s1:~# gnt-job list
...
5481 error INSTANCE_CREATE(host90.ws.nsrc.org)
root@s1:~# gnt-job show 5481
Job ID: 5481
Status: error
Received: 2014-05-18 21:46:37.795745
Processing start: 2014-05-18 21:46:37.880809 (delta 0.085064s)
Processing end: 2014-05-18 21:46:38.848081 (delta 0.967272s)
Total processing time: 1.052336 seconds
Opcodes:
OP_INSTANCE_CREATE
Status: error
Processing start: 2014-05-18 21:46:37.880809
Execution start: 2014-05-18 21:46:37.965838
Processing end: 2014-05-18 21:46:38.848045
Input fields:
beparams: {'minmem': 512, 'vcpus': 1, 'maxmem': 512}
comment: None
conflicts_check: True
debug_level: 0
depends: None
disk_template: plain
disks: {'size': 5000}
dry_run: False
force_variant: False
hvparams: {}
hypervisor: kvm
iallocator: hail
identify_defaults: False
ignore_ipolicy: False
instance_name: host90.ws.nsrc.org
ip_check: False
mode: create
name_check: False
nics: {'link': 'br-svc', 'mode': 'bridged'}
opportunistic_locking: False
os_type: snf-image+default
osparams: {'img_id': 'debian-wheezy', 'img_ssh_key_url': 'http://example.com/application/1/XZd4sNTqPT/ssh_keys', 'img_format': 'tarball'}
pnode: s2.ws.nsrc.org
pnode_uuid: d892aabd-37b5-42d9-9690-0d78abc1e528
priority: 0
reason: ['gnt:library:rlib2:instances', '', 1400427997715827968],['gnt:opcode:op_instance_create', 'job=5481;index=0', 1400427997795738880]
source_shutdown_timeout: 120
start: False
tags: ganetimgr:user:nsrc,ganetimgr:application:1
wait_for_sync: False
Result:
OpExecError
[OS Parameters validation failed on node s2.ws.nsrc.org: The following parameters are not supported by the OS snf-image: img_ssh_key_url]
Execution log:
1:2014-05-18 21:46:38.459928:message - INFO: Selected nodes for instance host90.ws.nsrc.org via iallocator hail: s2.ws.nsrc.org
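For illustration, here is a sketch of how the error above might be pulled out of a job description once it has been fetched over RAPI and decoded from JSON. The `status`, `opstatus` and `opresult` keys are how I understand the RAPI job resource to be laid out, and the `[error_class, [args...]]` shape of a failed opcode result is an assumption that should be checked against a real response. `job_error_message` and `sample_job` are made-up names.

```python
def job_error_message(job):
    """Return a readable error string for a failed job dict, or None.

    `job` is assumed to be the decoded JSON description of a single job.
    """
    if job.get("status") != "error":
        return None
    for status, result in zip(job.get("opstatus", []), job.get("opresult", [])):
        if status != "error":
            continue
        # Assumed shape of a failed opcode result: [error_class_name, [arg, ...]]
        if isinstance(result, (list, tuple)) and len(result) == 2:
            name, args = result
            if isinstance(args, (list, tuple)):
                detail = ", ".join(str(a) for a in args)
            else:
                detail = str(args)
            return "%s: %s" % (name, detail)
        # Unknown shape: fall back to the raw representation.
        return str(result)
    return "unknown error"


# Hand-written sample modelled on the job shown above (not a real RAPI dump).
sample_job = {
    "id": 5481,
    "status": "error",
    "opstatus": ["error"],
    "opresult": [
        ["OpExecError",
         ["OS Parameters validation failed on node s2.ws.nsrc.org"]],
    ],
}
```

The returned string is the sort of thing that could be shown in the GUI or appended to the comments, instead of leaving the application in the "submitted" state.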
I have investigated further and it seems that all the infrastructure is already in place for this: the job_id is stored in the applications table, and the watcher has code to monitor the creation status. Something in the watcher appears to prevent applications from advancing from STATUS_SUBMITTED to STATUS_FAILED when the creation fails, and I need to investigate further. (Maybe an exception raised in mail_admins simply stops it from ever reaching application.save()?)
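If that guess is right, one way around it would be to save the new status before notifying anyone, and to treat a notification failure as something to log rather than something fatal. A minimal sketch, not the actual watcher code: `mark_failed`, `FakeApplication` and the `notify` callable (standing in for mail_admins) are hypothetical, as are the status values.

```python
import logging

logger = logging.getLogger("watcher_sketch")

# Placeholder values; the real constants live in the apply app.
STATUS_SUBMITTED = "submitted"
STATUS_FAILED = "failed"


class FakeApplication:
    """In-memory stand-in that records what was last saved."""

    def __init__(self):
        self.status = STATUS_SUBMITTED
        self.saved_status = None

    def save(self):
        self.saved_status = self.status


def mark_failed(application, error, notify):
    """Persist the failed status first, then notify.

    `notify` is any callable taking (subject, message). If it raises,
    the error is logged and the saved status is left untouched.
    """
    application.status = STATUS_FAILED
    application.save()
    try:
        notify("Instance creation failed", str(error))
    except Exception:
        logger.exception("Could not notify admins about failed application")
```

With this ordering a broken mail setup can no longer leave the application stuck in the submitted state.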
As for the web interface, it currently divides applications into two classes:
    pending = applications.filter(status=STATUS_PENDING)
    completed = applications.exclude(status=STATUS_PENDING)
I think that it would be helpful if applications in STATUS_FAILED or STATUS_APPROVED could be resubmitted - i.e. to retry the creation of the VM once the underlying problem has been resolved - possibly with some editing of the parameters first. Essentially they could appear just like new requests, perhaps relabelling the "Accept" button to "Resubmit", so I think it's an easy change.
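To make the proposed split concrete, here is a small sketch using plain dicts instead of a queryset. The numeric status values, `ACTIONABLE`, `split_applications` and `button_label` are all hypothetical. With the Django ORM the same split could presumably be written with a `status__in` lookup in `filter()` and `exclude()`.

```python
# Placeholder values; only the names mirror the constants mentioned above.
STATUS_PENDING, STATUS_APPROVED, STATUS_SUBMITTED, STATUS_FAILED = range(4)

# Statuses that should show up in the reviewer's queue.
ACTIONABLE = {STATUS_PENDING, STATUS_APPROVED, STATUS_FAILED}


def split_applications(applications):
    """Split applications into (actionable, completed) lists."""
    pending = [a for a in applications if a["status"] in ACTIONABLE]
    completed = [a for a in applications if a["status"] not in ACTIONABLE]
    return pending, completed


def button_label(application):
    """New requests get "Accept"; anything being retried gets "Resubmit"."""
    if application["status"] == STATUS_PENDING:
        return "Accept"
    return "Resubmit"
```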
It looks like only minor other changes are required, e.g.
    def submit(self):
        if self.status != STATUS_APPROVED:
            raise ApplicationError("Invalid application status %d" %
                                   self.status)
would also have to allow jobs in state STATUS_FAILED.
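Something along these lines, as a self-contained sketch rather than a patch. The numeric status values, the `SUBMITTABLE` tuple and the stripped-down `Application` class are placeholders, and the real method would go on to build and queue the job.

```python
# Placeholder values; the real constants are defined in apply/models.py.
STATUS_APPROVED = 200
STATUS_SUBMITTED = 210
STATUS_FAILED = 230

# Statuses from which (re)submission is allowed.
SUBMITTABLE = (STATUS_APPROVED, STATUS_FAILED)


class ApplicationError(Exception):
    pass


class Application:
    """Stripped-down stand-in for the real model."""

    def __init__(self, status):
        self.status = status

    def submit(self):
        if self.status not in SUBMITTABLE:
            raise ApplicationError("Invalid application status %d" %
                                   self.status)
        # The real method would build and queue the creation job here.
        self.status = STATUS_SUBMITTED
```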
(Note: an application can end up in state APPROVED without ever reaching SUBMITTED if there is an error in the Python code in apply/models.py that builds the request for submission.)
(Aside: some VM creation failures can leave the VM in place, especially errors during OS installation, so it may be necessary to remove the instance first)