Server showing node as unusable
vidit-bhatia opened this issue · 14 comments
My server is showing its nodes as unusable. The server was working just a few days ago. I tried logging on to the node as described in #9.
Nothing was reported other than an internal error. Is there any way I can start my server or fix these nodes?
{"Code":"InternalError","Message":"","Category":"InternalError","ExitCode":1,"Details":null}
Thanks for reporting the problem. Can you please provide stdout.txt and stderr.txt from /mnt/batch/tasks/startup/ so we can investigate? You can solve the problem by resizing the cluster to 0 and back to 2.
az batchai cluster resize -n -g -t 0
az batchai cluster resize -n -g -t 1
Thanks,
Alex
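If the resize has to be repeated for several clusters, the two commands above can be generated rather than typed by hand. This is only a minimal sketch: it reuses the `-n`, `-g` and `-t` flags exactly as they appear in the commands above, and the cluster name and resource group passed in the usage part are hypothetical placeholders, not values from this issue. It prints the commands and does not run them.

```python
import shlex


def build_resize_commands(cluster_name, resource_group, target_size):
    """Return two argv lists: resize to 0 nodes, then back to target_size."""
    if target_size < 1:
        raise ValueError("target_size must be at least 1")
    # Flags are copied from the commands quoted in this thread.
    base = [
        "az", "batchai", "cluster", "resize",
        "-n", cluster_name,
        "-g", resource_group,
        "-t",
    ]
    return [base + ["0"], base + [str(target_size)]]


if __name__ == "__main__":
    # "mycluster" and "myresourcegroup" are hypothetical placeholder names.
    for argv in build_resize_commands("mycluster", "myresourcegroup", 1):
        print(shlex.join(argv))
```

Keeping the commands as argv lists means they could later be passed to `subprocess.run` without shell quoting problems, if running them from a script is preferred.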
vidit-bhatia, I see you still have one node in the unusable state. You will probably want to delete it unless you are using it via ssh, because it is still allocated and counts as being used by you (so it will be included in the bill). You can set the min size for your cluster to 0 so that nodes are deleted when you are not using them.
Please note that the system checks every 5 minutes whether it needs to resize the cluster, so it can take up to 5 minutes for BatchAI to start allocating nodes after you submit a job.
vidit-bhatia, can you please recreate your cluster? The issue is that your cluster was created before the recent Ubuntu Meltdown patch and kernel update. Now, when your cluster allocates nodes, they get the new kernel but the old drivers.
@AlekseiPolkovnikov I will look into it on Monday and see how that can be done, as the cluster is already being used by some people.
We have implemented a workaround on our side so that nodes pick up the new drivers after a resize. So you can keep the cluster; just make sure that all your unusable nodes are removed.
@AlexanderYukhanov Seems like the workaround does not work
what is happening?
taking a look
Now it's a different issue - "Blob fuse mounting failed". Can you please check account name, key and container name?
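Before recreating anything, it can help to rule out obvious typos in those three values. The sketch below only checks format against the general Azure Storage naming rules (account names are 3 to 24 lowercase letters and digits; container names are 3 to 63 lowercase letters, digits and single hyphens, starting and ending with a letter or digit; account keys are base64 text). The function name and the sample values are hypothetical, and passing these checks does not prove that the account exists or that the key is valid.

```python
import base64
import binascii
import re

ACCOUNT_RE = re.compile(r"[a-z0-9]{3,24}")
# No consecutive hyphens; must start and end with a letter or digit.
CONTAINER_RE = re.compile(r"(?!.*--)[a-z0-9][a-z0-9-]{1,61}[a-z0-9]")


def check_mount_settings(account_name, account_key, container_name):
    """Return a list of format problems; an empty list means none were found."""
    problems = []
    if not ACCOUNT_RE.fullmatch(account_name):
        problems.append("account name: expected 3-24 lowercase letters or digits")
    try:
        if not account_key:
            raise ValueError("empty key")
        # validate=True rejects whitespace and other non-base64 characters.
        base64.b64decode(account_key, validate=True)
    except (binascii.Error, ValueError):
        problems.append("account key: not valid base64 (truncated or stray characters?)")
    if not CONTAINER_RE.fullmatch(container_name):
        problems.append("container name: expected 3-63 lowercase letters, digits or single hyphens")
    return problems


if __name__ == "__main__":
    # Hypothetical sample values, for illustration only.
    sample_key = base64.b64encode(b"not-a-real-key").decode()
    for problem in check_mount_settings("mystorage01", sample_key, "My_Container"):
        print(problem)
```

A stray space or newline copied along with the key is a common cause of this kind of failure, which is why the key is decoded in strict mode.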
Looking into it
@AlexanderYukhanov The Python API does not seem to allow me to update the mount settings. Do I need to delete and recreate the server again?
Yes, it's not possible to change mount settings after the cluster has been created.