No way to diagnose 'Unusable' cluster node
damienpontifex opened this issue · 4 comments
I have been trying to create a cluster with the azure cli and a JSON config file. The resulting cluster has the node as 'Unusable'.
I understand this can happen, but is there a way to diagnose why? There doesn't seem to be any log or reason in the portal, and the CLI gives no further detail beyond az batchai cluster list -o table, which lists it as 1 unusable node.
My process of setup is predominately following Working with models for machine learning and Azure Batch AI - BRK4034. My config and cli usage is:
nc6config.json
{
"location": "westus2",
"properties": {
"vmSize": "STANDARD_NC6",
"scaleSettings": {
"manual": {
"targetNodeCount": 1
}
},
"virtualMachineConfiguration": {
"imageReference": {
"offer": "linux-data-science-vm-ubuntu",
"publisher": "microsoft-ads",
"sku": "linuxdsvmubuntu",
"version": "latest"
}
},
"nodeSetup": {
"mountVolumes": {
"azureFileShares": [
{
"accountName": "__AZURE_STORAGE_ACCOUNT__",
"azureFileUrl": "https://__AZURE_STORAGE_ACCOUNT__.file.core.windows.net/data",
"credentialsInfo": {
"accountKey": "__AZURE_STORAGE_KEY__"
},
"relativeMountPath": "azfiles"
}
]
}
},
"userAccountSettings": {
"adminUserName": "damien",
"adminUserSshPublicKey": "<my-public-key>"
}
}
}
CLI
RG_NAME=batch-rg
LOCATION=westus2
export AZURE_BATCHAI_STORAGE_ACCOUNT=pontifexml
az group create \
-l $LOCATION \
-n $RG_NAME
az storage account create \
-g $RG_NAME \
-n $AZURE_BATCHAI_STORAGE_ACCOUNT \
--sku Standard_LRS \
-l $LOCATION
export AZURE_BATCHAI_STORAGE_KEY=$(az storage account keys list --account-name $AZURE_BATCHAI_STORAGE_ACCOUNT --resource-group $RG_NAME | head -n1 | awk '{print $3}')
az batchai cluster create \
-g $RG_NAME \
-n dsvm \
-c nc6config.json
Hello Damien,
Thank you for reporting the issue. The video you used was created based on a private preview version of Batch AI. For the release we had to make our CLI extension consistent with the others, and we changed the way placeholders are defined. If you really need to use them (there are simpler ways to mount AFS and blobs now), specify:
"accountName": "<AZURE_BATCHAI_STORAGE_ACCOUNT>",
"azureFileUrl": "https://<AZURE_BATCHAI_STORAGE_ACCOUNT>.file.core.windows.net/share",
"credentialsInfo": {
"accountKey": "<AZURE_BATCHAI_STORAGE_KEY>"
},
(take a look at, for example, https://github.com/Azure/azure-cli/blob/dev/src/command_modules/azure-cli-batchai/azure/cli/command_modules/batchai/tests/data/cluster_with_azure_files.json).
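To check whether a config file still contains placeholders in the older double-underscore style, something like the following could be used. This is only a sketch: find_old_placeholders is a hypothetical helper name, and the __NAME__ pattern is inferred from the config posted in the question, not from the CLI source.

```shell
# Hypothetical helper: list old-style __NAME__ placeholders found in a config file.
# Prints each distinct placeholder once; prints nothing if none are found.
find_old_placeholders() {
  { grep -o '__[A-Z_]*[A-Z]__' "$1" || true; } | sort -u
}
```

For example, running find_old_placeholders nc6config.json against the config from the question would be expected to report __AZURE_STORAGE_ACCOUNT__ and __AZURE_STORAGE_KEY__, which are the values to rewrite in the angle-bracket form.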
In your case cluster creation is very simple and doesn't require a configuration file at all (please take a look at our recipes, e.g. https://github.com/Azure/BatchAI/blob/master/recipes/TensorFlow/TensorFlow-GPU-Distributed/cli-instructions.md):
$ az batchai cluster create -l eastus -g batch-rg -s Standard_NC6 -i UbuntuDSVM --min 1 --max 1 --storage-account-name pontifexml --afs-name share -u $USER -k ~/.ssh/id_rsa.pub
Returning to the main question: unfortunately, we do not have a simple way to debug such issues in the public preview. We will definitely fix this before GA. For now, the only way to debug is to get the startup task failure code directly from the node:
- get the ssh information for the cluster:
$ az batchai cluster list-nodes -g <resource-group> -n <cluster-name>
- get the error.json:
$ ssh -p <port> <user>@<node-ip> cat /mnt/batch/tasks/startup/error.json
In your case it should be something like:
{"Code":"AZFMountError","Message":"unable to mount Azure File","Category":"UserError","ExitCode":1,"Details":[{"Key":"azureFileURL","Value":"//AZURE_STORAGE_ACCOUNT.file.core.windows.net/one"}]}
- you can also take a look at the complete startup logs on the node:
$ ssh -p <port> <user>@<node-ip> cat /mnt/batch/tasks/startup/stderr.txt
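Once error.json has been fetched, the two fields that matter most can be pulled out without extra tooling. The following is a minimal sketch, assuming the file is single-line JSON with the "Code" and "Message" keys shown in the sample above; summarize_startup_error is a hypothetical helper name, and a real JSON parser would be more robust.

```shell
# Hypothetical helper: read a startup error.json on stdin and print "Code: Message".
# Assumes single-line JSON with string-valued "Code" and "Message" keys.
summarize_startup_error() {
  local json code message
  json=$(cat)
  code=$(printf '%s\n' "$json" | sed -n 's/.*"Code":"\([^"]*\)".*/\1/p')
  message=$(printf '%s\n' "$json" | sed -n 's/.*"Message":"\([^"]*\)".*/\1/p')
  printf '%s: %s\n' "${code:-unknown}" "${message:-no message}"
}
```

It can be fed directly from the ssh command, for instance by piping the output of cat /mnt/batch/tasks/startup/error.json on the node into summarize_startup_error.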
Thanks,
Alex
Thanks @AlexanderYukhanov that was everything I needed.
One thing I seem to be missing from the CLI vs config files is the vmPriority property. Is it possible to configure the cluster with low priority VMs using the CLI? Or would it be more appropriate to create an issue on the az cli repo for this?
It's possible right now, just create a simple cluster.json with the following content:
{
"vmPriority": "lowpriority"
}
and create the cluster with the following command line:
$ az batchai cluster create -l eastus -g batch-rg -s Standard_NC6 -i UbuntuDSVM --min 1 --max 1 --storage-account-name pontifexml --afs-name share -u $USER -k ~/.ssh/id_rsa.pub -c cluster.json
We will also add a --low-priority option to the command line in the next release (in about 2 weeks).
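If the cluster is created from a script, the small config file can be generated on the fly. This is just a sketch: write_low_priority_config is a hypothetical helper name, and the file content is exactly the three-line cluster.json shown above.

```shell
# Hypothetical helper: write the minimal low-priority cluster config to the given path.
write_low_priority_config() {
  cat > "$1" <<'EOF'
{
    "vmPriority": "lowpriority"
}
EOF
}
```

Calling write_low_priority_config cluster.json before az batchai cluster create, and passing -c cluster.json as in the command above, keeps the rest of the settings on the command line.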
Yep, this is the same conclusion I came to. Good to see it'll be making its way into the CLI without a config file.
Thanks