microsoftarchive/BatchAI

nvidiaConfigFail: NVIDIA GPU configuration failed unexpectedly

giventocode opened this issue · 5 comments

I am trying to run the Python CNTK distributed GPU recipe, and the job is failing with the following error:

Job state: queued ExitCode: None
Cluster state: AllocationState.steady Target: 2; Allocated: 2; Idle: 0; Unusable: 2; Running: 0; Preparing: 0; Leaving: 0
Cluster error: nvidiaConfigFail: NVIDIA GPU configuration failed unexpectedly
Details:
Reason: Failed to install nvidia-docker
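
For context, this is roughly how I'm checking the cluster state and errors (a minimal sketch against the azure-mgmt-batchai Python SDK; the credential values and resource names are placeholders, and the exact client signatures may differ between SDK versions):

from azure.common.credentials import ServicePrincipalCredentials
from azure.mgmt.batchai import BatchAIManagementClient

# all of these values are placeholders for my own subscription
credentials = ServicePrincipalCredentials(client_id='...', secret='...', tenant='...')
client = BatchAIManagementClient(credentials, 'my-subscription-id')

cluster = client.clusters.get('my-resource-group', 'nc6-cluster')
counts = cluster.node_state_counts
print('Cluster state:', cluster.allocation_state,
      'Idle:', counts.idle_node_count,
      'Unusable:', counts.unusable_node_count)

# cluster.errors is where nvidiaConfigFail shows up
for error in cluster.errors or []:
    print('Cluster error: {0}: {1}'.format(error.code, error.message))
    for detail in error.details or []:
        print('  {0}: {1}'.format(detail.name, detail.value))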

Any suggestion?

I'm having a similar issue when creating a cluster. My nodes are unusable.

I checked the startup error log at /mnt/batch/tasks/startup/error.json and it says:

{"Code":"nvidiaConfigFail","Message":"NVIDIA GPU configuration failed unexpectedly","Category":"InternalError","ExitCode":1,"Details":[{"Key":"Reason","Value":"Failed to install nvidia-docker"}]}

Similarly, from /mnt/batch/tasks/startup/stderr.txt:

(...)
2018/04/03 00:40:50 install nvidia-docker
dpkg: dependency problems prevent configuration of nvidia-docker:
 nvidia-docker2 (2.0.2+docker17.12.0-1) breaks nvidia-docker and is installed.

dpkg: error processing package nvidia-docker (--install):
 dependency problems - leaving unconfigured
Errors were encountered while processing:
 nvidia-docker

Sounds like the same issue: the image already has nvidia-docker2 installed, which declares that it breaks the old nvidia-docker package the startup script tries to install, so dpkg refuses to configure it. Still looking for suggestions.

Hi, the issue was related to the new version of the DSVM image. We are rolling out a fix for this.

The fix is out.
Sorry for the inconvenience.

Can anyone confirm this is actually resolved? If so what do I have to do to fix my cluster?

@CameronVetter Please recreate your Batch AI cluster to pick up the fix.
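
For anyone doing this from the Python SDK, it is just a delete followed by re-running your original cluster creation code; a rough sketch, assuming the same placeholder client and names as in the snippet above and that your SDK version uses the two-argument delete signature:

# wait for the broken cluster to be fully deleted, then re-run the
# recipe's cluster creation step so the new nodes provision against the patched image
client.clusters.delete('my-resource-group', 'nc6-cluster').result()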