CUDA warning with GPU instance - No GPU usage
ALaks96 commented
Describe the bug
When launching training on an Azure GPU compute instance, the driver log shows the following warning/error:
2021-02-11 09:16:39.618705: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/intel/compilers_and_libraries_2018.3.222/linux/mpi/intel64/lib:/opt/intel/compilers_and_libraries_2018.3.222/linux/mpi/mic/lib:/opt/intel/compilers_and_libraries_2018.3.222/linux/mpi/intel64/lib:/opt/intel/compilers_and_libraries_2018.3.222/linux/mpi/mic/lib:/azureml-envs/azureml_29a6b4567f8800f7805e1039ee4701fc/lib:
2021-02-11 09:16:39.618798: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
It seems libcudart is either missing or cannot be found on the library path. As a result, TensorFlow silently falls back to CPU, and training a CNN takes very long.
To Reproduce
Launch a training script using tensorflow/keras via arcus, targeting a GPU compute instance on Azure, for example:
from arcus.azureml.environment.aml_environment import AzureMLEnvironment
work_env = AzureMLEnvironment.Create(config_file="../.azureml/config.json")
training_name = 'your_training_name'
trainer = work_env.start_experiment(training_name)
trainer.setup_training(training_name, overwrite=False)
dataset_name = 'your_dataset_name'
arguments = {
    '--epochs': 75,
    '--batch_size': 256,
    '--es_patience': 20,
    '--train_test_split_ratio': 0.08
}
trainer.start_training(training_name, estimator_type='tensorflow',
                       input_datasets=[dataset_name],
                       compute_target='your_instance', gpu_compute=True,
                       script_parameters=arguments)
Expected behavior
The warning should not appear, given that a GPU is installed on the compute instance, and training should run on the GPU at full speed.
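After fixing the environment (e.g. installing the CUDA 11.0 runtime or adding its lib directory to LD_LIBRARY_PATH), one can verify the fix by trying to dlopen the library the same way TensorFlow's dso_loader does. This is a hedged sketch, not an official diagnostic:

```python
import ctypes

def cuda_runtime_loadable(libname="libcudart.so.11.0"):
    """Try to dlopen the CUDA runtime, mirroring TensorFlow's dso_loader."""
    try:
        ctypes.CDLL(libname)
        return True
    except OSError:
        # OSError here corresponds to the dlerror shown in the log above.
        return False

print(cuda_runtime_loadable())
```

If this returns True but TensorFlow still reports no GPU, the problem lies elsewhere (e.g. a CUDA/driver version mismatch) rather than library discovery.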