arcus-azure/arcus.azureml

CUDA warning with GPU instance - No GPU usage

Closed this issue · 0 comments

Describe the bug
When launching a training run on an Azure GPU compute instance, the driver log shows the following warning:

2021-02-11 09:16:39.618705: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/intel/compilers_and_libraries_2018.3.222/linux/mpi/intel64/lib:/opt/intel/compilers_and_libraries_2018.3.222/linux/mpi/mic/lib:/opt/intel/compilers_and_libraries_2018.3.222/linux/mpi/intel64/lib:/opt/intel/compilers_and_libraries_2018.3.222/linux/mpi/mic/lib:/azureml-envs/azureml_29a6b4567f8800f7805e1039ee4701fc/lib:
2021-02-11 09:16:39.618798: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.

It seems that libcudart.so.11.0 is either missing or not on the library search path. As a result, TensorFlow falls back to the CPU, and training a CNN takes a very long time.
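To narrow down whether the library is missing or just not on the path, a small check like the one below could be run inside the training environment. It is only a sketch: the helper name find_shared_library is made up for illustration, and it only inspects the directories in LD_LIBRARY_PATH (the ones printed in the warning), not the system default locations that the dynamic loader also searches.

```python
import os


def find_shared_library(name, search_path=None):
    """Return the directories from search_path that contain a file called name.

    search_path is a string of directories joined by os.pathsep (':' on Linux),
    in the same format as LD_LIBRARY_PATH. It defaults to the current
    LD_LIBRARY_PATH.
    This is a hypothetical helper, not part of arcus or TensorFlow.
    """
    if search_path is None:
        search_path = os.environ.get("LD_LIBRARY_PATH", "")
    hits = []
    for directory in search_path.split(os.pathsep):
        # Skip empty entries, e.g. the one produced by a trailing ':'
        if directory and os.path.isfile(os.path.join(directory, name)):
            hits.append(directory)
    return hits


if __name__ == "__main__":
    # Library name taken from the warning in the driver log
    print(find_shared_library("libcudart.so.11.0"))
```

An empty list would match the situation in the warning: none of the listed directories contains the CUDA runtime library that TensorFlow is trying to load.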

To Reproduce
Launch a TensorFlow/Keras training script through arcus, targeting a GPU compute instance on Azure, for example:

from arcus.azureml.environment.aml_environment import AzureMLEnvironment

work_env = AzureMLEnvironment.Create(config_file="../.azureml/config.json")

training_name = 'your_training_name'
trainer = work_env.start_experiment(training_name)
trainer.setup_training(training_name, overwrite=False)

dataset_name = 'your_dataset_name'

arguments = {
    '--epochs': 75,
    '--batch_size': 256,
    '--es_patience': 20,
    '--train_test_split_ratio': 0.08
}
trainer.start_training(training_name, estimator_type='tensorflow',
                       input_datasets=[dataset_name],
                       compute_target='your_instance', gpu_compute=True,
                       script_parameters=arguments)

Expected behavior
The warning should not appear, since the compute instance (CI) has a GPU installed, and training should run on the GPU and be fast.