No CUDA devices visible with A2 instances
Closed this issue · 2 comments
msis commented
Describe the bug
Nodes launched with a modified version of ./examples/ml_slurm.yaml
do not seem to see any GPUs via CUDA.
Steps to reproduce
Steps to reproduce the behavior:
- create and deploy a cluster with the blueprint ml_slurm_a100.yaml below (deployment commands sketched after this list)
- SSH to the login node
- start an a2 instance:
  srun --partition a10040g1gpu --pty bash -i
  conda activate pytorch
  nvidia-smi
  or, in a Python console: import torch; torch.cuda.is_available()
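For reference, creating and deploying the cluster follows the standard Toolkit workflow (a sketch; the project ID placeholder and the deployment folder name, taken from deployment_name in the blueprint, are assumptions):
$ ghpc create ml_slurm_a100.yaml --vars project_id=<your-project>
$ ghpc deploy ml-training-v6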
Expected behavior
nvidia-smi should list the available GPUs, and torch.cuda.is_available() should return True.
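For instance, a quick check from inside the allocated session (the one-liner form is illustrative):
$ python -c "import torch; print(torch.cuda.is_available())"
True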
Actual behavior
$ nvidia-smi
No devices were found
$ python
Python 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
False
Version (ghpc --version)
$ ghpc --version
ghpc version v1.34.0
Built from 'main' branch.
Commit info: v1.34.0-0-g5b360ae6
Blueprint
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
---
blueprint_name: ml-slurm-v6

vars:
  project_id: ## Set project id here
  deployment_name: ml-training-v6
  region: us-central1
  zone: us-central1-a
  new_image:
    family: ml-training
    project: $(vars.project_id)
  disk_size_gb: 32
  enable_cleanup_compute: true

# Recommended to use GCS backend for Terraform state
# See https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/main/examples#optional-setting-up-a-remote-terraform-state
#
# terraform_backend_defaults:
#   type: gcs
#   configuration:
#     bucket: <<BUCKET_NAME>>

deployment_groups:
- group: primary
  modules:
  # Source is an embedded module, denoted by "modules/*" without ./, ../, /
  # as a prefix. To refer to a local module, prefix with ./, ../ or /
  # Example - ./modules/network/vpc
  - id: network
    source: modules/network/vpc

  - id: homefs
    source: modules/file-system/filestore
    use:
    - network
    settings:
      local_mount: /home
      # size_gb: 2560
      # filestore_tier: BASIC_SSD

  - id: script
    source: modules/scripts/startup-script
    settings:
      runners:
      - type: shell
        destination: install-ml-libraries.sh
        content: |
          #!/bin/bash
          # this script is designed to execute on Slurm images published by SchedMD that:
          # - are based on Debian 11 distribution of Linux
          # - have NVIDIA Drivers v530 pre-installed
          # - have CUDA Toolkit 12.1 pre-installed.
          set -e -o pipefail

          echo "deb https://packages.cloud.google.com/apt google-fast-socket main" > /etc/apt/sources.list.d/google-fast-socket.list
          apt-get update --allow-releaseinfo-change
          apt-get install --assume-yes google-fast-socket

          CONDA_BASE=/opt/conda
          if [ -d $CONDA_BASE ]; then
            exit 0
          fi

          DL_DIR=\$(mktemp -d)
          cd $DL_DIR
          curl -O https://repo.anaconda.com/miniconda/Miniconda3-py310_23.3.1-0-Linux-x86_64.sh
          HOME=$DL_DIR bash Miniconda3-py310_23.3.1-0-Linux-x86_64.sh -b -p $CONDA_BASE
          cd -
          rm -rf $DL_DIR
          unset DL_DIR

          source $CONDA_BASE/bin/activate base
          conda init --system
          conda config --system --set auto_activate_base False
          # following channel ordering is important! use strict_priority!
          conda config --system --set channel_priority strict
          conda config --system --remove channels defaults
          conda config --system --add channels conda-forge
          conda config --system --add channels nvidia
          conda update -n base conda --yes

          ### create a virtual environment for pytorch
          conda create -n pytorch python=3.10 --yes
          conda activate pytorch
          conda config --env --add channels pytorch
          conda install -n pytorch pytorch torchvision torchaudio pytorch-cuda=12.1 --yes
          pip install -q Cython

- group: packer
  modules:
  - id: custom-image
    source: modules/packer/custom-image
    kind: packer
    use:
    - network
    - script
    settings:
      # give VM a public IP to ensure startup script can reach public internet
      # w/o new VPC
      omit_external_ip: false
      source_image_project_id: [schedmd-slurm-public]
      # see latest in https://github.com/GoogleCloudPlatform/slurm-gcp/blob/master/docs/images.md#published-image-family
      source_image_family: slurm-gcp-6-5-debian-11
      # You can find size of source image by using following command
      # gcloud compute images describe-from-family <source_image_family> --project schedmd-slurm-public
      disk_size: $(vars.disk_size_gb)
      image_family: $(vars.new_image.family)
      # building this image does not require a GPU-enabled VM
      machine_type: n2-standard-4
      state_timeout: 15m

- group: cluster
  modules:
  - id: examples
    source: modules/scripts/startup-script
    settings:
      runners:
      - type: data
        destination: /var/tmp/torch_test.sh
        content: |
          #!/bin/bash
          source /etc/profile.d/conda.sh
          conda activate pytorch
          python3 torch_test.py
      - type: data
        destination: /var/tmp/torch_test.py
        content: |
          import torch
          import torch.utils.benchmark as benchmark

          def batched_dot_mul_sum(a, b):
              '''Computes batched dot by multiplying and summing'''
              return a.mul(b).sum(-1)

          def batched_dot_bmm(a, b):
              '''Computes batched dot by reducing to bmm'''
              a = a.reshape(-1, 1, a.shape[-1])
              b = b.reshape(-1, b.shape[-1], 1)
              return torch.bmm(a, b).flatten(-3)

          # use GPU if available, else CPU
          device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
          print('Using device:', device)
          if device.type == 'cuda':
              print(torch.cuda.get_device_name(0))

          # benchmarking
          x = torch.randn(10000, 64)
          t0 = benchmark.Timer(
              stmt='batched_dot_mul_sum(x, x)',
              setup='from __main__ import batched_dot_mul_sum',
              globals={'x': x})
          t1 = benchmark.Timer(
              stmt='batched_dot_bmm(x, x)',
              setup='from __main__ import batched_dot_bmm',
              globals={'x': x})
          print(t0.timeit(100))
          print(t1.timeit(100))

  - id: a100_40g_1_nodeset
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use: [network]
    settings:
      node_count_dynamic_max: 4
      bandwidth_tier: gvnic_enabled
      machine_type: a2-highgpu-1g
      instance_image: $(vars.new_image)
      instance_image_custom: true
      preemptible: true

  - id: a100_40g_1_partition
    source: community/modules/compute/schedmd-slurm-gcp-v6-partition
    use: [a100_40g_1_nodeset]
    settings:
      partition_name: a10040g1gpu
      is_default: true

  - id: a100_40g_4_nodeset
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use: [network]
    settings:
      node_count_dynamic_max: 4
      bandwidth_tier: gvnic_enabled
      machine_type: a2-highgpu-4g
      instance_image: $(vars.new_image)
      instance_image_custom: true
      preemptible: true

  - id: a100_40g_4_partition
    source: community/modules/compute/schedmd-slurm-gcp-v6-partition
    use: [a100_40g_4_nodeset]
    settings:
      partition_name: a10040g4gpu
      is_default: true

  - id: a100_40g_8_nodeset
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use: [network]
    settings:
      node_count_dynamic_max: 4
      bandwidth_tier: gvnic_enabled
      machine_type: a2-highgpu-8g
      instance_image: $(vars.new_image)
      instance_image_custom: true
      preemptible: true

  - id: a100_40g_8_partition
    source: community/modules/compute/schedmd-slurm-gcp-v6-partition
    use: [a100_40g_8_nodeset]
    settings:
      partition_name: a10040g8gpu
      is_default: true

  - id: a100_40g_16_nodeset
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use: [network]
    settings:
      node_count_dynamic_max: 4
      bandwidth_tier: gvnic_enabled
      machine_type: a2-megagpu-16g
      instance_image: $(vars.new_image)
      instance_image_custom: true
      preemptible: true

  - id: a100_40g_16_partition
    source: community/modules/compute/schedmd-slurm-gcp-v6-partition
    use: [a100_40g_16_nodeset]
    settings:
      partition_name: a10040g16gpu
      is_default: true

  - id: g2_nodeset
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use: [network]
    settings:
      node_count_dynamic_max: 20
      enable_placement: false
      bandwidth_tier: gvnic_enabled
      machine_type: g2-standard-4
      instance_image: $(vars.new_image)
      instance_image_custom: true

  - id: g2_partition
    source: community/modules/compute/schedmd-slurm-gcp-v6-partition
    use: [g2_nodeset]
    settings:
      partition_name: g2
      exclusive: false

  - id: slurm_login
    source: community/modules/scheduler/schedmd-slurm-gcp-v6-login
    use: [network]
    settings:
      machine_type: n2-standard-4
      name_prefix: "login"
      enable_login_public_ips: true
      instance_image: $(vars.new_image)
      instance_image_custom: true

  - id: slurm_controller
    source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
    use:
    - network
    - a100_40g_1_partition
    - a100_40g_4_partition
    - a100_40g_8_partition
    - a100_40g_16_partition
    - g2_partition
    - homefs
    - slurm_login
    settings:
      machine_type: n2-standard-4
      enable_controller_public_ips: true
      instance_image: $(vars.new_image)
      instance_image_custom: true
      login_startup_script: $(examples.startup_script)
harshthakkar01 commented
Hi,
Can you try specifying --gpus=X or --gpus-per-node=Y with the srun command when you start the a2 instance?
You can find the reference here: https://slurm.schedmd.com/srun.html#OPT_gpus
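For example, on the single-GPU partition from the blueprint above (the GPU count here is illustrative and should match the machine type):
$ srun --partition a10040g1gpu --gpus-per-node=1 --pty bash -i
$ nvidia-smi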
msis commented
That solves it.
I thought that, because of the instance type, there was no need to request the GPU explicitly.
I can confirm that setting --gpus (or --gres) does the job and the GPUs are visible.
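For reference, the equivalent request expressed with generic resources (a sketch; the gpu:1 count is illustrative):
$ srun --partition a10040g1gpu --gres=gpu:1 --pty bash -i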