GoogleCloudPlatform/cluster-toolkit
Cluster Toolkit is open-source software offered by Google Cloud that makes it easy for customers to deploy AI/ML and HPC environments on Google Cloud.
HCL, Apache-2.0
Issues
llama2-finetuning-slurm YAML blueprint: schedmd-slurm-gcp-v7-partition not found
#3149 opened by xibinliu - 5
Using a newer version of Terraform can lead to controller replacement on reconfigure for Slurm GCP v6
#2774 opened by nick-stroud - 1
Deploying latest nvidia driver cuda 12.4 breaks TCPX on A3 high from functioning correctly
#3032 opened by saltysoup - 5
Slurm accounting data not loading to BigQuery
#2989 opened by fdmalone - 1
Unable to SSH into Login Node Deployed with hpc-toolkit due to IAM Permissions Issue
#2950 opened by gouki510 - 3
slurm-gcp-v6-controller / pre-existing-network-storage - '$controller' not added to mounts
#2869 opened by scott-nag - 13
I have a regular cloudshell and I keep running out of disk space when using this repo (because of go deps?)
#2866 opened by srcc-chekh - 1
The apptainer example fails to deploy because it is using the `slurm-gcp-6-4-hpc-rocky-linux-8` `source_image_family`
#2802 opened by mr0re1 - 5
Rocky image failing due to 404 on lustre-client
#2733 opened by javierbq - 31
How to use image-builder.yaml to install a docker image to template VM
#1598 opened by noahharrison64 - 2
No CUDA devices visible with A2 instances
#2634 opened by msis - 4
Fail to consume shared reservations
#2548 opened by casassg - 16
PMIx MPI support in Slurm
#2274 opened by tpdownes - 6
Upgrade to Ops Agent fails
#2487 opened by Tristan-Kosciuch - 8
HTCondor tutorial: add cloudresourcemanager.googleapis.com to the list of services to enable
#2496 opened by katilp - 4
IP space of [gcp project subnet] is exhausted when deploying a GCP Slurm cluster
#2389 opened by fdmalone - 1
Broken link
#2261 opened by prashantkul - 2
Example of startup script with cluster without vm-instance?
#2202 opened by vsoch - 4
error when use packer to build image in ml-slurm
#1832 opened by higuhigu-lb - 2
SLURM 1.20 deployed and having node creation error
#1600 opened by sharif-cameco - 7
HPC toolkit no longer works with a2 instances
#1664 opened by cbraynor - 12
Cannot create worker node
#1581 opened by sharif-cameco - 3
Creating router and NAT in pre existing vpc
#1590 opened by sharif-cameco - 2
Give a short summary of changes on ghpc deploy/destroy
#1556 opened by yaroslavvb - 3
Adding partition causes the entire cluster to fail due to failures in `/slurm/scripts/setup.py`
#1554 opened by yaroslavvb - 4
ghpc deploy ends up in bad state when instance creation fails due to transient problem
#1536 opened by yaroslavvb - 2
User management best practices/examples
#1458 opened by jtrmal - 4
NFS server file system bug
#1388 opened by maxveliaminov - 2
Chrome-remote-desktop support
#1405 opened by rgclapp007 - 1
MPI arguments for MPI jobs
#1386 opened by ZLPG23 - 8
VM name needs updated
#1303 opened by jrossthomson - 5
TKFE Deployed Cluster fails to initialize slurm
#1185 opened by matthewc2003