slurm-gcp-v6-controller / pre-existing-network-storage - '$controller' not added to mounts
scott-nag opened this issue · 3 comments
Describe the bug
Module scripts located in community/modules/scheduler/schedmd-slurm-gcp-v6-controller/modules/slurm_files/scripts/
(develop branch)
I am creating a v6 cluster using pre-existing-network-storage with server_ip set to $controller in the blueprint. However, the startup script fails to mount the storage and times out:
[root@clusterb6b-controller ~]# tail -f /slurm/scripts/setup.log
run: ['create-munge-key', '-f']
run: ['systemctl', 'restart', 'munge']
Set up network storage
Temporary failure in name resolution, retrying in 1
Temporary failure in name resolution, retrying in 2
Temporary failure in name resolution, retrying in 4
Temporary failure in name resolution, retrying in 8
Temporary failure in name resolution, retrying in 16
Temporary failure in name resolution, retrying in 32
Temporary failure in name resolution, retrying in 64
Temporary failure in name resolution, retrying in 128
Temporary failure in name resolution, retrying in 256
[Errno -2] Name or service not known
Traceback (most recent call last):
  File "/slurm/scripts/setup.py", line 494, in <module>
    main()
  File "/slurm/scripts/setup.py", line 468, in main
    {
  File "/slurm/scripts/setup.py", line 335, in setup_controller
    setup_network_storage(log)
  File "/slurm/scripts/setup_network_storage.py", line 100, in setup_network_storage
    ext_mounts, int_mounts = separate_external_internal_mounts(all_mounts)
  File "/slurm/scripts/setup_network_storage.py", line 91, in separate_external_internal_mounts
    return separate(internal_mount, mounts)
  File "/slurm/scripts/util.py", line 698, in separate
    return reduce(lambda acc, el: acc[pred(el)].append(el) or acc, coll, ([], []))
  File "/slurm/scripts/util.py", line 698, in <lambda>
    return reduce(lambda acc, el: acc[pred(el)].append(el) or acc, coll, ([], []))
  File "/slurm/scripts/setup_network_storage.py", line 88, in internal_mount
    mount_addr = util.host_lookup(server_ip)
  File "/slurm/scripts/util.py", line 687, in wrapper
    raise captured_exc
  File "/slurm/scripts/util.py", line 680, in wrapper
    return f(*args, **kwargs)
  File "/slurm/scripts/util.py", line 1160, in host_lookup
    return socket.gethostbyname(host_name)
socket.gaierror: [Errno -2] Name or service not known
Aborting setup...
run: ['wall', '-n', '*** Slurm setup failed! Please view log: /slurm/scripts/setup.log ***']
*** Slurm setup failed! Please view log: /slurm/scripts/setup.log ***
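For reference, the resolver failure at the bottom of the traceback is easy to reproduce on any machine; the unsubstituted placeholder string is handed straight to the resolver. This is a minimal sketch, not part of the module:

# Minimal reproduction: the literal placeholder cannot be resolved.
import socket

try:
    socket.gethostbyname("$controller")
except socket.gaierror as exc:
    print(exc)  # typically: [Errno -2] Name or service not known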
I believe the server_ip in these storage mounts should contain the controller's hostname instead of the literal $controller, similar to how the second mount successfully shows cluster9f3-controller here? (A sketch of how the separation step trips over the placeholder follows the log excerpt below.)
Resolved network storage mounts: [{'fs_type': 'nfs', 'local_mount': '/home', 'mount_options': 'defaults,nofail,nosuid', 'remote_mount': '/home', 'server_ip': '$controller'}, {'server_ip': 'cluster9f3-controller', 'remote_mount': '/opt/apps', 'local_mount': '/opt/apps', 'fs_type': 'nfs', 'mount_options': 'defaults,hard,intr'}, {'fs_type': 'nfs', 'local_mount': '/opt/cluster', 'mount_options': 'defaults,nofail,nosuid', 'remote_mount': '/opt/cluster', 'server_ip': '$controller'}]
Separating external and internal mounts
Checking if mount is internal: {'fs_type': 'nfs', 'local_mount': '/home', 'mount_options': 'defaults,nofail,nosuid', 'remote_mount': '/home', 'server_ip': '$controller'}
Temporary failure in name resolution, retrying in 1
Temporary failure in name resolution, retrying in 2
Temporary failure in name resolution, retrying in 4
...
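For context, here is a rough sketch of what the classification step appears to do, reconstructed from the traceback above. Only the separate() reduce pattern is taken verbatim from util.py; the predicate body and the comparison against the controller address are my assumptions:

# Sketch only: assumed shape of the internal/external mount split.
import socket
from functools import reduce

def separate(pred, coll):
    # Pattern from util.py: partition coll into (pred-False, pred-True) lists.
    return reduce(lambda acc, el: acc[pred(el)].append(el) or acc, coll, ([], []))

def separate_external_internal_mounts(mounts, controller_addr):
    def internal_mount(mount):
        # host_lookup() ultimately calls socket.gethostbyname(); with a literal
        # "$controller" this raises socket.gaierror before anything is mounted.
        return socket.gethostbyname(mount["server_ip"]) == controller_addr
    return separate(internal_mount, mounts)

The point is that the predicate needs a resolvable server_ip, so the placeholder has to be substituted before this step runs.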
Steps to reproduce
- Create a VPC and subnet
- Create a cluster using the blueprint below
- Check the instance logs
Expected behavior
Storage should be mounted successfully and setup should not time out
Actual behavior
Setup times out during network storage mounting, as shown in the logs above
Version (gcluster --version)
gcluster version - not built from official release
Built from 'develop' branch.
Commit info: v1.37.1-167-g1d7dc338-dirty
Terraform version: 1.9.3 (also tested with Terraform 1.4)
Blueprint
blueprint_name: cluster-b6be43f5
vars:
  project_id: ofetest
  deployment_name: cluster-b6be43f5
  region: us-central1
  zone: us-central1-c
  enable_cleanup_compute: True
  enable_bigquery_load: False
  instance_image_custom: True
  labels:
    created_by: testofe-server
deployment_groups:
- group: primary
  modules:
  - source: modules/network/pre-existing-vpc
    kind: terraform
    settings:
      network_name: proper-hound-network
      subnetwork_name: proper-hound-subnet-2
    id: hpc_network
  - source: modules/file-system/pre-existing-network-storage
    kind: terraform
    id: mount_num_1
    settings:
      server_ip: '$controller'
      remote_mount: /opt/cluster
      local_mount: /opt/cluster
      mount_options: defaults,nofail,nosuid
      fs_type: nfs
  - source: modules/file-system/pre-existing-network-storage
    kind: terraform
    id: mount_num_2
    settings:
      server_ip: '$controller'
      remote_mount: /home
      local_mount: /home
      mount_options: defaults,nofail,nosuid
      fs_type: nfs
  - source: community/modules/project/service-account
    kind: terraform
    id: hpc_service_account
    settings:
      project_id: ofetest
      name: sa
      project_roles:
      - compute.instanceAdmin.v1
      - iam.serviceAccountUser
      - monitoring.metricWriter
      - logging.logWriter
      - storage.objectAdmin
      - pubsub.admin
      - compute.securityAdmin
      - iam.serviceAccountAdmin
      - resourcemanager.projectIamAdmin
      - compute.networkAdmin
  - source: community/modules/compute/schedmd-slurm-gcp-v6-partition
    kind: terraform
    id: partition_1
    use:
    - partition_1-nodeset
    settings:
      partition_name: batch
      exclusive: True
      resume_timeout: 500
  - source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    id: partition_1-nodeset
    use:
    - mount_num_1
    - mount_num_2
    settings:
      bandwidth_tier: platform_default
      subnetwork_self_link: "projects/ofetest/regions/us-central1/subnetworks/proper-hound-subnet-2"
      enable_smt: False
      enable_placement: False
      machine_type: c2-standard-4
      node_count_dynamic_max: 1
      node_count_static: 0
      disk_size_gb: 50
      disk_type: pd-standard
  - source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
    kind: terraform
    id: slurm_controller
    settings:
      cloud_parameters:
        resume_rate: 0
        resume_timeout: 500
        suspend_rate: 0
        suspend_timeout: 300
        no_comma_params: false
      machine_type: n2-standard-2
      disk_type: pd-standard
      disk_size_gb: 120
      service_account_email: $(hpc_service_account.service_account_email)
      service_account_scopes:
      - https://www.googleapis.com/auth/cloud-platform
      - https://www.googleapis.com/auth/monitoring.write
      - https://www.googleapis.com/auth/logging.write
      - https://www.googleapis.com/auth/devstorage.read_write
      - https://www.googleapis.com/auth/pubsub
      controller_startup_script: |
        #!/bin/bash
        echo "******************************************** CALLING CONTROLLER STARTUP"
      compute_startup_script: |
        #!/bin/bash
        echo "******************************************** CALLING COMPUTE STARTUP"
      login_startup_script: |
        #!/bin/bash
        echo "******************************************** CALLING LOGIN STARTUP"
    use:
    - slurm_login
    - hpc_network
    - partition_1
    - mount_num_1
    - mount_num_2
  - source: community/modules/scheduler/schedmd-slurm-gcp-v6-login
    kind: terraform
    id: slurm_login
    settings:
      num_instances: 1
      subnetwork_self_link: "projects/ofetest/regions/us-central1/subnetworks/proper-hound-subnet-2"
      machine_type: n2-standard-2
      disk_type: pd-standard
      disk_size_gb: 120
      service_account_email: $(hpc_service_account.service_account_email)
      service_account_scopes:
      - https://www.googleapis.com/auth/cloud-platform
      - https://www.googleapis.com/auth/monitoring.write
      - https://www.googleapis.com/auth/logging.write
      - https://www.googleapis.com/auth/devstorage.read_write
Output and logs
N/A - the blueprint deploys successfully
Execution environment
- OS: Rocky 8.5
- Shell: bash
- go version: go1.22.5
- Terraform: both 1.4 and 1.9
Other info
I have added a quick fix to the resolve_network_storage
function that is located in setup_network_storage.py
(the "for mount in mounts.values()" loop) as I noticed similar logic relating to $controller
in util.py
def resolve_network_storage(nodeset=None):
    """Combine appropriate network_storage fields to a single list"""
    if lkp.instance_role == "compute":
        try:
            nodeset = lkp.node_nodeset()
        except Exception:
            # External nodename, skip lookup
            nodeset = None

    # seed mounts with the default controller mounts
    if cfg.disable_default_mounts:
        default_mounts = []
    else:
        default_mounts = [
            NSDict(
                {
                    "server_ip": lkp.control_addr or lkp.control_host,
                    "remote_mount": str(path),
                    "local_mount": str(path),
                    "fs_type": "nfs",
                    "mount_options": "defaults,hard,intr",
                }
            )
            for path in (
                dirs.home,
                dirs.apps,
            )
        ]

    # create dict of mounts, local_mount: mount_info
    mounts = mounts_by_local(default_mounts)

    # On non-controller instances, entries in network_storage could overwrite
    # default exports from the controller. Be careful, of course
    mounts.update(mounts_by_local(cfg.network_storage))
    if lkp.instance_role in ("login", "controller"):
        mounts.update(mounts_by_local(cfg.login_network_storage))

    if nodeset is not None:
        mounts.update(mounts_by_local(nodeset.network_storage))

    # Replace $controller with the actual hostname in all mounts
    for mount in mounts.values():
        if mount['server_ip'] == '$controller':
            mount['server_ip'] = cfg.slurm_control_host

    return list(mounts.values())
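A possible refinement (an assumption on my part, not tested beyond the run below): fall back to the same values used to seed the default mounts in case cfg.slurm_control_host is ever unset:

# Hypothetical variant of the replacement loop above; falls back to the
# control address/host used when seeding the default mounts.
controller_host = cfg.slurm_control_host or lkp.control_addr or lkp.control_host
for mount in mounts.values():
    if mount["server_ip"] == "$controller":
        mount["server_ip"] = controller_host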
With the quick fix in place, the startup scripts run successfully and the login and controller nodes come online:
Setting up network storage
Resolving network storage
Resolved network storage mounts: [{'fs_type': 'nfs', 'local_mount': '/home', 'mount_options': 'defaults,nofail,nosuid', 'remote_mount': '/home', 'server_ip': 'clusterad2-controller'}, {'server_ip': 'clusterad2-controller', 'remote_mount': '/opt/apps', 'local_mount': '/opt/apps', 'fs_type': 'nfs', 'mount_options': 'defaults,hard,intr'}, {'fs_type': 'nfs', 'local_mount': '/opt/cluster', 'mount_options': 'defaults,nofail,nosuid', 'remote_mount': '/opt/cluster', 'server_ip': 'clusterad2-controller'}]
External mounts: [], Internal mounts: [{'fs_type': 'nfs', 'local_mount': '/home', 'mount_options': 'defaults,nofail,nosuid', 'remote_mount': '/home', 'server_ip': 'clusterad2-controller'}, {'server_ip': 'clusterad2-controller', 'remote_mount': '/opt/apps', 'local_mount': '/opt/apps', 'fs_type': 'nfs', 'mount_options': 'defaults,hard,intr'}, {'fs_type': 'nfs', 'local_mount': '/opt/cluster', 'mount_options': 'defaults,nofail,nosuid', 'remote_mount': '/opt/cluster', 'server_ip': 'clusterad2-controller'}]
Instance is controller, using external mounts
Creating backup of fstab
Restoring fstab from backup
Mounting fstab entries
Handling munge mount
About to run custom scripts
Determined custom script directories: [PosixPath('/slurm/custom_scripts/controller.d')]
Collected custom scripts: [PosixPath('/slurm/custom_scripts/controller.d/ghpc_startup.sh')]
Custom scripts to run: /slurm/custom_scripts/(controller.d/ghpc_startup.sh)
Processing script: /slurm/custom_scripts/controller.d/ghpc_startup.sh
Running script ghpc_startup.sh with timeout=300
run: /slurm/custom_scripts/controller.d/ghpc_startup.sh
ghpc_startup.sh returncode=0
stdout=******************************************** CALLING CONTROLLER STARTUP
This is the startup script for the controller on cluster 3
Unfortunately, Slurm still isn't configured correctly, as shown below, so $controller is possibly not being replaced elsewhere in the module as well. Happy to provide more info if required.
[root@clusterad2-controller ~]# srun -p batch hostname
srun: Required node not available (down, drained or reserved)
...
[root@clusterad2-controller ~]# cat /var/log/slurm/slurmdbd.log
...
[2024-08-05T14:38:07.093] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2024-08-05T14:38:07.093] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 2 partitions
[2024-08-05T14:39:06.000] SchedulerParameters=bf_continue,salloc_wait_nodes,ignore_prefer_validation
[2024-08-05T14:42:08.506] sched: _slurm_rpc_allocate_resources JobId=1 NodeList=clusterad2-partition1node-0 usec=2040
[2024-08-05T14:42:09.769] _update_job: setting admin_comment to GCP Error: Permission denied on locations/{} (or it may not exist). for JobId=1
[2024-08-05T14:42:09.769] _slurm_rpc_update_job: complete JobId=1 uid=981 usec=112
[2024-08-05T14:42:09.780] update_node: node clusterad2-partition1node-0 reason set to: GCP Error: Permission denied on locations/{} (or it may not exist).
[2024-08-05T14:42:09.780] Killing JobId=1 on failed node clusterad2-partition1node-0
[2024-08-05T14:42:09.780] update_node: node clusterad2-partition1node-0 state set to DOWN
Hi @scott-nag, thank you for reporting!
To be fixed by GoogleCloudPlatform/slurm-gcp#194
This is working perfectly now, thank you for the quick fix!
GoogleCloudPlatform/slurm-gcp#194 is included in the latest release.