GoogleCloudPlatform/cluster-toolkit

Unable to configure Slurm due to failure to mount filestore

dmontielg opened this issue · 5 comments

Hi,

Thanks a lot for hpc-toolkit.

We are having some issues running the Slurm job presented here:
https://cloud.google.com/hpc-toolkit/docs/quickstarts/slurm-cluster

Our internal policies restrict the use of external IPs, and shared VPCs are already in place.
We therefore made the following modifications to the quickstart example:
set disable_controller_public_ips and disable_login_public_ips to true,
and used a pre-existing VPC network:
'''
  - id: network1
    source: modules/network/pre-existing-vpc
    settings:
      network_name: default
      subnetwork_name: default
'''

The whole hpc-slurm.yaml can be found here: https://surfdrive.surf.nl/files/index.php/s/8CW6A47UfI0MagV

The error that we get is that compute engine instances are unable to connect to filestore. When connecting to the login node, the notification that slurm is currently being configured is shown. When attempting to run srun, the following error is displayed:

$ srun
srun: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
srun: error: fetch_config: DNS SRV lookup failed
srun: error: _establish_config_source: failed to fetch config
srun: fatal: Could not establish a configuration source

Cloud Logging shows occasional errors that the Compute Engine instance cannot connect to the Filestore instance. However, Filestore does mount when the mount is performed manually on the login instance.

Oct 24 14:30:22 hpcsmall-login-yuznj5e9-001 systemd: Dependency failed for Remote File Systems.
Oct 24 14:30:22 hpcsmall-login-yuznj5e9-001 systemd: Job remote-fs.target/start failed with result 'dependency'.
Oct 24 14:30:22 hpcsmall-login-yuznj5e9-001 systemd: Unit home.mount entered failed state.
Oct 24 14:30:22 hpcsmall-login-yuznj5e9-001 systemd: Failed to mount /home.
Oct 24 14:30:22 hpcsmall-login-yuznj5e9-001 systemd: Failed to mount /opt/apps.

Steps to reproduce the behavior:

login cloud shell
install hpc-toolkit
git clone https://github.com/GoogleCloudPlatform/hpc-toolkit.git
cd hpc-toolkit
make
./ghpc -v
./ghpc create examples/hpc-slurm.yaml -l ERROR --vars project_id=project-id,region=europe-west1,zone=europe-west1-b
./ghpc deploy hpc-small
gcloud compute ssh hpcsmall-login-yuznj5e9-001
srun

Version (ghpc --version)
./ghpc -v
ghpc version v1.24.0
Built from 'main' branch.
Commit info: v1.24.0-0-ge64f027e

Thanks for your help!
Best,
Diego

Hi Diego,

Your blueprint looks correct.

I would recommend logging into the controller node and looking at what messages you see there. The error likely originated on the controller, and the login node could not be configured because the controller was broken.

One possibility is that, if the Slurm controller and login nodes do not have public IP addresses, they must either have private API access to GCS or be configured behind a NAT, because they need to retrieve configuration from GCS and various other Google APIs. You will likely see this in the controller logs.

Please keep me posted.

Regards,

Carlos

P.S. - please have a look at: https://github.com/GoogleCloudPlatform/hpc-toolkit/blob/main/docs/slurm-troubleshooting.md

Hi Carlos,

Thank you for your quick response.
Private Google Access has been enabled on the subnet. A cloud NAT gateway has also been deployed. The following log file is generated when trying to expand the cluster from the controller node.

2023-10-25 12:19:48,080 DEBUG: get_metadata: metadata not found (http://metadata.google.internal/computeMetadata/v1/project/attributes/hpcsmall-slurm-devel)
2023-10-25 12:19:48,080 DEBUG: fetch_devel_scripts: scripts not found in project metadata, devel mode not enabled
2023-10-25 12:19:48,083 INFO: Setting up login
2023-10-25 12:19:48,087 INFO: installing custom scripts: login_7d4hj1wo.d/ghpc_startup.sh
2023-10-25 12:19:48,088 DEBUG: install_custom_scripts: login_7d4hj1wo.d/ghpc_startup.sh
2023-10-25 12:19:48,091 INFO: Set up network storage
2023-10-25 12:19:48,100 INFO: Setting up mount (nfs) 10.111.193.90:/nfsshare to /home
2023-10-25 12:19:48,101 INFO: Setting up mount (nfs) hpcsmall-controller:/opt/apps to /opt/apps
2023-10-25 12:19:48,449 INFO: Waiting for '/home' to be mounted...
2023-10-25 12:19:48,450 INFO: Waiting for '/opt/apps' to be mounted...
2023-10-25 12:24:06,319 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/opt/apps']' timed out after 120 seconds
2023-10-25 12:24:07,321 INFO: Waiting for '/opt/apps' to be mounted...
2023-10-25 12:28:24,879 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/opt/apps']' timed out after 120 seconds
2023-10-25 12:28:24,885 ERROR: mount of path '/home' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/home']' timed out after 120 seconds
2023-10-25 12:28:25,886 INFO: Waiting for '/home' to be mounted...
2023-10-25 12:28:26,480 INFO: Waiting for '/opt/apps' to be mounted...
2023-10-25 12:28:28,906 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-10-25 12:28:30,508 INFO: Waiting for '/home' to be mounted...
2023-10-25 12:32:44,079 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/opt/apps']' timed out after 120 seconds
2023-10-25 12:32:46,640 INFO: Waiting for '/opt/apps' to be mounted...
2023-10-25 12:32:47,095 ERROR: mount of path '/home' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/home']' timed out after 120 seconds
2023-10-25 12:32:49,657 INFO: Waiting for '/home' to be mounted...
2023-10-25 12:37:04,175 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/opt/apps']' timed out after 120 seconds
2023-10-25 12:37:07,192 ERROR: mount of path '/home' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/home']' timed out after 120 seconds
2023-10-25 12:37:08,272 INFO: Waiting for '/opt/apps' to be mounted...
2023-10-25 12:37:11,289 INFO: Waiting for '/home' to be mounted...
2023-10-25 12:41:25,807 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/opt/apps']' timed out after 120 seconds
2023-10-25 12:41:28,826 ERROR: mount of path '/home' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/home']' timed out after 120 seconds
2023-10-25 12:41:32,362 INFO: Waiting for '/opt/apps' to be mounted...
2023-10-25 12:41:35,381 INFO: Waiting for '/home' to be mounted...
2023-10-25 12:45:49,871 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/opt/apps']' timed out after 120 seconds
2023-10-25 12:45:52,885 ERROR: mount of path '/home' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/home']' timed out after 120 seconds
2023-10-25 12:46:00,358 INFO: Waiting for '/opt/apps' to be mounted...
2023-10-25 12:46:03,372 INFO: Waiting for '/home' to be mounted...
2023-10-25 12:50:17,903 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/opt/apps']' timed out after 120 seconds
2023-10-25 12:50:20,917 ERROR: mount of path '/home' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/home']' timed out after 120 seconds
2023-10-25 12:50:33,905 INFO: Waiting for '/opt/apps' to be mounted...
2023-10-25 12:50:36,919 INFO: Waiting for '/home' to be mounted...
2023-10-25 12:54:51,439 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/opt/apps']' timed out after 120 seconds
2023-10-25 12:54:54,453 ERROR: mount of path '/home' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/home']' timed out after 120 seconds
2023-10-25 12:55:07,441 INFO: Waiting for '/opt/apps' to be mounted...
2023-10-25 12:55:10,455 INFO: Waiting for '/home' to be mounted...
2023-10-25 12:59:24,975 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/opt/apps']' timed out after 120 seconds
2023-10-25 12:59:27,989 ERROR: mount of path '/home' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/home']' timed out after 120 seconds
2023-10-25 12:59:40,977 INFO: Waiting for '/opt/apps' to be mounted...
2023-10-25 12:59:43,991 INFO: Waiting for '/home' to be mounted...
2023-10-25 13:03:58,511 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/opt/apps']' timed out after 120 seconds
2023-10-25 13:04:01,526 ERROR: mount of path '/home' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/home']' timed out after 120 seconds
2023-10-25 13:04:14,513 INFO: Waiting for '/opt/apps' to be mounted...
2023-10-25 13:04:17,528 INFO: Waiting for '/home' to be mounted...
2023-10-25 13:08:32,047 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/opt/apps']' timed out after 120 seconds
2023-10-25 13:08:35,063 ERROR: mount of path '/home' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/home']' timed out after 120 seconds
2023-10-25 13:08:48,048 INFO: Waiting for '/opt/apps' to be mounted...
2023-10-25 13:08:51,065 INFO: Waiting for '/home' to be mounted...
2023-10-25 13:13:05,583 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/opt/apps']' timed out after 120 seconds
2023-10-25 13:13:08,600 ERROR: mount of path '/home' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/home']' timed out after 120 seconds
2023-10-25 13:13:21,585 INFO: Waiting for '/opt/apps' to be mounted...
2023-10-25 13:13:24,602 INFO: Waiting for '/home' to be mounted...
2023-10-25 13:17:39,119 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/opt/apps']' timed out after 120 seconds
2023-10-25 13:17:42,136 ERROR: mount of path '/home' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/home']' timed out after 120 seconds
2023-10-25 13:17:55,121 INFO: Waiting for '/opt/apps' to be mounted...
2023-10-25 13:17:58,138 INFO: Waiting for '/home' to be mounted...
2023-10-25 13:22:12,655 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/opt/apps']' timed out after 120 seconds
2023-10-25 13:22:15,669 ERROR: mount of path '/home' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/home']' timed out after 120 seconds
2023-10-25 13:22:28,657 INFO: Waiting for '/opt/apps' to be mounted...
2023-10-25 13:22:31,672 INFO: Waiting for '/home' to be mounted...
2023-10-25 13:26:46,191 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/opt/apps']' timed out after 120 seconds
2023-10-25 13:26:49,206 ERROR: mount of path '/home' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/home']' timed out after 120 seconds
2023-10-25 13:27:02,192 INFO: Waiting for '/opt/apps' to be mounted...
2023-10-25 13:27:05,208 INFO: Waiting for '/home' to be mounted...
2023-10-25 13:31:19,727 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/opt/apps']' timed out after 120 seconds
2023-10-25 13:31:22,744 ERROR: mount of path '/home' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/home']' timed out after 120 seconds
2023-10-25 13:31:35,729 INFO: Waiting for '/opt/apps' to be mounted...
2023-10-25 13:31:38,746 INFO: Waiting for '/home' to be mounted...
2023-10-25 13:35:53,263 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/opt/apps']' timed out after 120 seconds
2023-10-25 13:35:56,277 ERROR: mount of path '/home' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/home']' timed out after 120 seconds
2023-10-25 13:36:09,265 INFO: Waiting for '/opt/apps' to be mounted...
2023-10-25 13:36:12,279 INFO: Waiting for '/home' to be mounted...
2023-10-25 13:40:26,799 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/opt/apps']' timed out after 120 seconds
2023-10-25 13:40:29,813 ERROR: mount of path '/home' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/home']' timed out after 120 seconds
2023-10-25 13:40:42,800 INFO: Waiting for '/opt/apps' to be mounted...
2023-10-25 13:40:45,815 INFO: Waiting for '/home' to be mounted...
2023-10-25 13:45:00,335 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/opt/apps']' timed out after 120 seconds
2023-10-25 13:45:03,350 ERROR: mount of path '/home' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/home']' timed out after 120 seconds
2023-10-25 13:45:16,337 INFO: Waiting for '/opt/apps' to be mounted...
2023-10-25 13:45:19,352 INFO: Waiting for '/home' to be mounted...
2023-10-25 13:49:33,871 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/opt/apps']' timed out after 120 seconds
2023-10-25 13:49:36,888 ERROR: mount of path '/home' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/home']' timed out after 120 seconds
2023-10-25 13:49:49,873 INFO: Waiting for '/opt/apps' to be mounted...
2023-10-25 13:49:52,890 INFO: Waiting for '/home' to be mounted...
2023-10-25 13:54:07,407 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/opt/apps']' timed out after 120 seconds
2023-10-25 13:54:10,423 ERROR: mount of path '/home' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/home']' timed out after 120 seconds
2023-10-25 13:54:23,408 INFO: Waiting for '/opt/apps' to be mounted...
2023-10-25 13:54:26,424 INFO: Waiting for '/home' to be mounted...
2023-10-25 13:58:40,943 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/opt/apps']' timed out after 120 seconds
2023-10-25 13:58:43,959 ERROR: mount of path '/home' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/home']' timed out after 120 seconds
2023-10-25 13:58:56,945 INFO: Waiting for '/opt/apps' to be mounted...
2023-10-25 13:58:59,961 INFO: Waiting for '/home' to be mounted...
2023-10-25 14:03:14,479 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/opt/apps']' timed out after 120 seconds
2023-10-25 14:03:17,495 ERROR: mount of path '/home' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/home']' timed out after 120 seconds
2023-10-25 14:03:30,481 INFO: Waiting for '/opt/apps' to be mounted...
2023-10-25 14:03:33,496 INFO: Waiting for '/home' to be mounted...
2023-10-25 14:07:48,015 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/opt/apps']' timed out after 120 seconds
2023-10-25 14:07:51,031 ERROR: mount of path '/home' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/home']' timed out after 120 seconds
2023-10-25 14:08:04,016 INFO: Waiting for '/opt/apps' to be mounted...
2023-10-25 14:08:07,033 INFO: Waiting for '/home' to be mounted...
2023-10-25 14:12:21,551 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/opt/apps']' timed out after 120 seconds
2023-10-25 14:12:24,568 ERROR: mount of path '/home' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/home']' timed out after 120 seconds
2023-10-25 14:12:37,553 INFO: Waiting for '/opt/apps' to be mounted...
2023-10-25 14:12:40,570 INFO: Waiting for '/home' to be mounted...
2023-10-25 14:16:55,087 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/opt/apps']' timed out after 120 seconds
2023-10-25 14:16:58,102 ERROR: mount of path '/home' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/home']' timed out after 120 seconds
2023-10-25 14:17:11,089 INFO: Waiting for '/opt/apps' to be mounted...
2023-10-25 14:17:14,103 INFO: Waiting for '/home' to be mounted...
2023-10-25 14:21:28,623 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/opt/apps']' timed out after 120 seconds
2023-10-25 14:21:31,641 ERROR: mount of path '/home' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/home']' timed out after 120 seconds
2023-10-25 14:21:44,625 INFO: Waiting for '/opt/apps' to be mounted...
2023-10-25 14:21:47,643 INFO: Waiting for '/home' to be mounted...
2023-10-25 14:26:02,287 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/opt/apps']' timed out after 120 seconds
2023-10-25 14:26:05,305 ERROR: mount of path '/home' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/home']' timed out after 120 seconds
2023-10-25 14:26:18,289 INFO: Waiting for '/opt/apps' to be mounted...
2023-10-25 14:26:21,307 INFO: Waiting for '/home' to be mounted...
2023-10-25 14:30:35,823 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/opt/apps']' timed out after 120 seconds
2023-10-25 14:30:38,840 ERROR: mount of path '/home' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/home']' timed out after 120 seconds
2023-10-25 14:30:51,825 INFO: Waiting for '/opt/apps' to be mounted...
2023-10-25 14:30:54,842 INFO: Waiting for '/home' to be mounted...
2023-10-25 14:35:09,359 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/opt/apps']' timed out after 120 seconds
2023-10-25 14:35:12,373 ERROR: mount of path '/home' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/home']' timed out after 120 seconds
2023-10-25 14:35:25,361 INFO: Waiting for '/opt/apps' to be mounted...
2023-10-25 14:35:28,375 INFO: Waiting for '/home' to be mounted...
2023-10-25 14:39:42,895 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/opt/apps']' timed out after 120 seconds
2023-10-25 14:39:45,911 ERROR: mount of path '/home' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/home']' timed out after 120 seconds
2023-10-25 14:39:58,897 INFO: Waiting for '/opt/apps' to be mounted...
2023-10-25 14:40:01,913 INFO: Waiting for '/home' to be mounted...
2023-10-25 14:44:16,431 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/opt/apps']' timed out after 120 seconds
2023-10-25 14:44:19,447 ERROR: mount of path '/home' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/home']' timed out after 120 seconds
2023-10-25 14:44:32,433 INFO: Waiting for '/opt/apps' to be mounted...
2023-10-25 14:44:35,449 INFO: Waiting for '/home' to be mounted...
2023-10-25 14:48:49,967 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/opt/apps']' timed out after 120 seconds
2023-10-25 14:48:52,986 ERROR: mount of path '/home' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/home']' timed out after 120 seconds
2023-10-25 14:49:05,969 INFO: Waiting for '/opt/apps' to be mounted...
2023-10-25 14:49:08,988 INFO: Waiting for '/home' to be mounted...
2023-10-25 14:53:23,503 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/opt/apps']' timed out after 120 seconds
2023-10-25 14:53:26,518 ERROR: mount of path '/home' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/home']' timed out after 120 seconds
2023-10-25 14:53:39,505 INFO: Waiting for '/opt/apps' to be mounted...
2023-10-25 14:53:42,520 INFO: Waiting for '/home' to be mounted...
2023-10-25 14:57:57,039 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/opt/apps']' timed out after 120 seconds
2023-10-25 14:58:00,054 ERROR: mount of path '/home' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/home']' timed out after 120 seconds
2023-10-25 14:58:13,041 INFO: Waiting for '/opt/apps' to be mounted...
2023-10-25 14:58:16,056 INFO: Waiting for '/home' to be mounted...
2023-10-25 15:02:30,575 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/opt/apps']' timed out after 120 seconds
2023-10-25 15:02:33,593 ERROR: mount of path '/home' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/home']' timed out after 120 seconds
2023-10-25 15:02:46,577 INFO: Waiting for '/opt/apps' to be mounted...
2023-10-25 15:02:49,595 INFO: Waiting for '/home' to be mounted...
2023-10-25 15:07:04,111 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/opt/apps']' timed out after 120 seconds
2023-10-25 15:07:07,127 ERROR: mount of path '/home' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/home']' timed out after 120 seconds
2023-10-25 15:07:20,113 INFO: Waiting for '/opt/apps' to be mounted...
2023-10-25 15:07:23,129 INFO: Waiting for '/home' to be mounted...
2023-10-25 15:11:37,647 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/opt/apps']' timed out after 120 seconds
2023-10-25 15:11:40,662 ERROR: mount of path '/home' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/home']' timed out after 120 seconds
2023-10-25 15:11:53,649 INFO: Waiting for '/opt/apps' to be mounted...
2023-10-25 15:11:56,664 INFO: Waiting for '/home' to be mounted...
2023-10-25 15:16:11,183 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/opt/apps']' timed out after 120 seconds
2023-10-25 15:16:14,199 ERROR: mount of path '/home' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/home']' timed out after 120 seconds
2023-10-25 15:16:27,184 INFO: Waiting for '/opt/apps' to be mounted...
2023-10-25 15:16:30,201 INFO: Waiting for '/home' to be mounted...
2023-10-25 15:20:44,719 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/opt/apps']' timed out after 120 seconds
2023-10-25 15:20:47,737 ERROR: mount of path '/home' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/home']' timed out after 120 seconds
2023-10-25 15:20:47,738 ERROR: TimeoutExpired:
    command=['mount', '/opt/apps']
    timeout=120
    stdout:
    stderr:
mount.nfs: Connection timed out
2023-10-25 15:20:47,738 ERROR: Aborting setup...
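
A log like the one above is long and repetitive, so it can help to condense it before sharing. The following is a minimal Python sketch that counts mount failures per path and exception class. The regular expression is an assumption derived only from the line format visible in this log, and `summarize_mount_failures` is a hypothetical helper, not part of the toolkit.

```python
import re
from collections import Counter

# Pattern inferred from the log lines in this thread (an assumption), e.g.:
# ... ERROR: mount of path '/opt/apps' failed: <class 'subprocess.TimeoutExpired'>: ...
MOUNT_FAILURE = re.compile(
    r"ERROR: mount of path '(?P<path>[^']+)' failed: <class '(?P<error>[^']+)'>"
)


def summarize_mount_failures(log_text):
    """Count mount failures per (mount path, exception class)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = MOUNT_FAILURE.search(line)
        if match:
            counts[(match["path"], match["error"])] += 1
    return counts


# A few lines copied from the log above, used as sample input.
sample = """\
2023-10-25 12:24:06,319 ERROR: mount of path '/opt/apps' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/opt/apps']' timed out after 120 seconds
2023-10-25 12:28:24,885 ERROR: mount of path '/home' failed: <class 'subprocess.TimeoutExpired'>: Command '['mount', '/home']' timed out after 120 seconds
2023-10-25 12:28:28,906 ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
2023-10-25 12:28:30,508 INFO: Waiting for '/home' to be mounted...
"""

for (path, error), count in sorted(summarize_mount_failures(sample).items()):
    print(f"{path}: {error} x{count}")
```

Run against the full log, a summary dominated by `subprocess.TimeoutExpired` for both `/home` and `/opt/apps` would suggest a connectivity problem rather than a misconfigured export.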

Thanks again for your response.

Regards,

Diego

Hi Diego,

I assume you redeployed the cluster since the NAT was enabled.
I also assume that you are able to ping the internet from the controller even though that's not the problem here.

It really looks like your controller is unable to connect to the NFS server. I suspect your firewall rules do not allow you to reach it.

I would do the following tests:

  1. Can you see the Filestore instance in the Cloud console (https://console.cloud.google.com/filestore/instances)? Does the IP match 10.111.193.90?
  2. Can you ping the internal IP address of the login node? Perhaps there is no rule allowing internal traffic within your VPC (this is usually added automatically when you create your own VPC). We recommend allowing unlimited traffic (all ports) between the internal IPs in the VPC, including
  3. Could you try troubleshooting your Filestore instance, with something like sudo showmount -e 10.111.193.90 (https://cloud.google.com/filestore/docs/mount-issues)?
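
If ICMP is blocked, ping alone can be misleading, so a TCP-level check may complement the tests above. The following is a rough Python sketch that tries to open a TCP connection to the NFS-related ports on the server. The port list (111 for the portmapper, 2049 for NFS) is an assumption based on conventional NFS port numbers, and `tcp_port_open` and `check_nfs_ports` are hypothetical helper names.

```python
import socket


def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def check_nfs_ports(host, ports=(111, 2049)):
    """Map each port to whether it accepted a TCP connection.

    The default ports are assumed conventional values (portmapper, NFS).
    """
    return {port: tcp_port_open(host, port) for port in ports}


if __name__ == "__main__":
    # 10.111.193.90 is the Filestore IP that appears in the log above.
    for port, is_open in check_nfs_ports("10.111.193.90").items():
        print(f"port {port}: {'reachable' if is_open else 'NOT reachable'}")
```

Running this from the login or controller node, and again with the other node's internal IP as the host, would show whether traffic is being dropped before it reaches the NFS server.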

Hi Carlos,

We added a firewall rule for internal traffic, and srun now works from the login node.
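
The exact rule is not shown in this thread. As a hedged illustration only, a rule allowing all internal traffic could look like the command assembled by the Python sketch below. The rule name `allow-internal-hpc` and the source range `10.128.0.0/9` are assumptions (the range is the one used by auto-mode VPC networks; a shared or custom VPC will need its own subnet CIDR), and `internal_allow_rule_cmd` is a hypothetical helper.

```python
import shlex


def internal_allow_rule_cmd(rule_name, network, source_range):
    """Build a gcloud command that allows all internal TCP/UDP/ICMP ingress."""
    return [
        "gcloud", "compute", "firewall-rules", "create", rule_name,
        f"--network={network}",
        "--direction=INGRESS",
        "--allow=tcp:0-65535,udp:0-65535,icmp",
        f"--source-ranges={source_range}",
    ]


# All three arguments are placeholders; replace them with your own values.
cmd = internal_allow_rule_cmd("allow-internal-hpc", "default", "10.128.0.0/9")

# Print the command for review instead of executing it.
print(shlex.join(cmd))
```

Printing rather than running the command keeps the sketch safe to try; the output can be reviewed against your organization's policies before it is executed.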

Thanks a lot again for your help and quick response!

Best,
Diego

Fantastic! Glad to see it is now working for you.