NFS server file system bug
maxveliaminov opened this issue · 4 comments
Describe the bug
When using nfs-server as a file system, the boot disk of the nfs-server instance is shared while the attached disk remains unmounted, which leads to a smaller-than-expected shared volume size. In this configuration the additional disk also contains CentOS and has a 20 GB file system. In some cases it is the other way around: the attached disk gets mounted as root and is shared, and the boot disk remains unmounted.
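To illustrate the symptom, here is a rough check one can run on the nfs-server VM over SSH (a sketch only; device names, the /exports/data path, and the availability of showmount are assumptions and may differ on your instance):

  # Show which block device backs / and whether the data disk is mounted at all
  lsblk -o NAME,SIZE,TYPE,MOUNTPOINT
  df -h /
  # Path assumed from the expected layout below; it may not exist when the bug occurs
  df -h /exports/data
  # Confirm which directory is actually exported over NFS
  cat /etc/exports
  showmount -e localhost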
Steps to reproduce
Steps to reproduce the behavior:
- Create an HPC cluster deployment with nfs-server as the file system (see the command sketch below)
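For reference, a rough command sketch of the reproduction, assuming ghpc was built from the HPC Toolkit repository and the blueprint shared in the comments below is saved as palm-model.yaml (the file name and deployment folder name are illustrative):

  # Generate the deployment folder from the blueprint
  ./ghpc create palm-model.yaml
  # Deploy the primary group
  terraform -chdir=<deployment_name>/primary init
  terraform -chdir=<deployment_name>/primary apply
  # Then SSH to the nfs-server VM and compare the size of the exported
  # file system with the size of the attached data disk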
Expected behavior
Boot disk mounted as root; additional disk mounted as /exports/data.
Actual behavior
Boot disk mounted as root and the additional disk not mounted at all, OR the additional disk mounted as root (and shared) and the boot disk not mounted at all.
Version (ghpc --version)
Blueprint
Expanded Blueprint
Output and logs
Screenshots
NFS server on the left, controller node on the right
Execution environment
- OS: [macOS, ubuntu, ...]
- Shell (to find this, run ps -p $$): [bash, zsh, ...]
- go version:
Additional context
Hi @maxveliaminov, could you please share your blueprint (excluding any sensitive information)?
Hi @mr0re1, here is one:
blueprint_name: palm-model

vars:
  project_id: <PROJECT_ID>
  deployment_name: <PROJECT_ID>
  region: <REGION>
  zone: <ZONE>
  machine_type: <MACHINE_TYPE>
  node_count_dynamic_max: <NODE_COUNT_DYNAMIC_MAX>
  slurm_cluster_name: palm1
  disable_public_ips: true
  enable_shielded_vm: true

deployment_groups:
- group: primary
  modules:
  - id: network1
    source: modules/network/vpc
    kind: terraform
  - id: appsfs
    source: community/modules/file-system/nfs-server
    kind: terraform
    use:
    - network1
    settings:
      machine_type: n2-standard-2
      auto_delete_disk: true
      local_mounts: ['/apps']
  - id: spack
    source: community/modules/scripts/spack-install
    settings:
      install_dir: /apps/spack
      spack_url: https://github.com/spack/spack
      spack_ref: v0.19.1
      log_file: /apps/spack.log
      spack_cache_url:
      - mirror_name: <SPACK_CACHE_NAME>
        mirror_url: <SPACK_CACHE_URL>
      configs:
      - type: file
        scope: defaults
        content: |
          modules:
            default:
              tcl:
                hash_length: 0
                all:
                  conflict:
                  - '{name}'
                projections:
                  all: '{name}/{version}-{compiler.name}-{compiler.version}'
      compilers:
      - gcc@8.2.0%gcc@4.8.5 target=x86_64
      environments:
      - name: palm
        content: |
          spack:
            definitions:
            - compilers:
              - gcc@8.2.0
            - mpis:
              - intel-mpi@2018.4.274
            - python:
              - python@3.9.10
            - python_packages:
              - py-pip@22.2.2
              - py-wheel@0.37.1
              - py-google-cloud-storage@1.18.0
              - py-ansible@2.9.2
            - packages:
              - gcc@8.2.0
              - coreutils@8.32
              - cmake@3.24.3
              - flex@2.6.4
              - bison@3.8.2
            - mpi_packages:
              - netcdf-c@4.7.4
              - netcdf-fortran@4.5.3
              - parallel-netcdf@1.12.2
              - fftw@3.3.10
            specs:
            - matrix:
              - - $packages
              - - $%compilers
            - matrix:
              - - $python
              - - $%compilers
            - matrix:
              - - $python_packages
              - - $%compilers
              - - $^python
            - matrix:
              - - $mpis
              - - $%compilers
            - matrix:
              - - $mpi_packages
              - - $%compilers
              - - $^mpis
  - id: spack_startup
    source: modules/scripts/startup-script
    kind: terraform
    use:
    - network1
    settings:
      runners:
      - $(appsfs.mount_runner)
      - $(spack.install_spack_deps_runner)
      - $(spack.install_spack_runner)
      - type: data
        destination: /apps/palm/palm-install.yaml
        content: |
          123
      - type: data
        destination: /apps/spack/activate-palm-env.sh
        content: |
          456
      - type: data
        destination: /apps/palm/palm-install.sh
        content: |
          789
      - type: shell
        content: sudo chmod -R 777 /apps
        destination: chmod-apps-dir.sh
      - type: shell
        content: 'shutdown -h now'
        destination: shutdown.sh
  - id: spack_builder
    source: modules/compute/vm-instance
    kind: terraform
    use:
    - network1
    - appsfs
    - spack_startup
    settings:
      name_prefix: spack-builder
  - id: homefs
    source: community/modules/file-system/nfs-server
    kind: terraform
    use:
    - network1
    settings:
      machine_type: n2-standard-2
      auto_delete_disk: true
      local_mounts: ['/home']
  - id: debug_node_group
    source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
    use:
    - network1
    - homefs
    - appsfs
    settings:
      node_count_dynamic_max: <DEBUG_MAX_NODE_COUNT>
  - source: community/modules/compute/schedmd-slurm-gcp-v5-partition
    kind: terraform
    id: debug_partition
    use:
    - network1
    - homefs
    - appsfs
    - debug_node_group
    settings:
      is_default: true
      enable_shielded_vm: null
      machine_type: null
      node_count_dynamic_max: null
      partition_name: debug
  - id: compute_node_group
    source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
    use:
    - network1
    - homefs
    - appsfs
  - source: community/modules/compute/schedmd-slurm-gcp-v5-partition
    kind: terraform
    id: compute_partition
    use:
    - network1
    - homefs
    - appsfs
    - compute_node_group
    settings:
      enable_shielded_vm: null
      machine_type: null
      node_count_dynamic_max: null
      partition_name: compute
  - source: community/modules/scheduler/schedmd-slurm-gcp-v5-controller
    kind: terraform
    id: slurm_controller
    use:
    - network1
    - debug_partition
    - compute_partition
    - homefs
    - appsfs
    settings:
      machine_type: n2-standard-8
  - source: community/modules/scheduler/schedmd-slurm-gcp-v5-login
    kind: terraform
    id: slurm_login
    use:
    - network1
    - homefs
    - appsfs
    - slurm_controller
    settings:
      machine_type: n2-standard-8
      disable_login_public_ips: true
@maxveliaminov, we found the root cause and are working on a fix; expect it to be fixed in develop early next week.
@maxveliaminov, #1406 contains a fix for this problem. Could you please build ghpc from the develop branch and confirm whether it fixes your problem? Please re-open the issue if needed.
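For anyone following along, a rough sketch of building ghpc from the develop branch (the repository URL and make target are assumed from the HPC Toolkit README):

  git clone https://github.com/GoogleCloudPlatform/hpc-toolkit.git
  cd hpc-toolkit
  git checkout develop
  make               # builds the ghpc binary at the repo root
  ./ghpc --version   # confirm the newly built version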