GoogleCloudPlatform/cluster-toolkit

NFS server file system bug

maxveliaminov opened this issue · 4 comments

Describe the bug

When using nfs-server as the file system, the boot disk of the nfs-server instance is shared while the attached disk remains unmounted, which leads to a smaller-than-expected shared volume; in this configuration the additional disk contains CentOS and a 20 GB file system. In some cases it is the other way around: the attached disk gets mounted as root and is shared, while the boot disk remains unmounted.
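
A quick way to confirm which disk actually backs the export is to inspect the nfs-server VM directly. This is a minimal sketch, assuming the layout described in this report where the attached data disk should be mounted at /exports/data:

# On the nfs-server instance: list block devices and their mount points.
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT

# If the size reported for /exports/data matches the boot disk, or the path
# is missing entirely, the attached disk was never mounted.
df -h / /exports/data

# Show what is actually exported over NFS.
sudo exportfs -v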

Steps to reproduce

Steps to reproduce the behavior:

  1. Create an HPC cluster deployment with nfs-server as the file system.

Expected behavior

Boot disk mounted as root; additional disk mounted at /exports/data.

Actual behavior

Either the boot disk is mounted as root and the additional disk is not mounted at all, or the additional disk is mounted as root and the boot disk is not mounted at all.

Screenshots

NFS server on the left, controller node on the right.
[screenshot attached in the original issue]

Execution environment

  • OS: [macOS, ubuntu, ...]
  • Shell (To find this, run ps -p $$): [bash, zsh, ...]
  • go version:

Additional context

Add any other context about the problem here.

Hi @maxveliaminov, could you please share your blueprint (excluding any sensitive information)?

Hi @mr0re1, here is one:

blueprint_name: palm-model

vars:
  project_id: <PROJECT_ID>
  deployment_name: <PROJECT_ID>
  region: <REGION>
  zone: <ZONE>
  machine_type: <MACHINE_TYPE>
  node_count_dynamic_max: <NODE_COUNT_DYNAMIC_MAX>
  slurm_cluster_name: palm1
  disable_public_ips: true
  enable_shielded_vm: true

deployment_groups:
  - group: primary
    modules:
      - id: network1
        source: modules/network/vpc
        kind: terraform
      - id: appsfs
        source: community/modules/file-system/nfs-server
        kind: terraform
        use:
          - network1
        settings:
          machine_type: n2-standard-2
          auto_delete_disk: true
          local_mounts: ['/apps']
      - id: spack
        source: community/modules/scripts/spack-install
        settings:
          install_dir: /apps/spack
          spack_url: https://github.com/spack/spack
          spack_ref: v0.19.1
          log_file: /apps/spack.log
          spack_cache_url:
            - mirror_name: <SPACK_CACHE_NAME>
              mirror_url: <SPACK_CACHE_URL>
          configs:
            - type: file
              scope: defaults
              content: |
                modules:
                  default:
                    tcl:
                      hash_length: 0
                      all:
                        conflict:
                          - '{name}'
                      projections:
                        all: '{name}/{version}-{compiler.name}-{compiler.version}'
          compilers:
            - gcc@8.2.0%gcc@4.8.5 target=x86_64
          environments:
            - name: palm
              content: |
                spack:
                  definitions:
                  - compilers:
                    - gcc@8.2.0
                  - mpis:
                    - intel-mpi@2018.4.274
                  - python:
                    - python@3.9.10
                  - python_packages:
                    - py-pip@22.2.2
                    - py-wheel@0.37.1
                    - py-google-cloud-storage@1.18.0
                    - py-ansible@2.9.2
                  - packages:
                    - gcc@8.2.0
                    - coreutils@8.32
                    - cmake@3.24.3
                    - flex@2.6.4
                    - bison@3.8.2
                  - mpi_packages:
                    - netcdf-c@4.7.4
                    - netcdf-fortran@4.5.3
                    - parallel-netcdf@1.12.2
                    - fftw@3.3.10
                  specs:
                  - matrix:
                    - - $packages
                    - - $%compilers
                  - matrix:
                    - - $python
                    - - $%compilers
                  - matrix:
                    - - $python_packages
                    - - $%compilers
                    - - $^python
                  - matrix:
                    - - $mpis
                    - - $%compilers
                  - matrix:
                    - - $mpi_packages
                    - - $%compilers
                    - - $^mpis

      - id: spack_startup
        source: modules/scripts/startup-script
        kind: terraform
        use:
          - network1
        settings:
          runners:
            - $(appsfs.mount_runner)
            - $(spack.install_spack_deps_runner)
            - $(spack.install_spack_runner)
            - type: data
              destination: /apps/palm/palm-install.yaml
              content: |
                123
            - type: data
              destination: /apps/spack/activate-palm-env.sh
              content: |
                456
            - type: data
              destination: /apps/palm/palm-install.sh
              content: |
                789
            - type: shell
              content: sudo chmod -R 777 /apps
              destination: chmod-apps-dir.sh
            - type: shell
              content: 'shutdown -h now'
              destination: shutdown.sh

      - id: spack_builder
        source: modules/compute/vm-instance
        kind: terraform
        use:
          - network1
          - appsfs
          - spack_startup
        settings:
          name_prefix: spack-builder
      - id: homefs
        source: community/modules/file-system/nfs-server
        kind: terraform
        use:
          - network1
        settings:
          machine_type: n2-standard-2
          auto_delete_disk: true
          local_mounts: ['/home']
      - id: debug_node_group
        source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
        use:
          - network1
          - homefs
          - appsfs
        settings:
          node_count_dynamic_max: <DEBUG_MAX_NODE_COUNT>

      - source: community/modules/compute/schedmd-slurm-gcp-v5-partition
        kind: terraform
        id: debug_partition
        use:
          - network1
          - homefs
          - appsfs
          - debug_node_group
        settings:
          is_default: true
          enable_shielded_vm: null
          machine_type: null
          node_count_dynamic_max: null
          partition_name: debug

      - id: compute_node_group
        source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
        use:
          - network1
          - homefs
          - appsfs

      - source: community/modules/compute/schedmd-slurm-gcp-v5-partition
        kind: terraform
        id: compute_partition
        use:
          - network1
          - homefs
          - appsfs
          - compute_node_group
        settings:
          enable_shielded_vm: null
          machine_type: null
          node_count_dynamic_max: null
          partition_name: compute

      - source: community/modules/scheduler/schedmd-slurm-gcp-v5-controller
        kind: terraform
        id: slurm_controller
        use:
          - network1
          - debug_partition
          - compute_partition
          - homefs
          - appsfs
        settings:
          machine_type: n2-standard-8

      - source: community/modules/scheduler/schedmd-slurm-gcp-v5-login
        kind: terraform
        id: slurm_login
        use:
          - network1
          - homefs
          - appsfs
          - slurm_controller
        settings:
          machine_type: n2-standard-8
          disable_login_public_ips: true
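
For completeness, the shared file systems defined above (/apps and /home from the two nfs-server modules, per the local_mounts settings) can be checked from a login or compute node with a sketch like this:

# From a login or compute node: confirm both NFS mounts are present and
# report the expected sizes.
df -h /apps /home

# Show the NFS source backing each mount point.
mount | grep -E ' /(apps|home) '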

@maxveliaminov, we found the root cause and are working on a fix; expect it to land in develop early next week.

@maxveliaminov, #1406 contains a fix for this problem. Could you please build ghpc from the develop branch and confirm whether it fixes your problem? Please re-open the issue if needed.
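
A rough sketch of building ghpc from the develop branch, assuming the toolkit's standard make-based build (repository URL taken from the header above):

# Clone the toolkit and switch to the develop branch containing the fix.
git clone https://github.com/GoogleCloudPlatform/cluster-toolkit.git
cd cluster-toolkit
git checkout develop

# Build the ghpc binary (requires Go) and verify the version.
make
./ghpc --version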