iterative/cml

Default AMI Fails to detect Nvidia driver on AWS g6e


Getting:

Digest: sha256:5e8ed922ecacdb1071096eebef5af11563fd0c2c8bce9143ea3898768994080f
  Status: Downloaded newer image for iterativeai/cml:0-dvc3-base1-gpu
  docker.io/iterativeai/cml:0-dvc3-base1-gpu
  /usr/bin/docker create --name 41bde5f6557b4c82bb0400b08e5ca5b0_iterativeaicml0dvc3base1gpu_78f5fb --label 380bf3 --workdir /__w/SecretModels/SecretModels --network github_network_5168857de2994b2fabc54139db02ee1f --gpus all -e "HOME=/github/home" -e GITHUB_ACTIONS=true -e CI=true -v "/var/run/docker.sock":"/var/run/docker.sock" -v "/tmp/tmp.hn0PBn3aGx/.cml/cml-6xoxrssodk-1rrodbwp-hli9n6jg/_work":"/__w" -v "/tmp/tmp.hn0PBn3aGx/.cml/cml-6xoxrssodk-1rrodbwp-hli9n6jg/externals":"/__e":ro -v "/tmp/tmp.hn0PBn3aGx/.cml/cml-6xoxrssodk-1rrodbwp-hli9n6jg/_work/_temp":"/__w/_temp" -v "/tmp/tmp.hn0PBn3aGx/.cml/cml-6xoxrssodk-1rrodbwp-hli9n6jg/_work/_actions":"/__w/_actions" -v "/tmp/tmp.hn0PBn3aGx/.cml/cml-6xoxrssodk-1rrodbwp-hli9n6jg/_work/_tool":"/__w/_tool" -v "/tmp/tmp.hn0PBn3aGx/.cml/cml-6xoxrssodk-1rrodbwp-hli9n6jg/_work/_temp/_github_home":"/github/home" -v "/tmp/tmp.hn0PBn3aGx/.cml/cml-6xoxrssodk-1rrodbwp-hli9n6jg/_work/_temp/_github_workflow":"/github/workflow" --entrypoint "tail" iterativeai/cml:0-dvc
  215e811a3e2b95ada680b3f3db404ac68abec62295af2daf3a516db6e0d4099a
  /usr/bin/docker start 215e811a3e2b95ada680b3f3db404ac68abec62295af2daf3a516db6e0d4099a
  Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
  nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown
  Error: failed to start containers: 215e811a3e2b95ada680b3f3db404ac68abec62295af2daf3a516db6e0d4099a
  Error: Docker start fail with exit code 1

from a setup like this:

name: model-style-train-on-manual_call

on:
  workflow_dispatch:
    inputs:
      model_name:
        description: 'Hugging Face model name to use for training'
        required: true
        default: 'euclaise/gpt-neox-122m-minipile-digits'

jobs:
  launch-runner:
    runs-on: ubuntu-latest
    permissions:
      contents: write
      actions: write
    steps:
      - uses: actions/setup-node@v3
        with:
          node-version: '16'
      - uses: actions/setup-python@v4
        with:
          python-version: '3.x'
      - uses: actions/checkout@v3
      - uses: iterative/setup-cml@v2
      - name: Deploy runner on EC2
        env:
          REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
          AWS_ACCESS_KEY_ID: ${{ secrets.CML_AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.CML_AWS_SECRET_ACCESS_KEY }}
        run: |
          cml runner launch \
              --cloud=aws \
              --cloud-hdd-size=256 \
              --cloud-region=us-west-2 \
              --cloud-type=g6e.xlarge \
              --cloud-gpu=v100 \
              --labels=cml-gpu 

  run:
    needs: launch-runner
    runs-on: [self-hosted, cml-gpu]
    container:
      image: docker://iterativeai/cml:0-dvc3-base1-gpu
      options: --gpus all
    timeout-minutes: 40000
    permissions:
      contents: read
      actions: write
    steps:
      - uses: actions/setup-node@v3
        with:
          node-version: '16'
      - uses: actions/checkout@v3
      - uses: robinraju/release-downloader@v1
        with:
          tag: 'style'
          fileName: '*.jsonl'
      - name: Train models
        env:
          GITHUB_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
          REPO_TOKEN: ${{ github.token }}
          DEBIAN_FRONTEND: noninteractive
          MODEL_NAME: ${{ github.event.inputs.model_name }}
        run: |
          echo $NODE_OPTIONS

This issue could be literally anything related to GPU drivers.

Please run a non-GPU workload like sleep infinity and SSH into the instance using either these instructions or e.g. mxschmitt/action-tmate; then take a look at journalctl in case the GPU drivers failed to install, and run nvidia-smi to check whether the host detects the GPU outside the container runtime...

I currently can't be of much help, but with these hints you should be able to find out what's happening.
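For reference, a debugging job along those lines could look like the sketch below. The job name debug is arbitrary; it assumes the launch-runner job and cml-gpu label from the workflow above, and that the instance user has passwordless sudo for journalctl. The host checks and the tmate step are only ways to inspect the instance, not a confirmed fix.

  debug:
    needs: launch-runner
    # No container: block here, so these commands run directly on the EC2 host
    # and see the host's driver state rather than the container runtime's.
    runs-on: [self-hosted, cml-gpu]
    steps:
      - name: Check GPU driver on the host
        run: |
          # Is the NVIDIA kernel module loaded at all?
          lsmod | grep nvidia || echo "nvidia kernel module not loaded"
          # Does the host itself detect the GPU?
          nvidia-smi || echo "nvidia-smi failed on the host"
          # Any driver or CUDA related errors during boot? (assumes passwordless sudo)
          sudo journalctl -b --no-pager | grep -iE 'nvidia|cuda' | tail -n 50 || true
      - name: Keep the instance open for SSH inspection
        uses: mxschmitt/action-tmate@v3
        timeout-minutes: 30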

@OLSecret @0x2b3bfa0 is it possible to specify an AMI ID that I already have in AWS, either private or public? I am getting CUDA 11.4 with a T4 GPU when launching a g4dn.xlarge instance this way, whereas launching it manually I get CUDA 12.4 with the same T4 GPU.
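For what it's worth, the CUDA version that nvidia-smi prints is the highest CUDA version the installed host driver supports, not the toolkit inside the container, so the 11.4 vs 12.4 difference probably comes down to the driver version on the default AMI. A quick way to see both values from inside the GPU job; the step name is just illustrative:

      - name: Report driver and CUDA versions
        run: |
          # Driver version and the CUDA version it supports, exposed to the container by the host
          nvidia-smi
          # CUDA toolkit actually installed in the image, if nvcc is present
          nvcc --version || echo "nvcc not installed in this image"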