Default AMI fails to detect NVIDIA driver on AWS g6e
Getting:
Digest: sha256:5e8ed922ecacdb1071096eebef5af11563fd0c2c8bce9143ea3898768994080f
Status: Downloaded newer image for iterativeai/cml:0-dvc3-base1-gpu
docker.io/iterativeai/cml:0-dvc3-base1-gpu
/usr/bin/docker create --name 41bde5f6557b4c82bb0400b08e5ca5b0_iterativeaicml0dvc3base1gpu_78f5fb --label 380bf3 --workdir /__w/SecretModels/SecretModels --network github_network_5168857de2994b2fabc54139db02ee1f --gpus all -e "HOME=/github/home" -e GITHUB_ACTIONS=true -e CI=true -v "/var/run/docker.sock":"/var/run/docker.sock" -v "/tmp/tmp.hn0PBn3aGx/.cml/cml-6xoxrssodk-1rrodbwp-hli9n6jg/_work":"/__w" -v "/tmp/tmp.hn0PBn3aGx/.cml/cml-6xoxrssodk-1rrodbwp-hli9n6jg/externals":"/__e":ro -v "/tmp/tmp.hn0PBn3aGx/.cml/cml-6xoxrssodk-1rrodbwp-hli9n6jg/_work/_temp":"/__w/_temp" -v "/tmp/tmp.hn0PBn3aGx/.cml/cml-6xoxrssodk-1rrodbwp-hli9n6jg/_work/_actions":"/__w/_actions" -v "/tmp/tmp.hn0PBn3aGx/.cml/cml-6xoxrssodk-1rrodbwp-hli9n6jg/_work/_tool":"/__w/_tool" -v "/tmp/tmp.hn0PBn3aGx/.cml/cml-6xoxrssodk-1rrodbwp-hli9n6jg/_work/_temp/_github_home":"/github/home" -v "/tmp/tmp.hn0PBn3aGx/.cml/cml-6xoxrssodk-1rrodbwp-hli9n6jg/_work/_temp/_github_workflow":"/github/workflow" --entrypoint "tail" iterativeai/cml:0-dvc
215e811a3e2b95ada680b3f3db404ac68abec62295af2daf3a516db6e0d4099a
/usr/bin/docker start 215e811a3e2b95ada680b3f3db404ac68abec62295af2daf3a516db6e0d4099a
Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown
Error: failed to start containers: 215e811a3e2b95ada680b3f3db404ac68abec62295af2daf3a516db6e0d4099a
Error: Docker start fail with exit code 1
from a workflow setup like:
name: model-style-train-on-manual_call
on:
  workflow_dispatch:
    inputs:
      model_name:
        description: 'Hugging Face model name to use for training'
        required: true
        default: 'euclaise/gpt-neox-122m-minipile-digits'
jobs:
  launch-runner:
    runs-on: ubuntu-latest
    permissions:
      contents: write
      actions: write
    steps:
      - uses: actions/setup-node@v3
        with:
          node-version: '16'
      - uses: actions/setup-python@v4
        with:
          python-version: '3.x'
      - uses: actions/checkout@v3
      - uses: iterative/setup-cml@v2
      - name: Deploy runner on EC2
        env:
          REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
          AWS_ACCESS_KEY_ID: ${{ secrets.CML_AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.CML_AWS_SECRET_ACCESS_KEY }}
        run: |
          cml runner launch \
            --cloud=aws \
            --cloud-hdd-size=256 \
            --cloud-region=us-west-2 \
            --cloud-type=g6e.xlarge \
            --cloud-gpu=v100 \
            --labels=cml-gpu
  run:
    needs: launch-runner
    runs-on: [self-hosted, cml-gpu]
    container:
      image: docker://iterativeai/cml:0-dvc3-base1-gpu
      options: --gpus all
    timeout-minutes: 40000
    permissions:
      contents: read
      actions: write
    steps:
      - uses: actions/setup-node@v3
        with:
          node-version: '16'
      - uses: actions/checkout@v3
      - uses: robinraju/release-downloader@v1
        with:
          tag: 'style'
          fileName: '*.jsonl'
      - name: Train models
        env:
          GITHUB_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
          REPO_TOKEN: ${{ github.token }}
          DEBIAN_FRONTEND: noninteractive
          MODEL_NAME: ${{ github.event.inputs.model_name }}
        run: |
          echo $NODE_OPTIONS
This issue could be literally anything related to GPU drivers. Please run a non-GPU workload like sleep infinity and SSH into the instance (using either these instructions or e.g. mxschmitt/action-tmate); then take a look at journalctl in case the GPU drivers failed to install, and run nvidia-smi to check whether the host detects the GPU outside the container runtime. I currently can't be of much help, but with these hints you should be able to find out what's happening.
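For example, a minimal debugging job along those lines might look like the sketch below (the job name, the tmate step, and the exact log paths are assumptions; adjust them to the AMI in use). It runs on the self-hosted runner without the GPU container, inspects the driver state on the host, and then opens an interactive shell:

  debug-gpu:
    needs: launch-runner
    runs-on: [self-hosted, cml-gpu]
    steps:
      - name: Inspect GPU driver state on the host
        run: |
          # Check whether the NVIDIA kernel module is loaded and the driver responds.
          lsmod | grep -i nvidia || echo "nvidia kernel module not loaded"
          nvidia-smi || echo "nvidia-smi failed: driver not installed or not loaded"
          # Boot-time driver installation problems usually show up in these logs.
          sudo journalctl -b --no-pager | grep -i nvidia | tail -n 50 || true
          sudo tail -n 100 /var/log/cloud-init-output.log || true
      # Keeps the job alive with an SSH-able tmate session for manual inspection.
      - uses: mxschmitt/action-tmate@v3

Because this job requests neither --gpus nor the CML GPU image, it should start even when the NVIDIA container runtime hook fails, which lets you look at the host directly.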
@OLSecret @0x2b3bfa0 is it possible to specify an AMI ID that I already have in AWS (private or public)? When launching a g4dn.xlarge instance I get CUDA 11.4 with a T4 GPU, whereas launching the same instance manually I get CUDA 12.4 with a T4 GPU.