Use of kernel 5.4 in base AWS image
Closed this issue · 2 comments
OLSecret commented
Summary / Background
I run a g5.12xlarge (4 GPUs) with Trainer and Accelerate and hit an old-kernel warning: 5.5.0 is the recommended minimum, but the instance provides 5.4.0.
Scope
The run fails with:
Map: 100%|██████████| 4505/4505 [00:01<00:00, 3794.45 examples/s]
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
You are adding a <class 'transformers.integrations.integration_utils.WandbCallback'> to the callbacks of this Trainer, but there is already one. The current list of callbacks is
:DefaultFlowCallback
WandbCallback
0%| | 0/9950 [00:00<?, ?it/s]Bus error (core dumped)
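A quick way to confirm where the 5.4.0 comes from, assuming shell access on the runner: the container shares the EC2 host's kernel, so the version the Trainer check reports is the AMI's kernel, not anything installed inside the image.

    # Docker containers use the host kernel, so upgrading packages inside the
    # iterativeai/cml image cannot change what is reported here.
    uname -r    # e.g. 5.4.0-...-aws on the stock AMI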
I start the instance like so:
...
      - name: Deploy runner on EC2
        env:
          REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
          AWS_ACCESS_KEY_ID: ${{ secrets.CML_AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.CML_AWS_SECRET_ACCESS_KEY }}
        run: |
          cml runner launch \
            --cloud=aws \
            --cloud-region=us-west-2 \
            --cloud-gpu=v100 \
            --cloud-hdd-size=125 \
            --cloud-type=g5.12xlarge \
            --labels=cml-gpu
  run:
    needs: launch-runner
    runs-on: [ cml-gpu ]
    container:
      image: docker://iterativeai/cml:0-dvc2-base1-gpu
      options: --gpus all --network=host
    timeout-minutes: 2800 # 2 days
    permissions: write-all
    steps:
      - uses: actions/setup-node@v1
        with:
          node-version: '16'
      - uses: actions/checkout@v3
      - name: Set up Python 3.10
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Train models
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          OPENAI_SECRET_KEY: ${{ secrets.OPENAI_SK }}
        run: |
          apt update && apt install -y libc6 zip python-packaging git
          nvidia-smi
          pip install --upgrade pip
          pip install packaging torch==2.1.1 --index-url https://download.pytorch.org/whl/cu118
          pip install -r backend/requirements_training.txt
          pip install git+https://github.com/huggingface/accelerate
          pip install git+https://github.com/huggingface/transformers
          echo "# CML report" >> train_report.md
          wandb login ${{ secrets.WANDB_KEY }}
          cml comment update --watch train_report.md &
          python -m backend.management.commands.train_models_experimental
...
How do we get AWS to launch the instance with a base image that has a newer kernel for the CML container to run on?
0x2b3bfa0 commented
Hello, @OLSecret! Consider using the undocumented cml runner launch --cloud-image option to choose a more recent machine image. It accepts any valid AWS AMI identifier for the --cloud-region you've chosen.
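For illustration, a minimal sketch of that workflow's launch step with the option added. The AMI lookup and the ami-... value are placeholders, not tested: pick any image available in us-west-2 whose kernel is 5.5 or newer (Ubuntu 22.04, for example, ships 5.15) and confirm it still works with CML's provisioning.

    # Find a recent Ubuntu 22.04 AMI in the region (owner 099720109477 is Canonical).
    aws ec2 describe-images \
      --region us-west-2 \
      --owners 099720109477 \
      --filters "Name=name,Values=ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*" \
      --query "sort_by(Images, &CreationDate)[-1].ImageId" \
      --output text

    # Pass the resulting AMI ID to the runner; ami-xxxxxxxxxxxxxxxxx is a placeholder.
    cml runner launch \
      --cloud=aws \
      --cloud-region=us-west-2 \
      --cloud-type=g5.12xlarge \
      --cloud-hdd-size=125 \
      --cloud-image=ami-xxxxxxxxxxxxxxxxx \
      --labels=cml-gpu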
OLSecret commented
Nice, thank you, I will try that.