to run an interactive bash session on the entropy cluster:
srun --partition=common --qos=1gpu4h --time=1:00:00 --gres=gpu:1 --pty /bin/bash
installation on the entropy cluster (on a GPU node, not the login node):
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
benchmark installation:
cd benchmarks/pl-mteb
python -m venv env
source env/bin/activate
pip install -r requirements.txt
pip install scipy==1.10.1
pip install mteb==1.12.25
pip install pydantic==2.7.2
to configure wandb, install it with pip and run:
export WANDB_API_KEY=$(cat /path/to/secure/file)
wandb login $WANDB_API_KEY
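The two lines above can be wrapped in a small reusable function that also checks the key file is readable and keeps it private; a sketch only, the function name and error handling are illustrative:

```shell
# load_wandb_key FILE: export WANDB_API_KEY from FILE, refusing unreadable files.
load_wandb_key() {
  key_file="$1"
  if [ ! -r "$key_file" ]; then
    echo "key file not readable: $key_file" >&2
    return 1
  fi
  chmod 600 "$key_file"              # keep the API key private to your user
  WANDB_API_KEY=$(cat "$key_file")
  export WANDB_API_KEY
}
```

usage: `load_wandb_key /path/to/secure/file && wandb login "$WANDB_API_KEY"`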
to configure huggingface read/write access, install huggingface_hub (which provides huggingface-cli) with pip and run:
git config --global credential.helper store
huggingface-cli login
to submit a training job:
sbatch slurm/entropy/run_train.sh
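After submitting, you can capture the job id that sbatch prints and follow the job; a sketch with an illustrative helper name, assuming Slurm's default output file naming `slurm-<jobid>.out`:

```shell
# submit_and_follow SCRIPT: submit SCRIPT via sbatch, report the job id,
# show its queue state, and tail its log.
submit_and_follow() {
  # sbatch prints "Submitted batch job <id>"; the id is the 4th field
  jobid=$(sbatch "$1" | awk '{print $4}')
  echo "job id: $jobid"
  squeue -j "$jobid"                 # queue state of this job
  tail -f "slurm-${jobid}.out"       # stream the default Slurm output file
}
```

usage: `submit_and_follow slurm/entropy/run_train.sh`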
on entropy Titan V nodes the setting
precision: "bf16-mixed"
won't work (Volta GPUs have no bfloat16 support) and should be removed
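A sketch of the change, assuming the training config is a Lightning-style YAML with a `precision` key (the actual file layout in this repo may differ):

```yaml
# Titan V (Volta) has no bfloat16 hardware support.
# precision: "bf16-mixed"   # remove or replace this on Titan V nodes
precision: "16-mixed"       # fp16 mixed precision works on Volta tensor cores
```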
to submit a benchmark job: edit the model name in this file and run
sbatch slurm/entropy/run_pl_mteb.sh
dataset types follow the ones available here
AWS setup through SkyPilot
pip install "skypilot-nightly[aws]" boto3
You also need to create an AWS Access Key and have quota for GPU spot instances!
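One-time credential setup, assuming the AWS CLI is installed alongside the pip packages above; `aws configure` is interactive and `sky check` only reports which clouds SkyPilot can use:

```shell
aws configure   # paste the Access Key ID and Secret Access Key when prompted
sky check       # confirm SkyPilot detects working AWS credentials
```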
to run pl-mteb with SkyPilot:
sky spot launch \
  --env WANDB_API_KEY=$WANDB_API_KEY \
  --env HF_TOKEN=$HF_TOKEN \
  --env MODEL_NAME=$MODEL_NAME \
  sky/sky_run_pl-mteb.yml
by default SkyPilot uses a fairly powerful instance as the controller node, so when you only run a single spot machine with a GPU you can end up paying more for the controller than for the job itself. It can be overridden in ~/.sky/config.yaml:
jobs:
  controller:
    resources:
      cloud: aws
      region: us-east-1
      cpus: 2