basic PyTorch repo to check cluster functionality (job submission, multi-GPU usage, ...)
make sure that the log folder exists before submitting
sbatch --job-name=NAME --output=log/%j.out --gres=gpu:1 --mem=10G subscript.sh SCRIPT_PARAMS
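Since `--output=log/%j.out` fails silently if the folder is missing, a minimal submission sequence could look like the following (NAME, SCRIPT_PARAMS, and subscript.sh are the placeholders from the command above, and sbatch requires a SLURM cluster):

```shell
# create the folder that sbatch writes %j.out files into (run once per repo)
mkdir -p log

# submit the job (only works on a machine with SLURM installed)
# sbatch --job-name=NAME --output=log/%j.out --gres=gpu:1 --mem=10G subscript.sh SCRIPT_PARAMS
```

`%j` is expanded by SLURM to the numeric job ID, so each run gets its own log file.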
interactive debugging shell
srun --time 10 --partition=gpu.debug --gres=gpu:1 --pty bash -i
not working yet
- faster training on several GPUs: data parallel
- model too large for single GPU: model parallel to split across multiple GPUs
- 1 process per GPU in DDP (DistributedDataParallel)
- same model parameters & optimizers on every process, but the data is split (DistributedSampler)
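The data split can be sketched in plain Python. This mimics what `torch.utils.data.DistributedSampler` does in the non-shuffled case (it is an illustrative re-implementation, not the actual PyTorch code): the index list is padded to a multiple of the world size, and rank r takes every world_size-th index starting at r.

```python
# Sketch of how a DistributedSampler-style split partitions indices across
# ranks (illustrative; not the real PyTorch implementation).
import math

def shard_indices(dataset_len, rank, world_size):
    # pad with wrapped-around indices so every rank gets the same count
    per_rank = math.ceil(dataset_len / world_size)
    total = per_rank * world_size
    indices = list(range(dataset_len))
    indices += indices[: total - dataset_len]  # wrap-around padding
    # rank r takes indices r, r + world_size, r + 2 * world_size, ...
    return indices[rank::world_size]

# 10 samples over 4 ranks -> each rank sees 3 indices, union covers all 10
shards = [shard_indices(10, r, 4) for r in range(4)]
```

Each process feeds its shard to its own DataLoader, so the 4 GPUs together see the whole dataset exactly once per epoch (up to the padded duplicates).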