If you want to start a new W&B project on the cluster quickly, do the following:
- Read and follow the "Installation" paragraph.
- Copy the sample config file from "Configuration" to ~/.cluster.json and modify the host list to match your cluster structure.
- Go to the "Getting started" paragraph.
It uses SSH to run things on multiple machines without any specific cluster management software.
The project should be somewhere within your home folder (could be in a subfolder). The tool synchronizes the files of the project with all the machines in the cluster and runs specific commands on them.
It is also GPU-aware: it checks which GPUs are free on each machine and runs Weights & Biases clients only on those.
It also supports fast setup of the environment on all the machines in parallel.
It requires Python 3.6 or newer.
pip3 install -r requirements.txt
Check out the repository and put the bin folder in your PATH.
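As a rough sketch, the installation might look like this (the repository URL and folder name are placeholders; adjust them to your checkout, and add the export line to your shell profile to make it permanent):
git clone <repository-url> cluster_tool   # URL and folder name are placeholders
cd cluster_tool
pip3 install -r requirements.txt
export PATH="$PATH:$(pwd)/bin"   # put the bin folder on your PATH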
You need a config file. It can be called cluster.json or .cluster.json.
It should be either in your home directory (global) or your local project directory (local).
You can also mix the two: store global settings in ~/.cluster.json and project-specific settings in the project directory.
An example global configuration file:
{
"hosts": [
"v02.idsia.ch",
"v03.idsia.ch",
"v01.idsia.ch",
"kratos.idsia.ch",
"minsky.idsia.ch",
"asdf",
"skynet.supsi.ch",
"x01.idsia.ch",
"x02.idsia.ch",
"x03.idsia.ch",
"x04.idsia.ch",
"x05.idsia.ch",
"nikola.idsia.ch",
"venus.idsia.ch"
],
"sync": {
"exclude": [".git*"]
},
"wandb" : {
"apikey": "YOUR_API_KEY_FROM_WANDB"
},
"envs": {
"asdf": "LANG=en_US.utf-8"
},
"path": ["~/.local/bin"],
"wandb_ckpt_path": "wandb/*${id}*/files/checkpoint",
"resume_command": "--restore ${ckpt}",
"host_config": {
"x01.idsia.ch": {
"user": "username",
"key": "~/.ssh/id_rsa"
}
}
}
Some fields:
- hosts: a list of machines you want to use. The script will match prefixes against this list when specifying where to run.
- host_config: configuration for individual hosts. Every field is optional.
  - hostname: the full name of the specific host
  - user: username
  - key: the SSH key to use
- wandb: Weights & Biases config.
  - apikey: W&B API key that will be passed to all experiment runs
  - project: which project to use
  - add_name_argumet: bool, whether to create a new argument, called name, which is the same as the name of the sweep
- envs: machine-specific environment variables added to each executed command. Dict of hostnames and the corresponding env. If you want to use it on all hosts, specify "all".
- setup: list of bash commands to execute when running ct setup
- gpu_blacklist: blacklist specific GPUs on specific machines
- commands: override specific commands on specific machines
- sync: file synchronization configuration
  - exclude: list of files to exclude. Default [".git*"]
  - use_gitignore: whether to ignore files in .gitignore when synchronizing. True by default.
  - extra: list of additional files/directories to synchronize
- path: list of strings. Adds extra entries to the PATH on the host.
- paths: list of paths for individual machines. Overrides the default path.
  - hostname: the target hostname. The argument is a list of strings (the path).
- bin_dir: directory where to put the helper scripts. Machine specific. Dict of hostnames and the corresponding directory. By default ~/.local/bin.
- wandb_ckpt_path: the relative path of W&B checkpoints. It can contain asterisks and a special ${id} string, which will be replaced by the run id when loading the checkpoint. Default: wandb/*${id}*/files/checkpoint
- resume_command: the arguments to pass when resuming from a checkpoint. The special string ${ckpt} will be replaced by the checkpoint name. Default: --restore ${ckpt}
- nosync_if_exists: detects if the tool is used on the target node itself and disables syncing if the current dir is under target_dir. A dict of hostname: path entries; the tool checks whether this path exists to determine if it is running locally or not.
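For example, a hypothetical nosync_if_exists entry, assuming the hostname-to-path mapping described above (the path here is just a placeholder):
"nosync_if_exists": {
    "kratos.idsia.ch": "~/projects/my_project"
}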
Example local config file:
{
"wandb" : {
"project": "rnn_generalization_test_release"
},
"setup": [
"pip3 install -U --user cython",
"pip3 install -U --user tqdm",
"pip3 install -U --user psutil",
"pip3 install -U --user matplotlib",
"pip3 install -U --user tensorboard",
"pip3 install -U --user future",
"pip3 install -U --user filelock",
"pip3 install -U --user setproctitle",
"pip3 install -U --user wandb",
"pip3 install -U --user dataclasses",
"pip3 install -U --user pillow",
"pip3 install -U --user torchvision"
]
}
If you have your own account but are also part of an organization, you might want to specify in which one to run the sweeps. In that case, just add your entity (username or organization name) in front of wandb.project in your config:
"wandb" : {
"project": "csordas/rnn_generalization_test_release"
},
You can always overwrite the project temporarily using the -p or --project argument.
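For example (the project name is a placeholder, and the flag is assumed to go before the subcommand, like the other options in this README):
ct -m kratos -p my_other_project wandb sweep sweeps/test.yaml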
Example: to blacklist specific GPUs on specific machines, add to your config
"gpu_blacklist": {
"kratos.idsia.ch": [0,1],
"v01.idsia.ch": [2]
}
All commands are checked against the command override list before execution. This enables you to use the same commands on different machines even when they need to be mapped to different things.
Example: add to your config
"commands": {
"asdf": {
"pip3": "~/.local/bin/pip3.6",
"python3": "python3.6"
},
"kratos": {
"pip": "pip3",
"python": "python3"
}
}
All commands can have a target machine, specified by argument -m. It receives a list of machines, with optional GPU specification (used just for training). To run a command on multiple machines, look at the following example:
ct -m kratos,v01 run ls
If argument -m is not used, all machines will be used.
By default the scripts autodetect free GPUs on the target machines.
The GPU selection can be overridden on a per-machine basis:
ct -m 'kratos{0;1;2},v01{0-2}' run ls
You can specify either a list of GPUs, separated by ;, or a range of them, specified by -.
Normally the script doesn't allow running on GPUs that already have jobs running. However, in certain situations some jobs might be small enough that one wants to run multiple jobs on a GPU which is already used by some other process. This can be enforced by adding the -FGPU flag.
For example:
ct -m kratos -FGPU wandb sweep task.yaml
Run ct setup in your project directory.
This will copy your SSH ID to all of the machines, so you might be asked to enter your password many times. If you don't yet have an SSH key, it will auto-generate one.
Example:
ct -m kratos wandb sweep sweeps/test.yaml
Optionally, you can also specify a custom sweep name:
ct -m kratos wandb sweep CoolSweep sweeps/test.yaml
If the name is not specified, it will use the name of the yaml file instead (in the example above it will be 'test').
It will automatically synchronize your files with the target machines (see Synchronization below).
A bit off topic, but here is an example sweep yaml for faster setup:
program: main.py
command:
- ${env}
- python3
- ${program}
- ${args}
method: grid
metric:
name: validation/mean_accuracy
goal: maximize
parameters:
log:
value: wandb
profile:
distribution: categorical
values:
- trafo_scan
- scan
scan.train_split:
distribution: categorical
values:
- length
- jump
- turn_left
analysis.enable:
value: 0
stop_after:
value: 25000
mask_loss_weight:
value: 3e-5
mask_lr:
value: 1e-2
sweep_id_for_grid_search:
distribution: categorical
values:
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
ct -m kratos wandb agent <sweep id>
You can copy the command from the W&B dashboard and prepend ct to it.
It will automatically synchronize your files with the target machines (see Synchronization below).
Append -pg <number> to your starting command (-pg = per GPU). For example:
ct -m kratos -pg 2 wandb sweep sweeps/test.yaml
Use multiple GPUs on a single machine or cluster.
Append -mgpu <number> to your starting command. For example:
ct -m kratos -mgpu 2 wandb sweep sweeps/test.yaml
The jobs are started as follows:
- On the head node/master GPU (rank 0), wandb agent is called.
- On the rest of the nodes/GPUs, the program is called without any arguments. Which command is called is extracted from the W&B sweep config, ignoring the environment variables and arguments. It's the responsibility of the training script to synchronize information between the head node and the rest of the workers.
- In the case of a non-SLURM run, environment variables similar to torchrun's are set, so code that works with torchrun should work by default with this script (see the sketch below).
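A minimal sketch of how a training script might pick up these torchrun-style variables; the exact set exported by the tool is an assumption based on the torchrun convention (RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT):
import os
import torch.distributed as dist

def init_distributed():
    # torchrun-style environment variables; assumed to be set by the launcher
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    local_rank = int(os.environ.get("LOCAL_RANK", 0))

    if world_size > 1:
        # MASTER_ADDR / MASTER_PORT are read from the environment by init_method="env://"
        dist.init_process_group(backend="nccl", init_method="env://",
                                rank=rank, world_size=world_size)
    return rank, world_size, local_rank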
Use multiple parallel agents. It allocates free GPUs for all of them and runs that many agents in parallel.
Append -n_runs <number> or -r <number> to your starting command. If you do not specify this, the number of runs will be automatically set to cover all variants in the yaml file. For example:
ct -m kratos -r 4 wandb sweep sweeps/test.yaml
So far this works only with SLURM.
ct -s -m daint wandb resume idsia/lm/3fp2dk2a -mgpu 32
It will resume all the crashed experiments. If you want to resume all of them, even those flagged "finished", add the -f argument.
Use the -c <number> argument. For example, if you want to run 10 trainings with 4 parallel agents, each of them running on 2 GPUs (which means 2 different nodes on daint):
ct -s -m daint -r 4 -c 10 -mgpu 2 wandb sweep sweeps/test.yaml
It will use 8 GPUs in parallel.
Weights & Biases stores all files both locally and on the remote server. The local files can take up a lot of space. You can easily clear them with the following command:
ct -m kratos wandb cleanup
Note: This command is safe to run even while sweeps are running on the target machine. It will check for the running sweeps and will not remove their folders in the W&B cache.
By default, the command assumes that the W&B directory is named wandb and is relative to the current dir. You can specify alternative paths with an extra argument:
ct -m kratos wandb cleanup ~/wandb
Weights & Biases runs sometimes show a "crashed" status. If you check the log, there are no errors; it just stops at some point. This is due to an internal crash of the synchronization mechanism inside the W&B agent. It can also happen if the internet stops working for a while. In this case, all the logs are stored in a local directory and can be synchronized with the server. Run
ct wandb sync_crashed
to attempt to fix them. In my experience it works 99% of the time.
It is also possible to specify which sweep you want to synchronize by
ct wandb sync_crashed <sweep id or name>
where <sweep id or name> can be either the sweep id (the 8 random characters identifying the sweep) or the user-readable name shown in the list of sweeps.
ct -m kratos,v01 copy
It will copy your current working directory to all the target machines. It uses rsync, so only the modified files will be transmitted. This ensures that your code on the target machine is in perfect sync with your local one.
You might store some files outside of the project directory. You can add them explicitly to the synchronization list by adding to your config:
"sync" : {
"extra": ["~/pretrained"]
},
You can copy back output directories from multiple machines (like Synchronization, just the other way around):
ct -m kratos,v01 gather output
This will download and merge the content of the output folder from the listed machines (into the local output folder).
If file names might be in conflict (the same name on multiple machines), then you can use the -pf argument to add the host name as a prefix to them.
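For example (the placement of -pf before the subcommand is an assumption, following the pattern of the other flags in this README):
ct -m kratos,v01 -pf gather output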
ct -m kratos,v01 run 'ls -l'
ct -m kratos,v01 sudo whoami
ct -m kratos,v01 screen run 'python3 main.py'
Sometimes processes can get stuck on the GPUs and keep using memory. In these cases nvidia-smi doesn't show them, but nvidia-smi --query-compute-apps=pid,name --format=csv does. However, the PIDs are invalid. It turns out that these processes usually don't have parent processes, which can be used to detect them. You can list the PGIDs of these processes with ct list_phantom. You can also use it in conjunction with -m.
For example: ct -m kratos list_phantom
See the section on "Listing PGIDs of phantom processes" for more details on what phantom processes are.
ct -m kratos kill_phantom
Note: Killing other user's processes requires sudo.
If something doesn't work, try adding the -d argument. It will display all bash commands used to do the specific task. Running them (or parts of them) manually can help you figure out what the problem is.
For example:
ct -m kratos,v01 -d wandb agent <run id>
Commands supported on SLURM clusters work exactly like the local ones, except that the -s/--slurm switch should be passed so that no command is accidentally run on the cluster. Currently supported commands are copy, wandb sweep and wandb agent. The expected duration can also be passed to wandb commands in the form -t hh:mm:ss (defaults to 23:59:00).
For example, to run each configuration in a sweep on a node, run:
ct -s -m daint wandb sweep sweep.yaml
For example, to run a sweep on 20 nodes for 10 hours:
ct -s -m daint wandb sweep sweep.yaml -r 20 -t 10:00:00
In order for SLURM to work, it needs additional entries in the cluster.json. The SLURM head node should not be listed under the "hosts" array, but under a separate "slurm" dict. For example:
"slurm": {
"daint": {
"target_dir": "$SCRATCH",
"modules": ["daint-gpu", "PyExtensions", "PyTorch"],
"account": "your_account",
"cscs_auth": {
"username": "<username>",
"password": "<password>",
"otp_secret": "<otp secret>"
}
}
}
Alternative example, with multiple machines with different GPU types:
"slurm": {
"sc": {
"template": "stanford",
"target_dir": "/some/common/path/slurm",
"machines":[
"bigmachine",
"smallmachine"
],
"default_partition": "standard",
"default_cpus_per_gpu": 8,
"default_mem_per_gpu": 16,
"partition_map": {
"standard": "small,big-standard",
"high": "small-hi,big-hi"
},
"gpu_map":{
"titanxp": ["smallmachine"],
"a100": ["bigmachine"]
}
}
},
Here, sc is the hostname of the node which is used to submit the SLURM commands, and bigmachine and smallmachine are the names of individual machines in SLURM.
Obligatory arguments (separately for each target):
- target_dir: the local directory to use instead of /home/username. Can contain remote environment variables.
- modules: which modules to load

Optional arguments:
- template: cscs or stanford. Use cscs if you have a big, homogeneous machine with 1 GPU per node. Use stanford otherwise. (The default is cscs for compatibility reasons.)
- account: under which account to schedule the runs. Run accounting remotely if you don't know what your account is.
- out_dir: directory where to save output logs. Relative to target_dir. By default out.
- cscs_auth: data for CSCS authentication that requires refreshing the SSH keys every day. You can obtain the secret from the QR code displayed when registering the 2FA, or you can figure it out from a Google Authenticator backup.
- slurm_flags: extra SLURM flags to provide to sbatch. For example: "slurm_flags": "--mem-per-cpu=16G". Default: "--constraint=gpu --switches=1" for the cscs template. To remove, specify an empty string.
- default_partition: the default partition to use
- partition_map: optional map between human-readable and real partition names (human readable name: real name)
- machines: list of strings, the short hostnames of all machines available. The -m argument will recognize these machines and will allow running on them directly.
- gpu_map: dict of GPU name (arbitrary string) and list of machines (short hostnames). If present, machines must be defined as well.
- default_cpus_per_gpu: how many CPUs to use per GPU
- default_mem_per_gpu: how many GB of memory to use per GPU
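For example, to add extra sbatch flags and a custom log directory to the daint entry shown earlier (the out_dir value is just a placeholder):
"slurm": {
    "daint": {
        "target_dir": "$SCRATCH",
        "modules": ["daint-gpu", "PyExtensions", "PyTorch"],
        "account": "your_account",
        "out_dir": "logs",
        "slurm_flags": "--mem-per-cpu=16G"
    }
}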
If you can't directly access the SLURM server, but it is behind another front-end server, you can add a similar entry to your local ~/.ssh/config:
Host ela
Hostname ela.cscs.ch
User <username>
IdentityFile ~/.ssh/id_rsa_cscs
Host daint
Hostname daint.cscs.ch
User <username>
IdentityFile ~/.ssh/id_rsa_cscs
ProxyJump ela
Here the ProxyJump line makes all connections going to daint go through ela.
Note that in this case you should not add the full URL to the cluster.json, but the name used after Host in the ~/.ssh/config file (in this case just daint).
It's recommended to use W&B for our server experiments. For that, you will need to set up clustertool with your W&B API key. Go to https://app.wandb.ai/settings, scroll to "API keys", click "New key" and copy your key into the config file. I recommend putting it in your global config file (~/.cluster.json), but it also works locally.
Create a local config file in your project folder, called cluster.json, as follows:
{
"wandb" : {
"project": "lr_tuning_test"
}
}
Now you should setup your cluster:
ct setup
You have to do this only once, or when the Python version on the hosts is updated, or when the setup section of the config file is changed.
In your training script, you want to log things to W&B. Import 'wandb' and initialize it as follows:
wandb.init()
Log your loss with wandb.log({"loss": loss}) periodically.
Finally, call wandb.join() when your training terminates.
To tune, for example, the learning rate, use an argument parser and create a new argument "lr". Parse it and use this learning rate in your optimizer:
import argparse
import torch

# parse the learning rate from the command line
args = argparse.ArgumentParser()
args.add_argument("-lr", "--lr", type=float)
opt = args.parse_args()
...
# use the parsed learning rate in the optimizer (model is defined elsewhere)
optim = torch.optim.Adam(model.parameters(), lr=opt.lr)
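Putting the pieces together, a minimal training loop sketch might look like this; the model, data, and loss are placeholders, while the wandb calls and the lr argument follow the instructions above:
import argparse
import torch
import wandb

args = argparse.ArgumentParser()
args.add_argument("-lr", "--lr", type=float, default=1e-3)
opt = args.parse_args()

wandb.init()  # picks up the sweep configuration automatically

# placeholder model and data; replace with your own
model = torch.nn.Linear(10, 1)
optim = torch.optim.Adam(model.parameters(), lr=opt.lr)

for step in range(1000):
    x = torch.randn(32, 10)
    y = torch.randn(32, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)

    optim.zero_grad()
    loss.backward()
    optim.step()

    wandb.log({"loss": loss.item()})  # log the loss periodically

wandb.join()  # finish the run (newer wandb versions use wandb.finish())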
Start a W&B sweep as follows. First create a yaml config file:
program: main.py
command:
- ${env}
- python3
- ${program}
- ${args}
method: bayes
metric:
name: loss
goal: minimize
parameters:
lr:
min: 0.001
max: 0.1
Save it as lr_tuning.yaml.
Next run 100 iterations of the training:
ct -m kratos,v01 -c 100 wandb sweep lr_tuning.yaml
Go to https://app.wandb.ai/, open your sweeps, and you should see your runs there.
Go to the W&B sweeps page. If you don't need the data, delete the sweep. It will stop all agents immediately.
If you need your data, go to the sweep, go to the sweep config on the left side menu, and cancel the sweep.
If it still doesn't stop, you can always use ct run 'killall -9 wandb' followed by ct run 'killall -9 python3', which will terminate the runs for sure. (Note: the order is important, otherwise W&B will start new runs immediately.)
Add your conda bin path to either path or paths.
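For example, in your config (the conda path below is just a guess; use the bin directory of your actual conda installation):
"path": ["~/.local/bin", "~/miniconda3/bin"]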