Receiving CPUAccelerator error when clearly specifying core count
I was trying to train a model starting from the QM9 checkpoint 'epoch=649-val_loss=0.0003-test_loss=0.0059.ckpt', using the following hyperparameters:
y_weight: 1.0
force_files: null
neg_dy_weight: 0.0
gradient_clipping: 40
inference_batch_size: 128
load_model: null
log_dir: /home/rgujral/torchmd/tutorial/logs/
lr: 0.0001
lr_factor: 0.8
lr_min: 1.0e-07
lr_patience: 15
lr_warmup_steps: 1000
max_num_neighbors: 64
max_z: 128
model: tensornet
ngpus: -1
num_epochs: 3000
num_layers: 3
num_nodes: 1
num_rbf: 64
num_workers: 6
output_model: Scalar
precision: 32
prior_model: Atomref
rbf_type: expnorm
redirect: false
reduce_op: add
save_interval: 10
seed: 1
splits: null
standardize: false
test_interval: 20
test_size: null
train_size: 110000
trainable_rbf: false
val_size: 10000
weight_decay: 0.0
box_vecs: null
charge: false
spin: false
When I create a batch script that allocates 20 CPU cores and 0 GPUs, the script fails shortly after starting and leaves the following error messages in the error output:
torchmd-train, line 11: sys.exit(main())
train.py, line 203, in main: trainer = pl.Trainer(
argparse.py, line 70, in insert_env_defaults: return fn(self, **kwargs)
trainer.py, line 400, in __init__: self._accelerator_connector = _AcceleratorConnector(
accelerator_connector.py, line 146, in __init__: self._set_parallel_devices_and_init_accelerator()
accelerator_connector.py, line 376, in _set_parallel_devices_and_init_accelerator: self._devices_flag = accelerator_cls.parse_devices(self._devices_flag)
cpu.py, line 53, in parse_devices: return _parse_cpu_cores(devices)
cpu.py, line 94, in _parse_cpu_cores: raise TypeError("devices selected with CPUAccelerator should be an int > 0.")
TypeError: devices selected with CPUAccelerator should be an int > 0.
Facing the same issue here. Help!
Pass ngpus=20 (or however many CPU cores you want to use) when training, e.g. set ngpus: 20 in the YAML configuration above instead of ngpus: -1.
Explanation
The "ngpus" parameter is passed directly to the Lightning Trainer object:
(see torchmd-net/torchmdnet/scripts/train.py, lines 203 to 217 at commit 6c42c8b)
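Roughly, that construction amounts to the following. This is a simplified sketch, not the exact train.py code; the only part that matters here is that ngpus (and num_nodes) from the config are forwarded to the Lightning Trainer:

```python
import pytorch_lightning as pl

# Simplified sketch of the Trainer construction in train.py (not the exact
# code): the "ngpus" option is forwarded to the Trainer's `devices` argument,
# while the accelerator is left on "auto".
ngpus = -1      # value taken from the YAML config above
num_nodes = 1

# On a GPU node, "auto" picks the CUDA accelerator and -1 means "all GPUs".
# On a CPU-only node, "auto" picks the CPUAccelerator, which rejects -1 with
# the TypeError shown in the traceback above.
trainer = pl.Trainer(
    accelerator="auto",
    devices=ngpus,
    num_nodes=num_nodes,
)
```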
The accelerator is set to "auto" (which, I reckon, favors GPUs when they are available).
The name "ngpus" is therefore misleading; it should be "devices" or something similar.
For whatever reason, the GPU mode in the Lightning Trainer allows passing "-1" to mean "use all available", but the CPU mode requires an actual positive, non-zero integer.
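The behaviour is easy to reproduce with Lightning alone, independently of torchmd-net. A minimal sketch (the exact error wording may differ between Lightning versions):

```python
import pytorch_lightning as pl

# Rejected at construction time by the CPU accelerator:
#   TypeError: devices selected with CPUAccelerator should be an int > 0.
# trainer = pl.Trainer(accelerator="cpu", devices=-1)

# Accepted: an explicit positive core count (here 20, matching the 20-core
# batch allocation from the report above).
trainer = pl.Trainer(accelerator="cpu", devices=20)
```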
Feel free to reopen if this does not solve your issue.