torchmd/torchmd-net

Receiving CPUAccelerator error when clearly specifying core count

Closed this issue · 2 comments

I was trying to train a model on the QM9 dataset, starting from the checkpoint 'epoch=649-val_loss=0.0003-test_loss=0.0059.ckpt', with the following hyperparameters:

y_weight: 1.0
force_files: null
neg_dy_weight: 0.0
gradient_clipping: 40
inference_batch_size: 128
load_model: null
log_dir: /home/rgujral/torchmd/tutorial/logs/
lr: 0.0001
lr_factor: 0.8
lr_min: 1.0e-07
lr_patience: 15
lr_warmup_steps: 1000
max_num_neighbors: 64
max_z: 128
model: tensornet
ngpus: -1
num_epochs: 3000
num_layers: 3
num_nodes: 1
num_rbf: 64
num_workers: 6
output_model: Scalar
precision: 32
prior_model: Atomref
rbf_type: expnorm
redirect: false
reduce_op: add
save_interval: 10
seed: 1
splits: null
standardize: false
test_interval: 20
test_size: null
train_size: 110000
trainable_rbf: false
val_size: 10000
weight_decay: 0.0
box_vecs: null
charge: false
spin: false

When submitting a batch script that allocates 20 CPU cores and 0 GPUs, the script fails shortly after launch and leaves the following traceback in the error output file:

torchmd-train 11
sys.exit(main())

train.py 203 main
trainer = pl.Trainer(

argparse.py 70 insert_env_defaults
return fn(self, **kwargs)

trainer.py 400 init
self._accelerator_connector = _AcceleratorConnector(

accelerator_connector.py 146 init
self._set_parallel_devices_and_init_accelerator()

accelerator_connector.py 376 _set_parallel_devices_and_init_accelerator
self._devices_flag = accelerator_cls.parse_devices(self._devices_flag)

cpu.py 53 parse_devices
return _parse_cpu_cores(devices)

cpu.py 94 _parse_cpu_cores
raise TypeError("devices selected with CPUAccelerator should be an int > 0.")

TypeError:
devices selected with CPUAccelerator should be an int > 0.
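The check the traceback ends in can be sketched as a simplified re-implementation (illustrative only, not the actual Lightning source): a CPU run rejects anything that is not a positive integer, including the -1 that works for GPUs.

```python
# Simplified sketch of the validation in Lightning's cpu.py::_parse_cpu_cores
# (illustrative, not the actual library code).
def parse_cpu_cores(devices):
    """Accept only a positive int (or a string holding one)."""
    if isinstance(devices, str) and devices.strip().isdigit():
        devices = int(devices)
    if not isinstance(devices, int) or devices <= 0:
        raise TypeError("devices selected with CPUAccelerator should be an int > 0.")
    return devices
```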

Facing the same issue here. Help!

Pass ngpus=20 (or whatever number of CPUs you want to use) when training.
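In the hyperparameter file above, that means changing the ngpus entry (20 here simply matches the batch script's core allocation; use whatever your job provides):

```yaml
# Passed to the Lightning Trainer as `devices`; on a CPU-only run
# this must be a positive integer, not -1.
ngpus: 20
```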

Explanation

The "ngpus" parameter is passed directly to the Lightning Trainer object:

trainer = pl.Trainer(
strategy="auto",
max_epochs=args.num_epochs,
accelerator="auto",
devices=args.ngpus,
num_nodes=args.num_nodes,
default_root_dir=args.log_dir,
callbacks=[early_stopping, checkpoint_callback],
logger=_logger,
precision=args.precision,
gradient_clip_val=args.gradient_clipping,
inference_mode=False,
# Test-during-training requires reloading the dataloaders every epoch
reload_dataloaders_every_n_epochs=1 if args.test_interval > 0 else 0,
)

The accelerator is set to "auto", which I reckon favors GPUs when they are available.
The name "ngpus" is thus a misleading one; it should be "devices" or something along those lines.

For whatever reason, the GPU mode in the Lightning Trainer accepts "-1", meaning "use all available", but the CPU mode requires an actual positive, non-zero integer.
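A defensive workaround on the user side is to resolve the flag before building the Trainer. The helper below is hypothetical (not part of torchmd-net); it keeps -1 on the GPU path and falls back to the machine's core count on the CPU path:

```python
import os

def resolve_devices(ngpus: int, cuda_available: bool) -> int:
    """Map torchmd-net's `ngpus` flag to a value both accelerators accept.

    Hypothetical helper: Lightning's GPU path understands -1 as
    "use all GPUs", but its CPU path needs a positive int, so fall
    back to the host's core count when no GPU is available.
    """
    if cuda_available:
        return ngpus  # -1 is valid here: use every visible GPU
    if isinstance(ngpus, int) and ngpus > 0:
        return ngpus
    return os.cpu_count() or 1  # os.cpu_count() can return None
```

It could then be wired in as, e.g., `devices=resolve_devices(args.ngpus, torch.cuda.is_available())` in the Trainer call, though simply passing a positive ngpus as above is enough to fix this issue.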

Feel free to reopen if this does not solve your issue.