facebookresearch/d2go

lightning_train_net.py multi-gpu

austinmw opened this issue · 4 comments

Instructions To Reproduce the 🐛 Bug:

  1. Full runnable code or full changes you made:
    Mainline install with no changes at first. On the first run, lightning_train_net.py used only a single GPU even with --num-gpus 4, so I then added 'gpus': args.num_gpus to the dict returned by get_trainer_params.

  2. What exact command you run:
    python tools/lightning_train_net.py --config-file configs/faster_rcnn_fbnetv3a_C4.yaml --runner d2go.runner.lightning_task.GeneralizedRCNNTask (Adding 'gpus': 4 to Trainer args)

  3. Full logs or other relevant observations:

Traceback (most recent call last):
File "/home/ubuntu/det2/d2go/tools/lightning_train_net.py", line 224, in <module>
ret = main(
File "/home/ubuntu/det2/d2go/tools/lightning_train_net.py", line 181, in main
model_configs = do_train(cfg, trainer, task)
File "/home/ubuntu/det2/d2go/tools/lightning_train_net.py", line 122, in do_train
trainer.fit(task)
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 771, in fit
self._call_and_handle_interrupt(
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 722, in _call_and_handle_interrupt
return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
return function(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 812, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
results = self._run_stage()
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
return self._run_train()
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1353, in _run_train
self.fit_loop.run()
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 199, in run
self.on_run_start(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 221, in on_run_start
self.trainer.reset_train_val_dataloaders(self.trainer.lightning_module)
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1954, in reset_train_val_dataloaders
self.reset_val_dataloader(model=model)
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1907, in reset_val_dataloader
self.num_val_batches, self.val_dataloaders = self._data_connector._reset_eval_dataloader(
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 384, in _reset_eval_dataloader
dataloaders = self._request_dataloader(mode, model=model)
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 463, in _request_dataloader
dataloader = source.dataloader()
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 535, in dataloader
return self.instance.trainer._call_lightning_module_hook(self.name, pl_module=self.instance)
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1595, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/d2go/runner/lightning_task.py", line 333, in val_dataloader
return self._evaluation_dataloader()
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/d2go/runner/lightning_task.py", line 326, in _evaluation_dataloader
self._reset_dataset_evaluators()
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/d2go/runner/lightning_task.py", line 263, in _reset_dataset_evaluators
or self.trainer._accelerator_connector.use_ddp
AttributeError: 'AcceleratorConnector' object has no attribute 'use_ddp'
Traceback (most recent call last):
File "/home/ubuntu/det2/d2go/tools/lightning_train_net.py", line 224, in <module>
ret = main(
File "/home/ubuntu/det2/d2go/tools/lightning_train_net.py", line 181, in main
model_configs = do_train(cfg, trainer, task)
File "/home/ubuntu/det2/d2go/tools/lightning_train_net.py", line 122, in do_train
trainer.fit(task)
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 771, in fit
self._call_and_handle_interrupt(
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 722, in _call_and_handle_interrupt
return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
return function(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 812, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
results = self._run_stage()
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
return self._run_train()
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1353, in _run_train
self.fit_loop.run()
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 199, in run
self.on_run_start(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 221, in on_run_start
self.trainer.reset_train_val_dataloaders(self.trainer.lightning_module)
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1954, in reset_train_val_dataloaders
self.reset_val_dataloader(model=model)
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1907, in reset_val_dataloader
self.num_val_batches, self.val_dataloaders = self._data_connector._reset_eval_dataloader(
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 384, in _reset_eval_dataloader
dataloaders = self._request_dataloader(mode, model=model)
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 463, in _request_dataloader
dataloader = source.dataloader()
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 535, in dataloader
return self.instance.trainer._call_lightning_module_hook(self.name, pl_module=self.instance)
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1595, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/d2go/runner/lightning_task.py", line 333, in val_dataloader
return self._evaluation_dataloader()
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/d2go/runner/lightning_task.py", line 326, in _evaluation_dataloader
self._reset_dataset_evaluators()
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/d2go/runner/lightning_task.py", line 263, in _reset_dataset_evaluators
or self.trainer._accelerator_connector.use_ddp
AttributeError: 'AcceleratorConnector' object has no attribute 'use_ddp'
Traceback (most recent call last):
File "tools/lightning_train_net.py", line 224, in <module>
ret = main(
File "tools/lightning_train_net.py", line 181, in main
model_configs = do_train(cfg, trainer, task)
File "tools/lightning_train_net.py", line 122, in do_train
trainer.fit(task)
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 771, in fit
self._call_and_handle_interrupt(
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 722, in _call_and_handle_interrupt
return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
return function(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 812, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
results = self._run_stage()
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
return self._run_train()
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1353, in _run_train
self.fit_loop.run()
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 199, in run
self.on_run_start(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 221, in on_run_start
self.trainer.reset_train_val_dataloaders(self.trainer.lightning_module)
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1954, in reset_train_val_dataloaders
self.reset_val_dataloader(model=model)
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1907, in reset_val_dataloader
self.num_val_batches, self.val_dataloaders = self._data_connector._reset_eval_dataloader(
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 384, in _reset_eval_dataloader
dataloaders = self._request_dataloader(mode, model=model)
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 463, in _request_dataloader
dataloader = source.dataloader()
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 535, in dataloader
return self.instance.trainer._call_lightning_module_hook(self.name, pl_module=self.instance)
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1595, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/d2go/runner/lightning_task.py", line 333, in val_dataloader
return self._evaluation_dataloader()
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/d2go/runner/lightning_task.py", line 326, in _evaluation_dataloader
self._reset_dataset_evaluators()
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/d2go/runner/lightning_task.py", line 263, in _reset_dataset_evaluators
or self.trainer._accelerator_connector.use_ddp
AttributeError: 'AcceleratorConnector' object has no attribute 'use_ddp'
Traceback (most recent call last):
File "/home/ubuntu/det2/d2go/tools/lightning_train_net.py", line 224, in <module>
ret = main(
File "/home/ubuntu/det2/d2go/tools/lightning_train_net.py", line 181, in main
model_configs = do_train(cfg, trainer, task)
File "/home/ubuntu/det2/d2go/tools/lightning_train_net.py", line 122, in do_train
trainer.fit(task)
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 771, in fit
self._call_and_handle_interrupt(
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 722, in _call_and_handle_interrupt
return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
return function(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 812, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
results = self._run_stage()
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
return self._run_train()
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1353, in _run_train
self.fit_loop.run()
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 199, in run
self.on_run_start(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 221, in on_run_start
self.trainer.reset_train_val_dataloaders(self.trainer.lightning_module)
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1954, in reset_train_val_dataloaders
self.reset_val_dataloader(model=model)
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1907, in reset_val_dataloader
self.num_val_batches, self.val_dataloaders = self._data_connector._reset_eval_dataloader(
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 384, in _reset_eval_dataloader
dataloaders = self._request_dataloader(mode, model=model)
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 463, in _request_dataloader
dataloader = source.dataloader()
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 535, in dataloader
return self.instance.trainer._call_lightning_module_hook(self.name, pl_module=self.instance)
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1595, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/d2go/runner/lightning_task.py", line 333, in val_dataloader
return self._evaluation_dataloader()
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/d2go/runner/lightning_task.py", line 326, in _evaluation_dataloader
self._reset_dataset_evaluators()
File "/home/ubuntu/anaconda3/envs/det2/lib/python3.8/site-packages/d2go/runner/lightning_task.py", line 263, in _reset_dataset_evaluators
or self.trainer._accelerator_connector.use_ddp
AttributeError: 'AcceleratorConnector' object has no attribute 'use_ddp'

  4. Please simplify the steps as much as possible so they do not require additional resources to run, such as a private dataset.

Used COCO dataset, obtained by running: https://gist.githubusercontent.com/jss367/a8eb11e5abd6e674f35ebfbb1f0d801c/raw/365fdae70531381d4eafbc588dfa1b7c85cd389c/coco.sh

Expected behavior:

Should train on 4 GPUs without error. (I'm using an AWS EC2 instance with 4 V100 GPUs.) Note that single-GPU training worked fine.
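For readers following along, the change described in step 1 amounts to adding one key to the keyword arguments that end up in the Lightning Trainer. The sketch below assumes those arguments are a plain dict. `with_gpus` and the placeholder `max_steps` entry are hypothetical and not part of D2Go, and the real `get_trainer_params` may build its dict differently.

```python
from typing import Any, Dict


def with_gpus(trainer_params: Dict[str, Any], num_gpus: int) -> Dict[str, Any]:
    """Return a copy of the Trainer kwargs with the 'gpus' key set.

    Hypothetical helper for illustration; D2Go does not ship this function.
    """
    params = dict(trainer_params)  # copy so the caller's dict is not mutated
    params["gpus"] = num_gpus      # the key named in the report above
    return params


# Placeholder base params; the real dict comes from get_trainer_params.
base_params = {"max_steps": 1000}
multi_gpu_params = with_gpus(base_params, 4)
print(multi_gpu_params)
```

Copying the dict instead of mutating it keeps the single-GPU defaults available for comparison runs.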

Update

Downgrading to Lightning 1.5.10 fixed this. It looks like 1.6.0 includes an AcceleratorConnector refactor that caused this issue.
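One way to catch this mismatch before training starts is to check the installed Lightning version up front. The sketch below assumes, based only on this report, that releases before 1.6 work and that 1.6.0 and later do not. `parse_release`, `lightning_is_compatible`, and `check_installed_lightning` are hypothetical names, not D2Go or Lightning APIs.

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_release(version: str) -> Tuple[int, ...]:
    """Extract the leading numeric components, e.g. '1.6.0rc1' -> (1, 6, 0)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def lightning_is_compatible(version: str) -> bool:
    """Assumption from this thread: versions before 1.6 are the working ones."""
    return parse_release(version) < (1, 6)


def check_installed_lightning() -> Optional[bool]:
    """Return True/False for the installed package, or None if it is absent."""
    try:
        installed = metadata.version("pytorch-lightning")
    except metadata.PackageNotFoundError:
        return None
    return lightning_is_compatible(installed)


if __name__ == "__main__":
    print(check_installed_lightning())
```

Running the check at the top of a training script would turn the late AttributeError into an early, readable message.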

Sorry for the late reply. Yes, D2Go uses an older version of Lightning to match the internal Lightning version. It is pinned in d2go/setup.py (line 30 at commit 68934eb):

"pytorch-lightning @ git+https://github.com/PyTorchLightning/pytorch-lightning@9b011606f",

@wat3rBro I don't think that version still exists on GitHub.

Ah okay thanks