Error when training with TVQA dataset: AttributeError in DataLoader worker process
Issue Description
I encountered an AttributeError when attempting to train a model using the TVQA dataset. All other datasets worked fine during the training setup.
Steps to Reproduce
Ran the following command:
torchrun --rdzv_endpoint 127.0.0.1:1234 --nproc_per_node 2 train.py --model 7B \
--max_seq_len 650 --batch_size 1 --epochs 5 --warmup_epochs 2 --bias 3 --tau 100. --max_feats 10 --dataset tvqa \
--blr 7e-2 --weight_decay 0.02 --output_dir ./checkpoint/tvqa --accum_iter 4 --sub --vaq --qav
Expected Behavior
The training process should have started without any issues.
Actual Behavior
The process failed with the following error message:
[19:42:30.947615] Start training for 5 epochs
Traceback (most recent call last):
File "train.py", line 153, in <module>
main(args)
File "train.py", line 129, in main
train_stats = train_one_epoch(model, data_loader_train, optimizer, epoch, loss_scaler, args=args)
File "/home/admin-guest/Documents/multimodal-ml/iqui/Flipped-VQA/engine.py", line 19, in train_one_epoch
for data_iter_step, data in enumerate(metric_logger.log_every(data_loader, print_freq, header)):
File "/home/admin-guest/Documents/multimodal-ml/iqui/Flipped-VQA/util/misc.py", line 129, in log_every
for obj in iterable:
File "/home/admin-guest/anaconda3/envs/flippedvqa_env/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
data = self._next_data()
File "/home/admin-guest/anaconda3/envs/flippedvqa_env/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
return self._process_data(data)
File "/home/admin-guest/anaconda3/envs/flippedvqa_env/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
data.reraise()
File "/home/admin-guest/anaconda3/envs/flippedvqa_env/lib/python3.8/site-packages/torch/_utils.py", line 434, in reraise
raise exception
AttributeError: Caught AttributeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/admin-guest/anaconda3/envs/flippedvqa_env/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
data = fetcher.fetch(index)
File "/home/admin-guest/anaconda3/envs/flippedvqa_env/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/admin-guest/anaconda3/envs/flippedvqa_env/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/admin-guest/Documents/multimodal-ml/iqui/Flipped-VQA/dataloader/tvqa.py", line 173, in __getitem__
video, video_len = self._get_video(f'{vid}', start, end)
File "/home/admin-guest/Documents/multimodal-ml/iqui/Flipped-VQA/dataloader/tvqa.py", line 60, in _get_video
if len(video) > self.max_feats:
File "/home/admin-guest/anaconda3/envs/flippedvqa_env/lib/python3.8/site-packages/torch/utils/data/dataset.py", line 83, in __getattr__
raise AttributeError
AttributeError
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 14990) of binary: /home/admin-guest/anaconda3/envs/flippedvqa_env/bin/python
Traceback (most recent call last):
File "/home/admin-guest/anaconda3/envs/flippedvqa_env/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/admin-guest/anaconda3/envs/flippedvqa_env/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/admin-guest/anaconda3/envs/flippedvqa_env/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
run(args)
File "/home/admin-guest/anaconda3/envs/flippedvqa_env/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/home/admin-guest/anaconda3/envs/flippedvqa_env/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/admin-guest/anaconda3/envs/flippedvqa_env/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-11-07_19:42:35
host : iquibalh-desktop
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 14990)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Environment
OS: Ubuntu 22.04.3 LTS
Python version: 3.8.18
PyTorch version: 1.10.0+cu111
GPU: 2x NVIDIA RTX A6000 (48 GB)
Additional Context
This issue occurs only with TVQA; every other dataset starts training without problems.
Thank you for reporting this issue. We have updated our code; please pull the repository again.