mlvlab/Flipped-VQA

Error when training with TVQA dataset: AttributeError in DataLoader worker process

Closed · 1 comment

Issue Description
I encountered an AttributeError when attempting to train a model using the TVQA dataset. All other datasets worked fine during the training setup.

Steps to Reproduce
Ran the following command:

torchrun --rdzv_endpoint 127.0.0.1:1234 --nproc_per_node 2 train.py --model 7B \
--max_seq_len 650 --batch_size 1 --epochs 5 --warmup_epochs 2 --bias 3 --tau 100. --max_feats 10 --dataset tvqa \
--blr 7e-2 --weight_decay 0.02 --output_dir ./checkpoint/tvqa --dataset tvqa --accum_iter 4 --sub --vaq --qav
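
Note that `--dataset tvqa` appears twice in the command. With Python's standard `argparse`, a repeated option using the default store action keeps the last value given, so the duplicate should be harmless. The parser below is a hypothetical stand-in for illustration, not the actual parser in train.py:

```python
import argparse


def build_parser():
    # Hypothetical parser; the flag names mirror the command, the defaults are made up.
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset", default="nextqa", type=str)
    parser.add_argument("--max_feats", default=10, type=int)
    return parser


# Same flag passed twice with the same value: parsed without error.
args = build_parser().parse_args(
    ["--dataset", "tvqa", "--max_feats", "10", "--dataset", "tvqa"]
)
print(args.dataset)

# Same flag passed twice with different values: the last occurrence wins.
other = build_parser().parse_args(["--dataset", "star", "--dataset", "tvqa"])
print(other.dataset)
```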

Expected Behavior
The training process should have started without any issues.

Actual Behavior
The process failed with the following error message:

[19:42:30.947615] Start training for 5 epochs
Traceback (most recent call last):
  File "train.py", line 153, in <module>
    main(args)
  File "train.py", line 129, in main
    train_stats = train_one_epoch(model, data_loader_train, optimizer, epoch, loss_scaler, args=args)
  File "/home/admin-guest/Documents/multimodal-ml/iqui/Flipped-VQA/engine.py", line 19, in train_one_epoch
    for data_iter_step, data in enumerate(metric_logger.log_every(data_loader, print_freq, header)):
  File "/home/admin-guest/Documents/multimodal-ml/iqui/Flipped-VQA/util/misc.py", line 129, in log_every
    for obj in iterable:
  File "/home/admin-guest/anaconda3/envs/flippedvqa_env/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/home/admin-guest/anaconda3/envs/flippedvqa_env/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)
  File "/home/admin-guest/anaconda3/envs/flippedvqa_env/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/home/admin-guest/anaconda3/envs/flippedvqa_env/lib/python3.8/site-packages/torch/_utils.py", line 434, in reraise
    raise exception
AttributeError: Caught AttributeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/admin-guest/anaconda3/envs/flippedvqa_env/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/admin-guest/anaconda3/envs/flippedvqa_env/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/admin-guest/anaconda3/envs/flippedvqa_env/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/admin-guest/Documents/multimodal-ml/iqui/Flipped-VQA/dataloader/tvqa.py", line 173, in __getitem__
    video, video_len = self._get_video(f'{vid}', start, end)
  File "/home/admin-guest/Documents/multimodal-ml/iqui/Flipped-VQA/dataloader/tvqa.py", line 60, in _get_video
    if len(video) > self.max_feats:
  File "/home/admin-guest/anaconda3/envs/flippedvqa_env/lib/python3.8/site-packages/torch/utils/data/dataset.py", line 83, in __getattr__
    raise AttributeError
AttributeError

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 14990) of binary: /home/admin-guest/anaconda3/envs/flippedvqa_env/bin/python
Traceback (most recent call last):
  File "/home/admin-guest/anaconda3/envs/flippedvqa_env/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/admin-guest/anaconda3/envs/flippedvqa_env/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/admin-guest/anaconda3/envs/flippedvqa_env/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/home/admin-guest/anaconda3/envs/flippedvqa_env/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/admin-guest/anaconda3/envs/flippedvqa_env/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/admin-guest/anaconda3/envs/flippedvqa_env/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-11-07_19:42:35
  host      : iquibalh-desktop
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 14990)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
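
A possible reading of the worker traceback: the final frame is `Dataset.__getattr__` in PyTorch 1.10, which Python only calls when normal attribute lookup on the dataset object fails, and which raises a bare `AttributeError` with no message. On the line `if len(video) > self.max_feats:` the only attribute read on `self` is `max_feats`, so one plausible cause is that the TVQA dataset never sets `self.max_feats`. This is an assumption, not something confirmed in this thread. The sketch below reproduces the pattern without PyTorch; `DatasetBase`, `BrokenTVQA`, and `FixedTVQA` are hypothetical names and a simplified imitation of the real classes:

```python
class DatasetBase:
    """Simplified stand-in for torch.utils.data.Dataset as of PyTorch 1.10 (assumption)."""

    functions = {}

    def __getattr__(self, name):
        # Python calls this only after normal attribute lookup has failed.
        if name in DatasetBase.functions:
            return DatasetBase.functions[name]
        raise AttributeError  # bare exception, hence the empty message in the log


class BrokenTVQA(DatasetBase):
    def __init__(self, max_feats=10):
        pass  # hypothetical bug: max_feats is never stored on self

    def _get_video(self, video):
        # Truncate the feature sequence to at most max_feats entries.
        if len(video) > self.max_feats:
            video = video[: self.max_feats]
        return video, len(video)


class FixedTVQA(BrokenTVQA):
    def __init__(self, max_feats=10):
        self.max_feats = max_feats  # assumed fix: store the attribute


features = list(range(20))

try:
    BrokenTVQA()._get_video(features)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)  # empty string: the exception carries no message

video, video_len = FixedTVQA(max_feats=10)._get_video(features)
print(repr(error_message), video_len)
```

If this reading is right, indexing the dataset directly in the main process (outside the `DataLoader` workers) should raise the same exception with a shorter traceback.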

Environment
OS: Ubuntu 22.04.3 LTS
Python version: 3.8.18
PyTorch version: 1.10.0+cu111
GPU: 2x NVIDIA RTX A6000 (48 GB)

Additional Context
This issue occurs only with TVQA. Every other dataset started training normally.

ikodoh commented

Thank you for reporting this issue.
We have updated our code.
Please pull the repository again.