ucla-mobility/V2V4Real

IndexError: list index out of range, during training

Closed this issue · 8 comments

Hi,

So far I am able to run the test command and get the same object detection AP numbers using the pre-trained models.
However, I get an IndexError related to the data loader when training the model from scratch.

When running the distributed training command, I get an IndexError immediately, before the first training step:
| distributed init (rank 1): env://
| distributed init (rank 2): env://
| distributed init (rank 0): env://
| distributed init (rank 3): env://
-----------------Dataset Building------------------
Traceback (most recent call last):
  File "opencood/tools/train.py", line 207, in <module>
    main()
  File "opencood/tools/train.py", line 49, in main
    shuffle=False)
  File "/opt/conda/envs/v2v4real/lib/python3.7/site-packages/torch/utils/data/distributed.py", line 91, in __init__
    self.num_samples = math.ceil(len(self.dataset) / self.num_replicas)  # type: ignore[arg-type]
  File "/home/eddy/V2V4Real/opencood/data_utils/datasets/basedataset.py", line 104, in __len__
    return self.len_record[-1]
IndexError: list index out of range

When running the single-GPU training command, training completes one epoch. I then get a similar IndexError as soon as the validation data loader is used after that epoch:
Training start
learning rate 0.0002000
[epoch 0][1776/1776], || Loss: 2.3861 || Conf Loss: 0.4103 || Loc Loss: 1.9757: 100%|█| 1776/1776 [3
Traceback (most recent call last):
  File "opencood/tools/train.py", line 207, in <module>
    main()
  File "opencood/tools/train.py", line 189, in main
    for i, batch_data in enumerate(val_loader):
  File "/opt/conda/envs/v2v4real/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 438, in __iter__
    return self._get_iterator()
  File "/opt/conda/envs/v2v4real/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 384, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/opt/conda/envs/v2v4real/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1086, in __init__
    self._reset(loader, first_iter=True)
  File "/opt/conda/envs/v2v4real/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1119, in _reset
    self._try_put_index()
  File "/opt/conda/envs/v2v4real/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1353, in _try_put_index
    index = self._next_index()
  File "/opt/conda/envs/v2v4real/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 642, in _next_index
    return next(self._sampler_iter)  # may raise StopIteration
  File "/opt/conda/envs/v2v4real/lib/python3.7/site-packages/torch/utils/data/sampler.py", line 237, in __iter__
    sampler_iter = iter(self.sampler)
  File "/opt/conda/envs/v2v4real/lib/python3.7/site-packages/torch/utils/data/sampler.py", line 76, in __iter__
    return iter(range(len(self.data_source)))
  File "/home/eddy/V2V4Real/opencood/data_utils/datasets/basedataset.py", line 104, in __len__
    return self.len_record[-1]
IndexError: list index out of range
[epoch 0][1776/1776], || Loss: 2.3861 || Conf Loss: 0.4103 || Loc Loss: 1.9757: 100%|█| 1776/1776 [3

I wonder whether there are some problems related to the data loader or the reinitialize() function?
Thanks!
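
Both tracebacks end in the same expression, self.len_record[-1], which can only raise this error when len_record is an empty list. The sketch below reproduces that failure mode in isolation. It assumes len_record holds cumulative frame counts per scenario folder; FrameIndex and safe_len are hypothetical names for illustration and are not part of the V2V4Real code base.

```python
from typing import List


class FrameIndex:
    """Hypothetical stand-in for a dataset that tracks cumulative frame counts."""

    def __init__(self, frames_per_scenario: List[int]) -> None:
        # len_record[i] holds the total number of frames in scenarios 0..i.
        self.len_record: List[int] = []
        total = 0
        for count in frames_per_scenario:
            total += count
            self.len_record.append(total)

    def __len__(self) -> int:
        # Same expression as in the traceback; it fails when len_record is empty.
        return self.len_record[-1]


def safe_len(index: FrameIndex) -> int:
    """Return the dataset length, treating an empty record as zero frames."""
    return index.len_record[-1] if index.len_record else 0


populated = FrameIndex([3, 4])  # two scenarios were found
empty = FrameIndex([])          # no scenarios were found

try:
    len(empty)
    empty_len_error = None
except IndexError as exc:
    empty_len_error = str(exc)
```

If this assumption holds, the error points at a dataset directory in which no scenarios were found rather than at the data loader or sampler themselves.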

Hi, I tested the code on several of my workstations and it works fine. Are you able to run vis_data_sequence.py on both the train and val datasets?

Hi,
Yes, I am able to run vis_data_sequence.py on both the train and val datasets.
I can see the point clouds and bounding boxes in Open3D.
The train dataset has 7105 samples and the val set has 1994 samples.
The code vis_data_sequence.py does not terminate by itself due to the while True infinite loop at https://github.com/ucla-mobility/V2V4Real/blob/main/opencood/visualization/vis_utils.py#L695.
So I think my datasets and the data paths in the related config yaml files are correct.
Thanks!

The value of "validate_dir" in the config YAML file should be "test" instead of "validate".

Hi onepeachbiubiubiu,
Thank you very much! This solves my problem.
My training config had the incorrect validate_dir setting as you pointed out.
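
For anyone hitting the same error, a small pre-flight check along these lines may catch a wrong directory setting before training starts. It is only a sketch: validate_dir is the key named above, while the root_dir key, the folder names, and the one-subfolder-per-scenario layout are assumptions, and find_empty_data_dirs is a hypothetical helper, not something provided by the repository.

```python
import os
import tempfile
from typing import Dict, List


def find_empty_data_dirs(paths: Dict[str, str]) -> List[str]:
    """Return the config keys whose directory is missing or has no subfolders."""
    bad_keys = []
    for key, path in paths.items():
        if not os.path.isdir(path):
            bad_keys.append(key)
            continue
        # Assumed layout: each scenario lives in its own subfolder.
        has_scenario = any(
            os.path.isdir(os.path.join(path, name)) for name in os.listdir(path)
        )
        if not has_scenario:
            bad_keys.append(key)
    return bad_keys


# Demo on a throwaway directory tree that only has train/ and test/ splits.
with tempfile.TemporaryDirectory() as data_root:
    os.makedirs(os.path.join(data_root, "train", "scenario_0"))
    os.makedirs(os.path.join(data_root, "test", "scenario_0"))

    wrong = {
        "root_dir": os.path.join(data_root, "train"),
        "validate_dir": os.path.join(data_root, "validate"),
    }
    fixed = dict(wrong, validate_dir=os.path.join(data_root, "test"))

    wrong_result = find_empty_data_dirs(wrong)
    fixed_result = find_empty_data_dirs(fixed)
```

In the demo, the config that points validate_dir at a "validate" folder is flagged, and the one that points it at "test" passes.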

Hi eddyhkchiu,
I hope I'm not disturbing you. How did you solve the Open3D visualization problem mentioned in question 10? Looking forward to your reply.
Thank you~

Hi YuJiXYZ,

Originally I was using my Mac to SSH into a Google Cloud Compute Engine instance, with XQuartz for visualization. This approach does not work for Open3D, which requires OpenGL 4.1, while the Mac only supports OpenGL 2.1 in general.

My solution was to set up a remote desktop on the Google Cloud Compute Engine instance by following this guide: https://cloud.google.com/architecture/chrome-desktop-remote-on-compute-engine

Hope this helps.

Ok, thank you. (●'◡'●)

Hi @eddyhkchiu,
When I run the visualization program, I encounter the "list index out of range" error. I am using the OPV2V-format V2V4Real dataset, and I have changed the path in the YAML file to the absolute path of the test dataset. The error occurs at this line of code: https://github.com/ucla-mobility/V2V4Real/blob/main/opencood/visualization/vis_utils.py#L695. While debugging, I found that the variable aabbs, used below this line, is an empty list. I am not sure why this happens. I hope you can help!