gmberton/CosPlace

Error when trying to use own dataset

MysteryHS opened this issue · 4 comments

Hello,
We are trying to use your project with our own dataset, but we get an error when launching the training phase.
You can see a sample of our data below:

[image: sample of the dataset]

And the error message is the following:

2023-02-20 15:33:13   train.py --dataset_folder datasets/processed --backbone ResNet50 --fc_output_dim 128 --min_images_per_class 0 --groups_num 1
2023-02-20 15:33:13   Arguments: Namespace(L=2, M=10, N=5, alpha=30, augmentation_device='cuda', backbone='ResNet50', batch_size=32, brightness=0.7, classifiers_lr=0.01, contrast=0.7, dataset_folder='datasets/processed', device='cuda', epochs_num=50, fc_output_dim=128, groups_num=1, hue=0.5, infer_batch_size=16, iterations_per_epoch=10000, lr=1e-05, min_images_per_class=0, num_workers=8, positive_dist_threshold=25, random_resized_crop=0.5, resume_model=None, resume_train=None, saturation=0.7, save_dir='default', seed=0, test_set_folder='datasets/processed/test', train_set_folder='datasets/processed/train', use_amp16=False, val_set_folder='datasets/processed/val')
2023-02-20 15:33:13   The outputs are being saved in logs/default/2023-02-20_15-33-13
2023-02-20 15:33:13   Train only layer3 and layer4 of the ResNet50, freeze the previous ones
2023-02-20 15:33:14   There are 1 GPUs and 2 CPUs.
2023-02-20 15:33:15   Using cached dataset cache/processed_M10_N5_mipc0.torch
2023-02-20 15:33:15   Using 1 groups
2023-02-20 15:33:15   The 1 groups have respectively the following number of classes [28]
2023-02-20 15:33:15   The 1 groups have respectively the following number of images [179]
2023-02-20 15:33:17   Validation set: < val - #q: 466; #db: 1742 >
2023-02-20 15:33:17   Test set: < test - #q: 9990; #db: 2566 >
2023-02-20 15:33:17   Start training ...
2023-02-20 15:33:17   There are 28 classes for the first group, each epoch has 10000 iterations with batch_size 32, therefore the model sees each class (on average) 11428.6 times per epoch
/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py:554: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
  0%|                                                                     | 0/10000 [00:00<?, ?it/s]
2023-02-20 15:33:19   
Traceback (most recent call last):
  File "/content/gdrive/.shortcut-targets-by-id/1wvpt1FfBODh8ezpNJABgMlxsVSt4h7DU/processed/LIC/CosPlace/commons.py", line 21, in __next__
    batch = next(self.dataset_iterator)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1306, in _next_data
    raise StopIteration
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 115, in <module>
    images, targets, _ = next(dataloader_iterator)
  File "/content/gdrive/.shortcut-targets-by-id/1wvpt1FfBODh8ezpNJABgMlxsVSt4h7DU/processed/LIC/CosPlace/commons.py", line 24, in __next__
    batch = next(self.dataset_iterator)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1306, in _next_data
    raise StopIteration
StopIteration

2023-02-20 15:33:19   Experiment finished (with some errors)

We have tried using the small dataset, and training starts fine with it. Do you know where the error could be coming from?

Hello, your dataset is quite small (only 179 training images), so CosPlace is not the most suitable method for training on it.
Anyway, if you really want to use CosPlace, I think your error happens because the number of classes is smaller than the batch size. Can you try running the experiment again with --batch_size=28 (28 is the number of classes in your dataset)?
Also, I would advise you to reduce the number of iterations per epoch. You can try a very low value, such as --iterations_per_epoch=10, to see what happens.
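
To put the suggested values in perspective, here is a small arithmetic sketch. It assumes the "times per epoch" figure in the log is computed as iterations per epoch times batch size divided by the number of classes, which does reproduce the logged 11428.6. The function name is hypothetical and not part of CosPlace.

```python
def times_each_class_seen(iterations_per_epoch, batch_size, num_classes):
    """Average number of times each class is sampled during one epoch.

    Assumed formula, chosen because it matches the figure in the log.
    """
    return iterations_per_epoch * batch_size / num_classes


# Values from the log: 10000 iterations, batch size 32, 28 classes.
print(round(times_each_class_seen(10000, 32, 28), 1))  # 11428.6

# Suggested values: 10 iterations, batch size 28, 28 classes.
print(times_each_class_seen(10, 28, 28))  # 10.0
```

With the default settings each of the 28 classes would be revisited over eleven thousand times in a single epoch. With the suggested settings that drops to about ten.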

Why should the batch size be set to the number of classes?
Why do we need to reduce the number of iterations per epoch?

The dataset is implemented so that its length equals the number of classes. When using SF-XL, you could set the length to any large number and it wouldn't matter. However, when the number of classes is small (smaller than the batch size), the dataloader is unable to prepare even one batch.
The iterations per epoch should be reduced because training for 10k iterations on 179 images would clearly lead to overfitting.
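
The failure can be reproduced in a few lines of plain Python. This is only a sketch, not the CosPlace code: it assumes incomplete batches are discarded (as PyTorch's DataLoader does with drop_last=True) and that the wrapper in commons.py retries once after StopIteration, which is what the two tracebacks (lines 21 and 24) suggest. The names make_batches, InfiniteLoader, and first_batch_size are hypothetical.

```python
def make_batches(num_items, batch_size, drop_last=True):
    """Yield batches of indices from a dataset with num_items entries."""
    indices = list(range(num_items))
    for start in range(0, num_items, batch_size):
        batch = indices[start:start + batch_size]
        if drop_last and len(batch) < batch_size:
            return  # an incomplete batch is discarded (assumed behaviour)
        yield batch


class InfiniteLoader:
    """Hypothetical wrapper that restarts the loader once it is exhausted."""

    def __init__(self, num_items, batch_size):
        self.num_items = num_items
        self.batch_size = batch_size
        self.iterator = make_batches(num_items, batch_size)

    def __iter__(self):
        return self

    def __next__(self):
        try:
            return next(self.iterator)
        except StopIteration:
            # Restart and retry. If the fresh iterator is also empty,
            # the second StopIteration propagates to the caller.
            self.iterator = make_batches(self.num_items, self.batch_size)
            return next(self.iterator)


def first_batch_size(num_classes, batch_size):
    """Return the size of the first batch, or None if no batch can be built."""
    try:
        return len(next(InfiniteLoader(num_classes, batch_size)))
    except StopIteration:
        return None


print(first_batch_size(28, 32))  # None: 28 classes cannot fill a batch of 32
print(first_batch_size(28, 28))  # 28: exactly one full batch per pass
```

Under these assumptions, any batch size up to 28 would yield at least one batch, so 28 is the largest value that works for this dataset.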

Thank you for your help.