gmberton/CosPlace

Why is iter used here during training iterations instead of directly using the dataloader?

Closed this issue · 8 comments

for iteration in tqdm(range(args.iterations_per_epoch), ncols=100):


dataloader = commons.InfiniteDataLoader(groups[current_group_num], num_workers=args.num_workers,
                                        batch_size=args.batch_size, shuffle=True,
                                        pin_memory=(args.device == "cuda"), drop_last=True)

dataloader_iterator = iter(dataloader)
model = model.train()

epoch_losses = np.zeros((0, 1), dtype=np.float32)
for iteration in tqdm(range(args.iterations_per_epoch), ncols=100):
    images, targets, _ = next(dataloader_iterator)

Why not do it directly like this?


dataloader = commons.InfiniteDataLoader(groups[current_group_num], num_workers=args.num_workers,
                                        batch_size=args.batch_size, shuffle=True,
                                        pin_memory=(args.device == "cuda"), drop_last=True)

model = model.train()

epoch_losses = np.zeros((0, 1), dtype=np.float32)
for images, targets, _ in tqdm(dataloader, ncols=100):

You can use the dataloader directly, but then the loop would continue until it runs out of data (or you would need a break after N iterations). Using an iterator is more dataset-agnostic, and it makes it clearer how long the "epoch" is going to last (a full epoch over the whole dataset could take a day, so that is not an option).
Anyway, doing this

for iteration, (images, targets, _) in tqdm(enumerate(dataloader), ncols=100, total=args.iterations_per_epoch):
    if iteration >= args.iterations_per_epoch:
        break

is equivalent to this

dataloader_iterator = iter(dataloader)
for iteration in tqdm(range(args.iterations_per_epoch), ncols=100):
    images, targets, _ = next(dataloader_iterator)

I just find the second example cleaner than the first

dataloader = commons.InfiniteDataLoader(groups[current_group_num], num_workers=args.num_workers,

Why not use torch.utils.data.DataLoader directly?

It is also okay to use the standard DataLoader class, but using an InfiniteDataLoader ensures that, if there are not enough samples, iterating over it will not raise a StopIteration. This could happen when changing the parameters within the code (e.g. using higher values for --iterations_per_epoch, --batch_size, --L or --N might lead to this issue with a standard DataLoader).
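A minimal sketch of the failure mode being described, with a toy dataset and made-up numbers (not code from the repository): once a plain DataLoader iterator is exhausted, next() raises StopIteration.

import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: 20 samples, batch_size=8, drop_last=True -> only 2 usable batches
toy_dataset = TensorDataset(torch.arange(20).float().unsqueeze(1), torch.arange(20))
toy_loader = DataLoader(toy_dataset, batch_size=8, shuffle=True, drop_last=True)

iterator = iter(toy_loader)
for iteration in range(5):                # pretend iterations_per_epoch = 5
    try:
        images, targets = next(iterator)
    except StopIteration:
        # a standard DataLoader stops here; an InfiniteDataLoader restarts instead
        print(f"exhausted at iteration {iteration}")
        break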

dataloader_iterator = iter(dataloader)
for iteration in tqdm(range(args.iterations_per_epoch), ncols=100):
    images, targets, _ = next(dataloader_iterator)

So what you mean is that with this approach you can train on a portion of the data in each epoch, without having to go through all the data, and each epoch only needs args.iterations_per_epoch iterations?


dataloader = commons.InfiniteDataLoader(groups[current_group_num], num_workers=args.num_workers,
                                        batch_size=args.batch_size, shuffle=True,
                                        pin_memory=(args.device == "cuda"), drop_last=True)

dataloader_iterator = iter(dataloader)

When I use the small dataset, it has only one group containing 5,965 classes. If args.batch_size is set to 8, the dataloader yields 745 batches, meaning it can be iterated 745 times to cover all the data, so "images, targets, _ = next(dataloader_iterator)" can only be called 745 times as well. However, in the training loop args.iterations_per_epoch is set to 10,000, so 10,000 iterations are expected, while in reality only 745 are possible because next(dataloader_iterator) cannot produce anything after the 745th batch. If that is the case, your intention of training on only a portion of the data in each epoch may not be achieved.
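For reference, the 745 figure comes from integer division with drop_last=True (numbers from the paragraph above):

num_classes = 5965               # samples in the single group of the small dataset
batch_size = 8
print(num_classes // batch_size) # 745; a plain DataLoader iterator is exhausted after this many batches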

import torch


class InfiniteDataLoader(torch.utils.data.DataLoader):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Keep one underlying iterator over the dataset alive across epochs
        self.dataset_iterator = super().__iter__()

    def __iter__(self):
        return self

    def __next__(self):
        try:
            batch = next(self.dataset_iterator)
        except StopIteration:
            # The underlying iterator is exhausted: recreate it and restart from the first batch
            self.dataset_iterator = super().__iter__()
            batch = next(self.dataset_iterator)
        return batch
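As a toy illustration (made-up dataset and sizes, reusing the class above), drawing more batches than one pass over the data contains simply wraps around instead of raising StopIteration:

import torch
from torch.utils.data import TensorDataset

# 16 samples with batch_size=8 -> 2 batches per pass over the data
tiny_dataset = TensorDataset(torch.arange(16))
infinite_loader = InfiniteDataLoader(tiny_dataset, batch_size=8, shuffle=False, drop_last=True)

iterator = iter(infinite_loader)
for iteration in range(5):
    (batch,) = next(iterator)           # never raises StopIteration
    print(iteration, batch.tolist())    # the batches repeat from the start after iteration 1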

using an InfiniteDataLoader ensures that, if there are not enough samples, iterating over it will not raise a StopIteration.

When all the samples are used up, __next__ catches the StopIteration and the loader starts again from the first batch. Is that correct?

When args.iterations_per_epoch=10,000 and the dataloader yields 800 batches, then after the 800th batch the statement "images, targets, _ = next(dataloader_iterator)" in the for loop will start again from the first batch. In this case a single training loop goes over the data multiple times, and eventually one epoch completes 10,000 iterations of training.

When args.iterations_per_epoch=10,000 and the dataloader yields 20,000 batches, a single training loop will only perform 10,000 iterations, training on only a portion of the data.

This ensures that, regardless of whether the number of batches is greater or smaller than args.iterations_per_epoch, the training loop in one epoch always iterates exactly args.iterations_per_epoch times.
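For concreteness, a back-of-the-envelope check of the two cases above (numbers taken from the previous paragraphs):

iterations_per_epoch = 10_000

# Case 1: fewer batches than iterations -> the group is revisited several times per epoch
batches_case_1 = 800
print(iterations_per_epoch / batches_case_1)    # 12.5 passes over the group

# Case 2: more batches than iterations -> only part of the group is seen per epoch
batches_case_2 = 20_000
print(iterations_per_epoch / batches_case_2)    # 0.5, i.e. half of the batches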

Right?

And based on this setup, training on the 8 groups of the processed dataset is what allows CosPlace to achieve state-of-the-art performance?

All your assumptions are correct, and the answer to all your questions is yes.

Thank you for your help.