tonyzhaozh/act

Suggestion: Increase sample density per epoch to lower validation loss variance


After evaluating a few different checkpoints during training, I found that the checkpoint with the best validation loss didn't necessarily produce the best result when rolled out on the robot. Sometimes the last checkpoint is better, and sometimes an earlier checkpoint with a higher validation loss is better.

I highly suspect this is due to the sampling strategy used.

https://github.com/tonyzhaozh/act/blob/dfe6c7f5ff13ecb4a9dec887f000c0e5d8afba72/utils.py#L35C4-L35C4

https://github.com/tonyzhaozh/act/blob/dfe6c7f5ff13ecb4a9dec887f000c0e5d8afba72/utils.py#L20C1-L21C37

From these two lines, it looks like an "item" in the dataset is a single timestep sampled uniformly at random from a trajectory.

I think this means that the validation set on each epoch consists of, say, 10 randomly sampled timesteps, one from each episode in the validation split.

I think this is why the validation loss variance is fairly large during training.

When running 5000 epochs, I think it's quite possible to draw a "validation set" that achieves a low loss by "luck of the draw" (literally).
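
To make the "luck of the draw" point concrete, here's a tiny synthetic sketch (made-up loss numbers, not ACT data): the mean over ~10 sampled timesteps swings a lot from epoch to epoch, so its minimum over 5000 epochs is pushed down by sampling noise rather than by real improvement.

# Synthetic illustration only -- random per-timestep "losses", not ACT data.
import numpy as np

rng = np.random.default_rng(0)
true_mean, spread = 1.0, 0.5      # assumed per-timestep loss distribution
num_epochs = 5000

def epoch_val_loss(n_timesteps):
    # mean loss over the timesteps drawn for one validation pass
    losses = rng.normal(true_mean, spread, size=n_timesteps)
    return losses.mean()

# 10 validation episodes x 1 random timestep each (current behavior)
small = np.array([epoch_val_loss(10) for _ in range(num_epochs)])
# 10 validation episodes x 8 random timesteps each (samples_per_epoch = 8)
large = np.array([epoch_val_loss(80) for _ in range(num_epochs)])

print(f"10 timesteps/epoch: std={small.std():.3f}, min over {num_epochs} epochs={small.min():.3f}")
print(f"80 timesteps/epoch: std={large.std():.3f}, min over {num_epochs} epochs={large.min():.3f}")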

To deal with this, I made a simple change in my clone: I added a samples_per_epoch parameter that draws that many samples from each episode per epoch (with replacement).

class EpisodicDataset(torch.utils.data.Dataset):
    def __init__(self, episode_ids, dataset_dir, camera_names, norm_stats, samples_per_epoch=1):
        super(EpisodicDataset).__init__()
        self.episode_ids = episode_ids
        self.dataset_dir = dataset_dir
        self.camera_names = camera_names
        self.norm_stats = norm_stats
        self.is_sim = None
        self.samples_per_epoch = samples_per_epoch  # new: samples drawn per episode per epoch
        self.__getitem__(0) # initialize self.is_sim

    def __len__(self):
        # new: each episode now contributes samples_per_epoch items per epoch
        return len(self.episode_ids) * self.samples_per_epoch

    def __getitem__(self, index):
        # new: map the expanded index range back onto the episode ids;
        # the rest of __getitem__ (random timestep draw, loading) is unchanged
        index = index % len(self.episode_ids)

Then divide the number of epochs to train by samples_per_epoch, so the total number of gradient steps stays roughly the same.
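
A minimal sketch of that bookkeeping (variable names are mine, not from the repo):

# Hypothetical bookkeeping, not code from the repo: keep the total number
# of gradient steps roughly constant when increasing samples_per_epoch.
original_num_epochs = 5000
samples_per_epoch = 8
num_epochs = original_num_epochs // samples_per_epoch   # 625
# each epoch now yields num_episodes * samples_per_epoch items, so
# num_epochs * len(dataset) stays roughly unchanged overall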

In my most recent run I used 8 samples per epoch for 625 epochs of training (100 episodes in that dataset).

I saw a big drop in the validation loss variance. Also, if I train longer, I do appear to get better-quality rollouts for lower validation losses.

If training on 50 episodes, I might try 10 samples per epoch for 500 epochs, or 16 for 300.

The best part is that you can restore the original behavior just by setting samples_per_epoch = 1.
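
As a sanity check on the indexing (illustrative numbers only, mirroring the 100-episode run above):

# Illustrative check of the modified __len__ / index mapping (not repo code).
episode_ids = list(range(100))                       # e.g. 100 training episodes
samples_per_epoch = 8
dataset_len = len(episode_ids) * samples_per_epoch   # 800 items per epoch
visits = [i % len(episode_ids) for i in range(dataset_len)]
# each episode is visited samples_per_epoch times, each visit drawing a fresh
# random timestep inside __getitem__
assert all(visits.count(e) == samples_per_epoch for e in episode_ids)
# with samples_per_epoch = 1, dataset_len == len(episode_ids) and
# i % len(episode_ids) == i, i.e. exactly the original behavior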