Suggestion: Increase sample density per epoch to lower validation loss variance
After evaluating a few different checkpoints during training, I found that the checkpoint with the best validation loss didn't necessarily produce the best result when rolled out on the robot. Sometimes the last checkpoint is better, and sometimes an earlier checkpoint with a higher validation loss is better.
I strongly suspect this is due to the sampling strategy used:
https://github.com/tonyzhaozh/act/blob/dfe6c7f5ff13ecb4a9dec887f000c0e5d8afba72/utils.py#L35C4-L35C4
From these two lines, it looks like an "item" in the dataset is a single timestep drawn uniformly at random from a trajectory.
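Roughly, the two lines in question (paraphrased from memory of that commit, not an exact quote):

```python
# inside EpisodicDataset.__getitem__, paraphrased:
episode_id = self.episode_ids[index]      # one item per episode index...
start_ts = np.random.choice(episode_len)  # ...at a single random timestep
```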
I think this means that the validation set on each epoch consists of, say, 10 randomly sampled timesteps, one from each episode in the validation split.
I think this is why the validation loss variance is so large during training.
When running 5000 epochs, I think it's quite possible to draw a "validation set" that achieves a low loss by "luck of the draw" (literally).
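To make the "luck of the draw" point concrete, here's a small self-contained simulation (my own illustration with made-up loss statistics, not code from the repo). It estimates a validation loss from n random timesteps, repeats that 5000 times, and reports the luckiest (minimum) estimate and the spread:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend each validation timestep has some true per-sample loss:
# mean 1.0, std 0.5 (made-up numbers, purely for illustration).
per_timestep_loss = rng.normal(1.0, 0.5, size=100_000)

for n in (10, 80):  # timesteps used to estimate validation loss per epoch
    # estimate the validation loss 5000 times ("epochs"), n samples each
    estimates = [per_timestep_loss[rng.integers(0, per_timestep_loss.size, n)].mean()
                 for _ in range(5000)]
    print(f"n={n:2d}  min={min(estimates):.3f}  std={np.std(estimates):.3f}")
```

With n=10 the minimum over 5000 draws sits well below the true mean of 1.0, so the "best" checkpoint can simply be the one that was evaluated on a lucky draw; with n=80 the estimates cluster much more tightly around the true mean.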
To deal with this, in my clone I made a simple change: I added a samples_per_epoch parameter that draws n samples per episode per epoch (with replacement).
```python
class EpisodicDataset(torch.utils.data.Dataset):
    def __init__(self, episode_ids, dataset_dir, camera_names, norm_stats, samples_per_epoch=1):
        super().__init__()
        self.episode_ids = episode_ids
        self.dataset_dir = dataset_dir
        self.camera_names = camera_names
        self.norm_stats = norm_stats
        self.is_sim = None
        self.samples_per_epoch = samples_per_epoch
        self.__getitem__(0)  # initialize self.is_sim

    def __len__(self):
        # each episode now contributes samples_per_epoch items per epoch
        return len(self.episode_ids) * self.samples_per_epoch

    def __getitem__(self, index):
        # fold the expanded index range back onto the episode list, so every
        # episode is visited samples_per_epoch times per epoch; the random
        # timestep below is drawn fresh on each visit
        index = index % len(self.episode_ids)
        # ... rest of the original __getitem__ is unchanged ...
```
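A minimal usage sketch (the episode IDs, directory, camera names, and batch size here are placeholders, not values from the repo):

```python
import numpy as np
from torch.utils.data import DataLoader

# hypothetical setup: 100 training episodes, 8 samples per episode per epoch
train_ids = np.arange(100)
train_dataset = EpisodicDataset(train_ids, dataset_dir='data/my_task',
                                camera_names=['top'], norm_stats=norm_stats,
                                samples_per_epoch=8)
print(len(train_dataset))  # 100 * 8 = 800 items per epoch instead of 100

train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
```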
Then divide the number of epochs to train for by samples_per_epoch, so the total number of samples drawn (and gradient steps taken) stays the same.
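Concretely, the bookkeeping is just this (my numbers, matching the run below):

```python
samples_per_epoch = 8
baseline_epochs = 5000  # what I would have run with the old 1-sample scheme
num_epochs = baseline_epochs // samples_per_epoch
print(num_epochs)       # 625, same total number of samples drawn
```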
In my most recent run I used 8 samples per epoch for 625 epochs of training (100 episodes in that dataset).
I saw a big drop in the validation loss variance. Also, when I train longer, lower validation losses do appear to correspond to better-quality rollouts.
If training on 50 episodes, I might try 10 samples per epoch for 500 epochs, or 16 for 300.
The best part is that you can restore the original behavior just by setting samples_per_epoch = 1.