FGiuliari/Trajectory-Transformer

Data preparation

alexmonti19 opened this issue · 2 comments

Hi,
first of all congratulations for the interesting paper and thanks for sharing the code with us! :)

Mine is more of a doubt than a real issue, but I thought it could be useful to open a thread as a reference for people who may come across the same question in the future.
As I read in the paper, I see you're normalizing the input data by subtracting the mean and dividing by the standard deviation.

import numpy as np
import torch

# Per-scene mean/std of columns 2:4 (the x/y speeds) over the observed (src)
# and future (trg) parts, then averaged across scenes into a single pair.
means = []
stds = []
for i in np.unique(train_dataset[:]['dataset']):
    ind = train_dataset[:]['dataset'] == i
    means.append(torch.cat((train_dataset[:]['src'][ind, 1:, 2:4],
                            train_dataset[:]['trg'][ind, :, 2:4]), 1).mean((0, 1)))
    stds.append(torch.cat((train_dataset[:]['src'][ind, 1:, 2:4],
                           train_dataset[:]['trg'][ind, :, 2:4]), 1).std((0, 1)))
mean = torch.stack(means).mean(0)
std = torch.stack(stds).mean(0)

I've noticed that you collect the xy means and stds from the different scenes in two vectors, average those vectors to obtain single values, and then apply these averaged values to the entire dataset (which contains stacked trajectories from the different input files/scenes).

Have you tried instead to standardize the trajectories from a specific file with its own "scene-specific" mean and std, and only then stack the trajectories from different scenes?
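Concretely, what I have in mind is something like the following rough sketch (it reuses the tensors and the 2:4 speed columns from the snippet above; scene_stats is just a made-up name, not something from the repo):

import numpy as np
import torch

# Sketch: keep one (mean, std) pair per scene instead of a single averaged
# pair, and standardize each trajectory with the stats of its own scene.
scene_stats = {}
for i in np.unique(train_dataset[:]['dataset']):
    ind = train_dataset[:]['dataset'] == i
    vel = torch.cat((train_dataset[:]['src'][ind, 1:, 2:4],
                     train_dataset[:]['trg'][ind, :, 2:4]), 1)
    scene_stats[i] = (vel.mean((0, 1)), vel.std((0, 1)))

# A trajectory from scene i would then be standardized as
#   (speeds - scene_stats[i][0]) / scene_stats[i][1]
# before stacking the different scenes together.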

Cheers
Alex

Hi Alex,
normalization was one of the few problems we encountered and are still not sure about.
The main problem is that you know the mean and std of all the training data, but not of the test data.

This leads to problems when the datasets have very large differences in mean and std, for example the ETH case, where the speeds are increased with respect to the other 4 scenes.
In the SR-LSTM paper they say that the video is accelerated, so they sample the video at a higher frequency, but that leads to results that are not comparable with methods that did not take this approach (for example, they get ETH down to 0.60 or so for the MAD; using their resampled data we can also go down to something like 0.58, but since we wanted to compare with other methods we used the original Social-GAN data).

Along the way we tried different methods, so I think this is something we tried, but I don't have the numbers; you are welcome to try it and, if you want, post the results here.

One problem I see with that approach, though, is that while you can standardize on a scene-by-scene basis for training, you cannot do so for the test dataset. So you would have to use a weighted mean, or something along those lines, for the test mean and std.

By collecting the xy means and stds and averaging them, I still get some variance in the normalized inputs even during training, since not all datasets have the same mean and std.
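
Just to make the idea concrete, a count-weighted version could look something like this (a minimal sketch along the lines of the snippet above, not code that is in the repo; weighting by the number of points is only one possible choice):

import numpy as np
import torch

# Sketch: weight each scene's statistics by how many (x, y) points it
# contributes, instead of taking a plain average across scenes.
means, stds, counts = [], [], []
for i in np.unique(train_dataset[:]['dataset']):
    ind = train_dataset[:]['dataset'] == i
    vel = torch.cat((train_dataset[:]['src'][ind, 1:, 2:4],
                     train_dataset[:]['trg'][ind, :, 2:4]), 1)
    means.append(vel.mean((0, 1)))
    stds.append(vel.std((0, 1)))
    counts.append(vel.shape[0] * vel.shape[1])

w = torch.tensor(counts, dtype=torch.float32)
w = w / w.sum()
mean = (torch.stack(means) * w[:, None]).sum(0)
std = (torch.stack(stds) * w[:, None]).sum(0)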

br,
Francesco

This leads to problems when the datasets have very large differences in mean and std, for example the ETH case, where the speeds are increased with respect to the other 4 scenes.
In the SR-LSTM paper they say that the video is accelerated, so they sample the video at a higher frequency, but that leads to results that are not comparable with methods that did not take this approach (for example, they get ETH down to 0.60 or so for the MAD; using their resampled data we can also go down to something like 0.58, but since we wanted to compare with other methods we used the original Social-GAN data).

I see; unfortunately I'm also quite familiar with the several flaws this dataset carries around, and despite everything it has oddly become a sort of de facto standard for this kind of task ¯\_(ツ)_/¯
Anyway, I'll try normalizing on a scene-by-scene basis and let you know if I find something interesting.

Thanks for your response and your time :)
Alex