una-dinosauria/3d-pose-baseline

Input Data Normalization

Closed this issue · 6 comments

Hi,

First of all, thanks for open-sourcing such a great piece of work. I have a question about the normalization of the 2D inputs (both GT and SH detections):

  • Both 2D GT and SH detections are in image coordinates (1000x1000) and you perform the normalization directly on these values. IMO, this means that the model learns to take the location of the subject into account during training. I wonder if you ever tried to use relative joint locations around a root joint as input (a normalization similar to centering the 3D points around the root); a sketch of what I mean is below. Could it be useful to do such a normalization?
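
A minimal sketch of the root-relative 2D normalization I have in mind, assuming a `(N, J, 2)` array of 2D joints; the joint layout and root index are placeholders, not the repository's actual ordering:

```python
import numpy as np

def center_2d_around_root(poses_2d, root_idx=0):
    """Make 2D joints relative to a root joint.

    poses_2d: (N, J, 2) joint locations in image coordinates.
    root_idx: index of the root joint (e.g. the hip); a placeholder here.
    """
    root = poses_2d[:, root_idx:root_idx + 1, :]   # (N, 1, 2)
    return poses_2d - root

# Two copies of the same pose at different image locations become identical inputs,
# i.e. the subject's location in the frame is no longer visible to the model.
pose = np.array([[500., 500.], [520., 450.], [480., 450.]])  # toy 3-joint "pose"
shifted = pose + np.array([200., 100.])
batch = np.stack([pose, shifted])                             # (2, 3, 2)
centered = center_2d_around_root(batch)
assert np.allclose(centered[0], centered[1])
```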

Hi @mkocabas!

IMO, this means that the model learns to take the location of the subject into account during training.

I agree.

I wonder if you ever tried to use relative joint locations around a root joint as input (a normalization similar to centering the 3D points around the root).

I did try it, and I remember that the network did worse in the experiments without Procrustes alignment in post-processing; it was about the same with Procrustes alignment. It should be easy to try with our code.
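
For reference, a generic sketch of the kind of Procrustes (similarity-transform) alignment used in that post-processing step; this is the standard formulation, not necessarily line-for-line what the repository does:

```python
import numpy as np

def procrustes_align(predicted, target):
    """Align a predicted 3D pose (J, 3) to the target (J, 3) with the optimal
    similarity transform (translation, rotation, scale). Generic formulation."""
    mu_p, mu_t = predicted.mean(axis=0), target.mean(axis=0)
    X, Y = predicted - mu_p, target - mu_t

    # Optimal rotation from the SVD of the cross-covariance matrix.
    U, s, Vt = np.linalg.svd(X.T @ Y)
    R = U @ Vt
    if np.linalg.det(R) < 0:        # avoid reflections
        Vt[-1, :] *= -1
        s[-1] *= -1
        R = U @ Vt

    scale = s.sum() / (X ** 2).sum()
    return scale * X @ R + mu_t     # aligned prediction, expressed in the target frame
```

After such an alignment only the non-rigid part of the prediction contributes to the error, which is consistent with the centering mattering less in that setting.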

Could it be useful to do such a normalization?

It probably helps to estimate the person's scale, since there is a perspective projection happening and the network has no other visual information. Estimating 3D models up to scale is common in SfM pipelines and other 3D reconstruction tasks. I don't think it will make 3D pose estimation better per se.
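
To illustrate the up-to-scale point: under a pinhole projection, a pose that is twice as large and twice as far from the camera produces exactly the same 2D joints, so the 2D input alone cannot resolve absolute scale. A toy sketch with made-up camera parameters and joints:

```python
import numpy as np

def project(points_3d, f=1000.0, c=500.0):
    """Pinhole projection; the focal length and principal point are made up."""
    return f * points_3d[:, :2] / points_3d[:, 2:3] + c

# A toy 3-joint pose roughly 4 m in front of the camera (metres).
pose = np.array([[ 0.0, -0.8, 4.0],
                 [ 0.2,  0.0, 4.0],
                 [-0.2,  0.0, 4.0]])

# The same pose, twice as large and twice as far from the camera.
bigger_and_farther = 2.0 * pose

print(np.allclose(project(pose), project(bigger_and_farther)))  # True
```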

Thanks for the valuable answer @una-dinosauria!

I did try it, and I remember that the network did worse in the experiments without Procrustes alignment in post-processing; it was about the same with Procrustes alignment. It should be easy to try with our code.

Got it, but I feel that centering may help produce plausible predictions for in-the-wild images with the person in the center. So I'll give it a try!

It probably helps to estimate the person's scale, since there is a perspective projection happening and the network has no other visual information.

I couldn't follow this part. I would expect centering to make the scale predictions worse, since the input is in relative coordinates. But do we need to estimate the person's scale at all? How can predicting relative joint locations in 3D make use of scale information?

But do we need to estimate the person's scale at all?

In Protocol #1 of H3.6M the error is computed in such a way that we do. In other applications maybe you do not.
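
For context, as I understand Protocol #1: the mean per-joint position error is computed without any scale or Procrustes alignment of the prediction, so a pose that is correct only up to scale is still penalised. A toy sketch:

```python
import numpy as np

def mpjpe(predicted, target):
    """Mean per-joint position error: average Euclidean distance per joint,
    with no scale or Procrustes alignment (Protocol #1 style)."""
    return np.linalg.norm(predicted - target, axis=-1).mean()

# A prediction with exactly the right shape but 10% too small is still penalised.
target = np.random.RandomState(0).randn(17, 3) * 200.0   # toy root-relative pose, mm
predicted = 0.9 * target
print(mpjpe(predicted, target))   # > 0 even though the pose is correct up to scale
```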

How can predicting relative joint locations in 3D make use of scale information?

Perspective. There is a ground plane and we are looking down at it. Higher in the image ==> farther back. Lower ==> closer to the camera.
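
A toy pinhole-camera sketch of that argument (the focal length, principal point, and camera height are made-up numbers, not Human3.6M calibration): the feet of a subject that is farther away project higher in the image, closer to the horizon row.

```python
# Pinhole camera 1.5 m above a flat ground plane, optical axis horizontal.
# Camera axes: X right, Y down, Z forward; the image row v grows downward.
f, cy, cam_height = 1000.0, 500.0, 1.5

def image_row_of_ground_point(depth_z):
    """Image row of a point lying on the ground plane, depth_z metres away."""
    return f * cam_height / depth_z + cy

for z in (2.0, 4.0, 8.0, 16.0):
    print(f"{z:5.1f} m away -> image row {image_row_of_ground_point(z):6.1f}")
# Rows decrease towards the horizon row (cy): farther away ==> higher in the image.
```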

I understand now, thank you so much!

Closing for lack of activity. Please reopen if the issue is still ongoing.

Hi @mkocabas, great to see you here. I was working on VIBE, and I wanted to understand whether the Human3.6M 3D poses in this repository and the 3D poses output by VIBE are in the same coordinate system. That is, are the X, Y, and Z axes the same, or are they interchanged in some order?
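
In case it helps while this is open: assuming you can export corresponding joints for the same frames from both pipelines (and normalise away any unit difference first), a brute-force check over the 48 signed axis permutations will tell you how the axes map. The function below and its inputs are my own sketch, not part of either repository:

```python
import itertools
import numpy as np

def best_axis_mapping(poses_a, poses_b):
    """Brute-force search over the 48 signed axis permutations for the one that
    maps the axes of convention A onto convention B with the smallest error.

    poses_a, poses_b: (N, J, 3) arrays of corresponding joints for the same
    frames; getting that correspondence between the two pipelines is up to you.
    """
    a = poses_a - poses_a.mean(axis=1, keepdims=True)   # remove translation
    b = poses_b - poses_b.mean(axis=1, keepdims=True)
    best = None
    for perm in itertools.permutations(range(3)):
        for signs in itertools.product((1.0, -1.0), repeat=3):
            M = np.zeros((3, 3))
            for row, (col, sign) in enumerate(zip(perm, signs)):
                M[row, col] = sign                       # new axis `row` = sign * old axis `col`
            err = np.linalg.norm(a @ M.T - b)
            if best is None or err < best[0]:
                best = (err, perm, signs)
    return best   # (residual error, axis permutation, axis signs)
```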