garvita-tiwari/PoseNDF

Your pretrained model is still wrong.

HospitableHost opened this issue · 25 comments

When I use your pretrained model to calculate the distance of poses from AMASS, I get 'nan'.
For example, this pose (already converted to quaternions, sampled from AMASS with your code):
tensor([[ 0.9198, -0.3837, -0.0681, 0.0455],
[ 0.9138, -0.4025, 0.0352, -0.0405],
[ 0.9497, 0.3132, 0.0031, -0.0044],
[ 0.7995, 0.5880, 0.0242, -0.1203],
[ 0.7601, 0.6438, -0.0749, 0.0462],
[ 0.9952, -0.0918, 0.0040, 0.0338],
[ 0.9120, -0.3939, 0.0701, 0.0904],
[ 0.9162, -0.3983, -0.0442, -0.0026],
[ 1.0000, 0.0054, 0.0021, 0.0038],
[ 1.0000, 0.0000, 0.0000, 0.0000],
[ 1.0000, 0.0000, 0.0000, 0.0000],
[ 0.9611, -0.2573, 0.0985, 0.0199],
[ 0.9835, -0.0572, -0.0382, -0.1676],
[ 0.9785, -0.0461, 0.0593, 0.1920],
[ 0.9969, 0.0629, -0.0383, -0.0263],
[ 0.9057, 0.0084, -0.1829, -0.3823],
[ 0.9148, 0.0618, 0.1531, 0.3687],
[ 0.5447, -0.0337, -0.7993, 0.2514],
[ 0.5862, -0.0814, 0.7476, -0.3014],
[ 0.9316, -0.2648, -0.1939, 0.1562],
[ 0.9618, -0.2100, 0.1656, -0.0593]])
Its distance is 'nan' when using your pretrained model.

I tried with this value of pose and I am getting 0.0000087.
Are you getting nans for other poses also?

I tried with this value of pose and I am getting 0.0000087. Are you getting nans for other poses also?

Yes. Besides, I used the yaml file you updated yesterday, and the result is still nan for all poses.

I tried with this value of pose and I am getting 0.0000087. Are you getting nans for other poses also?

my test code is below:
[screenshot: test code]
the result is below:
[screenshot: nan output]
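
In text form, the test was roughly this (a minimal sketch; the checkpoint-loading code is elided, and the forward signature below is my assumption, not necessarily the repo's verbatim API):

import torch

# Rough sketch of the test in the screenshot. `model` stands for the
# pretrained PoseNDF network loaded from the released checkpoint; its
# exact loading code and forward signature are assumptions.
def score_pose(model, pose_21x4: torch.Tensor) -> torch.Tensor:
    pose = pose_21x4.unsqueeze(0)     # (1, 21, 4) batch of unit quaternions
    with torch.no_grad():
        return model(pose)            # predicted distance to the pose manifold

# With the AMASS pose quoted at the top of this issue, this printed nan.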

I tried with this value of pose and I am getting 0.0000087. Are you getting nans for other poses also?

more intuitively:
[two screenshots illustrating the nan results]

I am also getting NANs with the pretrained model. They appear after the 4th layer of DFNet.
Would be great if you could fix that asap.

It can be observed that model.dfnet.lin2.weight_v[62:125] are all zeros, and this will cause weight normalization to produce nan gradually. Is this the reason? @garvita-tiwari
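
A small self-contained demo of the failure mode (plain PyTorch, nothing repo-specific): weight norm reparameterizes w = g * v / ||v|| per output row, so an all-zero row of weight_v gives a zero norm and the division produces nan:

import torch
import torch.nn as nn

# weight_norm recomputes w = g * v / ||v|| on every forward pass.
lin = nn.utils.weight_norm(nn.Linear(4, 4))
with torch.no_grad():
    lin.weight_v[2].zero_()          # simulate an all-zero weight_v row

out = lin(torch.randn(1, 4))
print(out)                           # output element 2 is nan (0/0 in the division)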

I am also getting NANs with the pretrained model. They appear after the 4th layer of DFNet. Would be great if you could fix that asap.

I checked for my case, I don't have nans. Please share the pytorch3d version you are using

I am also getting NANs with the pretrained model. They appear after the 4th layer of DFNet. Would be great if you could fix that asap.

I checked for my case, I don't have nans. Please share the pytorch3d version you are using

0.7.2

I am also getting NANs with the pretrained model. They appear after the 4th layer of DFNet. Would be great if you could fix that asap.

I checked for my case, I don't have nans. Please share the pytorch3d version you are using

The code in the PoseNDF, DFNet, and BoneMLP classes doesn't use pytorch3d, so I think it has nothing to do with pytorch3d.
Besides, I input the tensor directly to the PoseNDF model.
[screenshot: tensor passed directly to the PoseNDF model]

I am also getting NANs with the pretrained model. They appear after the 4th layer of DFNet. Would be great if you could fix that asap.

I checked for my case, I don't have nans. Please share the pytorch3d version you are using

I am using pytorch3d 0.7.0, but I do not believe this to be the problem.
Like @raypine said, model.dfnet.lin2.weight_v[62:125] are all zeros, which makes the corresponding vector norms 0 and thus produces nans at division. Could you check the values of model.dfnet.lin2.weight_v[62:125] at your end?

I wasn't able to figure out the reason, because I am not getting nans. But I have uploaded another trained model here.
model-test.

This is not the converged model, please try with this model

I wasn't able to figure out the reason, because I am not getting nans. But I have uploaded another trained model here. model-test.

This is not the converged model, please try with this model

The new one works, and the predicted distance is 0.
But it seems the one you uploaded before was older than this new one:
the new one (37 minutes ago) is at epoch 10990, while the old one (19 days ago) was at epoch 2058.

I wasn't able to figure out the reason, because I am not getting nans. But I have uploaded another trained model here. model-test.

This is not the converged model, please try with this model

I find that this new model outputs 0 for any pose, even for noised poses.

I wasn't able to figure out the reason, because I am not getting nans. But I have uploaded another trained model here. model-test.
This is not the converged model, please try with this model

I find that this new model outputs 0 for any pose, even for noised poses.

Can confirm...
@garvita-tiwari are you sure that the uploaded code is correct? E.g. in net_modules.py, ReLU is always used, no matter what argument is given for "activation".
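For reference, a minimal sketch of how the argument could be honored (illustrative names, not the repo's exact code):

import torch.nn as nn

# Map the config's activation string to a module instead of hard-coding ReLU.
ACTIVATIONS = {'relu': nn.ReLU, 'lrelu': nn.LeakyReLU, 'softplus': nn.Softplus}

def make_activation(name: str) -> nn.Module:
    return ACTIVATIONS[name.lower()]()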

@HospitableHost hi, have you figured out the NaN problem?

@HospitableHost hi, have you figured out the NaN problem?

No, the model is wrong, so nobody can fix it.

@HospitableHost hi, have you figured out the NaN problem?

No, the model is wrong, so nobody can fix it.

@garvita-tiwari hi, I think PoseNDF makes sense and can be used to constrain the plausible space of animation motion, but there are some small errors in the currently uploaded code or model, so could you upload a correct code and model for testing? Let's improve the accuracy and generalization of the model, thanks a lot.

Hi, I have figured out a bug in the data and am now training the model again. I will share the model asap (within 1-2 days) if there are no other bugs.

Hi, I have figured out a bug in the data and am now training the model again. I will share the model asap (within 1-2 days) if there are no other bugs.

@garvita-tiwari
Thank you!!!

Hi, I have figured out a bug in the data and am now training the model again. I will share the model asap (within 1-2 days) if there are no other bugs.

Hi, I find that you uploaded a new one yesterday, and I tested it. It still seems not good.
I tested it with three activation functions (relu, lrelu and softplus), and it outputs the same order of magnitude for plausible poses and for noisy poses.
(The same holds for both kinds of noisy poses:
sampled_pose = sampled_pose + sigma*np.random.rand(21,4)*sampled_pose
and: sampled_pose = sampled_pose + sigma*np.random.rand(21,4))
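
Written out runnably (sigma and the starting pose here are placeholders):

import numpy as np

sigma = 0.1                                  # placeholder noise scale
sampled_pose = np.random.rand(21, 4)         # stands in for a sampled AMASS pose
sampled_pose /= np.linalg.norm(sampled_pose, axis=-1, keepdims=True)

# the two noise variants tested:
noisy_a = sampled_pose + sigma * np.random.rand(21, 4) * sampled_pose
noisy_b = sampled_pose + sigma * np.random.rand(21, 4)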

Hi, I have figured out a bug in the data and am now training the model again. I will share the model asap (within 1-2 days) if there are no other bugs.

Hi, I find that you uploaded a new one yesterday, and I tested it. It still seems not good. I tested it with three activation functions (relu, lrelu and softplus), and it outputs the same order of magnitude for plausible poses and for noisy poses. (The same holds for both kinds of noisy poses: sampled_pose = sampled_pose + sigma*np.random.rand(21,4)*sampled_pose and: sampled_pose = sampled_pose + sigma*np.random.rand(21,4))

Hi, yes, this is not the optimal model, as it was trained on a subset of AMASS. I will upload the optimal model soon.

Hi, I have figured out a bug in the data and am now training the model again. I will share the model asap (within 1-2 days) if there are no other bugs.

Hi, I find that you uploaded a new one yesterday, and I tested it. It still seems not good. I tested it with three activation functions (relu, lrelu and softplus), and it outputs the same order of magnitude for plausible poses and for noisy poses. (The same holds for both kinds of noisy poses: sampled_pose = sampled_pose + sigma*np.random.rand(21,4)*sampled_pose and: sampled_pose = sampled_pose + sigma*np.random.rand(21,4))

Hi, yes, this is not the optimal model, as it was trained on a subset of AMASS. I will upload the optimal model soon.

@garvita-tiwari hi, is the optimal model ready?

Hi,

Please check the version2 branch and the model (and config) here:
https://nextcloud.mpi-klsb.mpg.de/index.php/s/dFbra3gfycYc5mz

In the original model, we applied normalization along the wrong axis:
pose = torch.nn.functional.normalize(pose.to(device=self.device),dim=1)

Experimentally, though, it doesn't degrade the performance. We suspect this is because during training all the input quaternions are already unit quaternions, so this particular layer acts as a normalization layer.

This should be:
pose = torch.nn.functional.normalize(pose.to(device=self.device),dim=-1)
In version2, we have moved the data normalization step.
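
The difference only shows up once the input has a batch dimension; a standalone check (assuming a (batch, 21, 4) pose tensor):

import torch
import torch.nn.functional as F

pose = F.normalize(torch.randn(8, 21, 4), dim=-1)   # batch of unit quaternions

wrong = F.normalize(pose, dim=1)     # normalizes across the 21 joints
right = F.normalize(pose, dim=-1)    # normalizes each 4-d quaternion

print(wrong[0, 0].norm())            # != 1 in general
print(right[0, 0].norm())            # 1.0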

Hi,

Please check the version2 branch and the model (and config) here: https://nextcloud.mpi-klsb.mpg.de/index.php/s/dFbra3gfycYc5mz

In the original model, we applied normalization along the wrong axis: pose = torch.nn.functional.normalize(pose.to(device=self.device),dim=1)

Experimentally, though, it doesn't degrade the performance. We suspect this is because during training all the input quaternions are already unit quaternions, so this particular layer acts as a normalization layer.

This should be: pose = torch.nn.functional.normalize(pose.to(device=self.device),dim=-1) In version2, we have moved the data normalization step.

The output poses from the function 'axis_angle_to_quaternion' are already normalized, I think. So it seems we don't need the 'pose = torch.nn.functional.normalize(pose.to(device=self.device),dim=-1)' step.

Hi,
Please check the version2 branch and the model (and config) here: https://nextcloud.mpi-klsb.mpg.de/index.php/s/dFbra3gfycYc5mz
In the original model, we applied normalization along the wrong axis: pose = torch.nn.functional.normalize(pose.to(device=self.device),dim=1)
Experimentally, though, it doesn't degrade the performance. We suspect this is because during training all the input quaternions are already unit quaternions, so this particular layer acts as a normalization layer.
This should be: pose = torch.nn.functional.normalize(pose.to(device=self.device),dim=-1) In version2, we have moved the data normalization step.

The output poses from the function 'axis_angle_to_quaternion' are already normalized, I think. So it seems we don't need the 'pose = torch.nn.functional.normalize(pose.to(device=self.device),dim=-1)' step.

Yes, the output is already normalized, so you don't need this step. I am keeping it in case axis_angle_to_quaternion wasn't performed in a previous step, e.g. in experiments/sample_poses.py.
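
A quick check of that with pytorch3d:

import torch
from pytorch3d.transforms import axis_angle_to_quaternion

aa = torch.randn(21, 3)                  # per-joint axis-angle rotations
quat = axis_angle_to_quaternion(aa)      # (21, 4)
print(quat.norm(dim=-1))                 # all ~1.0: already unit quaternions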