una-dinosauria/3d-pose-baseline

Issues with output from maskRCNN key points

Closed this issue · 4 comments

Operating System: Ubuntu 18.04
Tensorflow: Docker file FROM tensorflow/tensorflow:1.13.1-gpu-py3

Hello,

I apologize in advance for the long post.

Thanks for making the code for this research open and easy to read and use.

I'm performing research for the University of New Brunswick, where we're evaluating view invariance of 3D pose estimation deep learning algorithms for clinical mobility tests (how long it takes a patient to stand up from a chair and walk a certain distance can be used as a mobility metric). We have tested DMHS on videos recorded in our lab; it works great for extracting movement timings from some perspectives but completely falls apart for others. DMHS goes directly from the raw image to a 3D skeleton, and we are hoping that a 2D-skeleton-to-3D-skeleton approach will work better, since 2D pose estimation seems to be much more robust to different perspectives (we are currently using Mask R-CNN via Facebook's Detectron to get the 2D skeletons).

In your paper you mention transforming skeletons to a consistent view and removing the view variance during training. This video seemed to work quite well across changing views, so we decided to give your model a try. I am running your model on 2D keypoints we calculated with Mask R-CNN, and the output is weird:

On the left you can see the 2D skeleton from Mask R-CNN after it has been normalized, put through the model pipeline, and then un-normalized. On the right you can see the pose as predicted by your model.

I think there are two reasons why this is happening

  1. The order in which the 2D skeleton points are being put into the model may not be correct. I need to convert the Mask R-CNN keypoints into something your model can use, so I've followed the mappings for Stacked Hourglass, since it shares (for the most part) the same keypoints as Mask R-CNN:

[image: keypoint mapping between Stacked Hourglass and Mask R-CNN]

so the array that I'm feeding into the transformations, normalization, and the model looks like

    array[0:2]   = [right_foot.x, right_foot.y]
    ...
    array[30:32] = [right_wrist.x, right_wrist.y]

Once I create that array, it goes through

    # Flatten each pose to a single row of interleaved (x, y) coordinates
    poses = np.reshape(poses, [poses.shape[0], -1])

    # Zero-pad to the full 32-joint H36M layout and scatter the 16 SH joints
    # into the x/y dimensions the model actually uses
    poses_final = np.zeros([poses.shape[0], len(H36M_NAMES)*2])
    dim_to_use_x = np.where(np.array([x != '' and x != 'Neck/Nose' for x in H36M_NAMES]))[0] * 2
    dim_to_use_y = dim_to_use_x + 1
    dim_to_use = np.zeros(len(SH_NAMES)*2, dtype=np.int32)
    dim_to_use[0::2] = dim_to_use_x
    dim_to_use[1::2] = dim_to_use_y

    poses_final[:, dim_to_use] = poses
    data["camera" + str(camera_num)] = poses_final

    # Compute mean/std over my own 2D data and normalize with those statistics
    data_mean_2d, data_std_2d, dim_to_ignore_2d, dim_to_use_2d = normalization_stats(poses_final, dim=2)
    data_set = normalize_data(data, data_mean_2d, data_std_2d, dim_to_use_2d)

and then later into the model.
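For reference, a conversion from the COCO-ordered keypoints that Detectron / Mask R-CNN outputs into that 16-joint array would look roughly like the sketch below. The joint order is my reading of SH_NAMES in data_utils.py, and the Hip, Spine, Thorax and Head joints (which COCO does not provide) are filled in with midpoint approximations of my own, so treat this as a sketch rather than a drop-in:

    import numpy as np

    # Standard COCO keypoint order, as produced by Detectron / Mask R-CNN (17 joints)
    COCO_NAMES = [
        'nose', 'left_eye', 'right_eye', 'left_ear', 'right_ear',
        'left_shoulder', 'right_shoulder', 'left_elbow', 'right_elbow',
        'left_wrist', 'right_wrist', 'left_hip', 'right_hip',
        'left_knee', 'right_knee', 'left_ankle', 'right_ankle']

    def coco_to_sh(coco_kps):
        """Map a (17, 2) array of COCO keypoints onto the 16-joint SH/MPII order
        expected by the snippet above. Hip, Spine, Thorax and Head are
        approximated from midpoints because COCO has no such joints."""
        c = {name: coco_kps[i] for i, name in enumerate(COCO_NAMES)}
        hip    = (c['left_hip'] + c['right_hip']) / 2.0
        thorax = (c['left_shoulder'] + c['right_shoulder']) / 2.0
        spine  = (hip + thorax) / 2.0
        head   = (c['left_ear'] + c['right_ear']) / 2.0  # rough stand-in for the head top

        # Order as I read SH_NAMES in data_utils.py -- double-check against your copy:
        # RFoot, RKnee, RHip, LHip, LKnee, LFoot, Hip, Spine,
        # Thorax, Head, RWrist, RElbow, RShoulder, LShoulder, LElbow, LWrist
        sh = np.stack([
            c['right_ankle'], c['right_knee'], c['right_hip'],
            c['left_hip'], c['left_knee'], c['left_ankle'],
            hip, spine, thorax, head,
            c['right_wrist'], c['right_elbow'], c['right_shoulder'],
            c['left_shoulder'], c['left_elbow'], c['left_wrist']])
        return sh.reshape(-1)  # (32,) vector: x0, y0, x1, y1, ...

    # e.g. poses = np.stack([coco_to_sh(kps) for kps in per_frame_keypoints])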

  2. The second thing that I feel may be wrong is that, since I do not have 3D skeletons I can use with this model, I do not have the variables

data_std_3d, data_mean_3d, dim_to_ignore_3d

needed to un-normalize the output of the model so that I can work with the skeleton. I assumed the de-normalization would work OK using the 3D keypoint statistics from the h36m .h5 files that I downloaded via wget https://www.dropbox.com/s/e35qv3n6zlkouki/h36m.zip (as in the README), so I just ran

    # Load the H36M 3D data only to obtain its normalization statistics
    train_set_3d, test_set_3d, data_mean_3d, data_std_3d, dim_to_ignore_3d, dim_to_use_3d, train_root_positions, test_root_positions = read_3d_data(
        actions, FLAGS.data_dir, FLAGS.predict_14)

I changed the function to ignore the rcams.
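The un-normalization step I then apply to the network output looks roughly like the sketch below. It assumes the unNormalizeData helper in data_utils.py takes (data, mean, std, dimensions_to_ignore), which is how predict_3dpose.py appears to call it, and poses3d_norm is just my name for the raw model output:

    import data_utils

    # poses3d_norm: raw model output, one row per frame, still normalized
    poses3d = data_utils.unNormalizeData(
        poses3d_norm, data_mean_3d, data_std_3d, dim_to_ignore_3d)
    # poses3d should now hold root-relative 3D joint positions in the full
    # len(H36M_NAMES)*3 layout, ready for plotting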

I feel the issues with the output are due to my misuse of the model and not due to the model itself.

I can provide the code I'm using to run everything as well.

Thank you

Hi!

Re: 1: I'm pretty busy with a deadline right now, but there is follow-up work from Facebook that has used Detectron and our 2d-3d system before. They also have super good documentation. You might want to check it out: https://github.com/facebookresearch/VideoPose3D/blob/master/DATASETS.md

Re: 2: IMO that sounds good -- using the 3d stats from H3.6M should work fine.

Hope that helps!

Thanks!

No rush on this response; I'm working on the VideoPose3D model you recommended, since it uses Detectron and I won't have to worry about input data errors.

I found an implementation of VideoPose3D that uses Detectron on in-the-wild images. I'll implement this and let you know how it goes. I do still want to figure out what's going on with my implementation of your model, though. Was there anything special you did when running your in-the-wild video? Watching that video is what gave me confidence that your model may work with our dataset.

We also have videos of people moving towards the camera, so their size in camera pixels changes. Below you can see images of the person starting out seated far away from the camera and then standing close to it.

The error in the first photo's predicted 3D pose makes me think I implemented your model wrong. The error in the second photo may be caused by the same thing, but do you think the changing perspective, and hence the pixel size of the 2D skeleton, has impacted the final 3D pose, and if so, could this be because of the normalization pre-processing step? I apologize if these questions are trivial; I'm in my undergrad and this is my first time working with computer vision and deep learning models.

Was there anything special you did when running your in-the-wild video?

Nope, just ran it on the output of Stacked Hourglass, as we do in our code.

do you think the changing perspective, and hence the pixel size of the 2D skeleton, has impacted the final 3D pose, and if so, could this be because of the normalization pre-processing step?

It may have a small effect, but it's really hard to tell with the 3d poses being all mangled up. I wouldn't think much about it until the 3d poses make sense.

Closing for lack of activity. Please reopen if the issue is still ongoing.