yasaminjafarian/HDNet_TikTok

Runtime and DensePose

pengcanon opened this issue · 3 comments

Thank you for sharing the project; very impressive work.

I have two quick questions. First, the paper says the network takes two inputs: the original image and the human figure mask. But on GitHub, a third input, the DensePose estimate, is suggested. Can you please clarify? Second, what is the runtime like? How long does it take the network to process one image?

Thank you

Thanks a lot for your interest in the paper.
The inputs to the network are the image, the mask, and the DensePose IUV map, as shown in Figure 4 of the paper.
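
For concreteness, a minimal sketch of preparing those three inputs might look like this. The file names, input resolution, and normalization here are assumptions for illustration; the repo's actual preprocessing may differ:

```python
import numpy as np
from PIL import Image

def load_inputs(image_path, mask_path, iuv_path, size=(256, 256)):
    """Load the RGB image, the binary human mask, and the DensePose IUV map."""
    image = np.asarray(
        Image.open(image_path).convert("RGB").resize(size), dtype=np.float32
    ) / 255.0                                   # H x W x 3, in [0, 1]
    mask = np.asarray(
        Image.open(mask_path).convert("L").resize(size), dtype=np.float32
    )[..., None] / 255.0                        # H x W x 1
    iuv = np.asarray(
        Image.open(iuv_path).convert("RGB").resize(size), dtype=np.float32
    ) / 255.0                                   # H x W x 3 (part index, U, V)
    # Zero out the background so only the human figure reaches the network.
    return image * mask, mask, iuv * mask

image, mask, iuv = load_inputs("frame.png", "frame_mask.png", "frame_IUV.png")
```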
Inference is very fast because the network is a lightweight hourglass, though I have not measured it myself. You can check it with the provided inference code in Google Colab.
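
One quick way to measure this in the Colab notebook is a small timing loop. Here `run_inference` is a hypothetical wrapper around the network's forward pass; substitute the notebook's actual predict call:

```python
import time

def time_inference(run_inference, inputs, warmup=3, runs=20):
    """Average the per-image latency of a forward-pass callable."""
    for _ in range(warmup):                 # exclude graph/GPU warm-up cost
        run_inference(*inputs)
    start = time.perf_counter()
    for _ in range(runs):
        run_inference(*inputs)
    avg = (time.perf_counter() - start) / runs
    print(f"avg {avg * 1000:.1f} ms/image ({1.0 / avg:.1f} fps)")
```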

Thank you so much for your speedy response. The reason I'm asking about the runtime is that I wonder whether this could be used for a live webcam stream at a reasonable frame rate, say >10 fps. I guess I'll build a webcam interface and try it myself; I'll let you know the performance. But feel free to warn me if you foresee any problems.
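
For what it's worth, the experiment I have in mind would be roughly this OpenCV loop. `get_mask`, `get_iuv`, and `run_inference` are placeholder stubs, not real APIs; in practice they would call a person-segmentation model, DensePose, and the HDNet depth network, respectively:

```python
import time
import cv2

# Placeholder stubs for the three models in the pipeline.
def get_mask(frame): return None
def get_iuv(frame): return None
def run_inference(frame, mask, iuv): return None

cap = cv2.VideoCapture(0)
try:
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        t0 = time.perf_counter()
        mask = get_mask(frame)
        iuv = get_iuv(frame)
        depth = run_inference(frame, mask, iuv)
        fps = 1.0 / (time.perf_counter() - t0)
        cv2.putText(frame, f"{fps:.1f} fps", (10, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2)
        cv2.imshow("HDNet live", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
finally:
    cap.release()
    cv2.destroyAllWindows()
```

One thing I realize from this sketch: the mask and IUV map have to be recomputed for every frame, so those upstream models may dominate the frame time rather than HDNet itself.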

I don't see a problem; I think it is feasible, but let me know how it turns out. Thanks a lot. I will close this issue, but you can update me through my email: yasamin@umn.edu
Note that generating the mesh from the depth map can be time-consuming. This line.
But if you comment out that line and set visualization to False, the inference by itself is very fast.
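
A placeholder sketch of that toggle, continuing from the loading example earlier in the thread. `run_network`, `depth_to_mesh`, and `visualize_depth` stand in for the repo's actual functions and are not its real API:

```python
def run_network(image, mask, iuv):          # placeholder for the hourglass pass
    return None

visualization = False                       # skip the slow rendering path

depth = run_network(image, mask, iuv)       # fast: lightweight forward pass

# depth_to_mesh(depth, "mesh.obj")          # time-consuming step: comment out
if visualization:
    visualize_depth(depth)                  # skipped when visualization is False
```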