ambianic/fall-detection

Upgrade pose detection from PoseNet MobileNetV1 to MoveNet, PoseNet 2.0 ResNet50, or BlazePose

ivelin opened this issue · 10 comments

UPDATE: June 11, 2021

Is your feature request related to a problem? Please describe.
Currently we use PoseNet MobileNetV1 with a 300x300 input tensor (image) by default for pose detection.

It struggles with poses of people lying down on the floor. We experimented with rotating images ±90°, which improves overall detection rates, but it still misses poses of fallen people, even when the full body is clearly visible to a human eye.
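For reference, here is a minimal sketch of that rotation experiment, assuming a hypothetical `detect_pose()` wrapper around whatever inference call is in use (the function name and return shape are illustrative, not our actual API):

```python
from PIL import Image

def detect_pose(image):
    # Placeholder for the actual PoseNet/TFLite inference call.
    # Assumed to return (pose, confidence_score).
    raise NotImplementedError

def detect_with_rotations(image: Image.Image):
    """Run detection on the original frame and on +/-90 degree rotations,
    then keep the highest-scoring result."""
    best = (None, 0.0, 0)  # (pose, score, angle)
    for angle in (0, 90, -90):
        # expand=True keeps the full frame visible after a quarter turn
        pose, score = detect_pose(image.rotate(angle, expand=True))
        if score > best[1]:
            best = (pose, score, angle)
    return best
```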

Clearly, the model has not been trained on poses of fallen people.

Describe the solution you'd like

  • Google AI introduced MoveNet on May 17, 2021:
    it runs at 30fps on a mobile phone. Released initially for TensorFlow.js, with a follow-up TFLite model release coming.

  • Google AI released PoseNet 2.0 with a ResNet50 base model in 2020, which runs at 5-6fps on a desktop CPU with noticeably better detection rates. Interactive web demo here. However, testing shows that even with these improvements it still misses some poses of people lying down (fallen poses) that are otherwise easy for a human eye to recognize. See the recorded video below for reference on situations where ResNet misses poses.

  • Google AI MediaPipe released a new iteration of BlazePose, which detects 33 (vs. 17) keypoints at 25-55fps on a desktop CPU (5-10 times faster than PoseNet 2.0 ResNet50). Testing shows that BlazePose does a better job with horizontal people poses, although it still misses some lying positions. See the attached video for reference. BlazePose interactive web demo here. Pose detection TFLite model here (a quick smoke-test sketch follows this list).
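Any of the TFLite releases above can be smoke-tested with the standard `tf.lite.Interpreter` API. A minimal sketch, assuming a locally downloaded model file (the file name is an assumption, and the output tensor layout differs per model, so consult each model card):

```python
import numpy as np
import tensorflow as tf

# Assumed local file, e.g. the BlazePose phase-1 detection model.
interpreter = tf.lite.Interpreter(model_path="pose_detection.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Dummy RGB frame sized to the model input; assumes a float32 model
# normalized to [-1, 1] (check the model card for the actual range).
_, h, w, _ = input_details[0]['shape']
frame = np.random.uniform(-1, 1, size=(1, h, w, 3)).astype(np.float32)

interpreter.set_tensor(input_details[0]['index'], frame)
interpreter.invoke()

for out in output_details:
    print(out['name'], interpreter.get_tensor(out['index']).shape)
```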

Additional context

  • Other 2D pose detection models

See TensorFlow 2 Detection Model Zoo
https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2_detection_zoo.md

Notice the high performance and dual purpose (object + keypoints) of CenterNet Resnet50 V1 FPN Keypoints 512x512 and CenterNet Resnet50 V2 Keypoints 512x512 (a loading sketch follows the CenterNet link below).

More on CenterNet and its various applications for object detection, pose detection and object motion tracking:
https://github.com/xingyizhou/CenterNet
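If we decide to evaluate the CenterNet keypoint models, the TF2 Detection Model Zoo SavedModels can be driven roughly like this. A sketch only: the archive path is an assumption, and the exact output keys (e.g. `detection_keypoints`) should be verified against the exported signature:

```python
import numpy as np
import tensorflow as tf

# Assumed path: the extracted archive from the TF2 Detection Model Zoo.
detect_fn = tf.saved_model.load("centernet_resnet50_v1_fpn_512x512_kpts/saved_model")

# TF2 Object Detection API models take a batched uint8 image tensor.
image = tf.constant(np.zeros((1, 512, 512, 3), dtype=np.uint8))
outputs = detect_fn(image)

# Dual-purpose output: boxes for objects plus keypoints for poses.
print(outputs["detection_boxes"].shape)
print(outputs["detection_keypoints"].shape)
```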

  • 3D pose detection
    There are new models being developed for 3D pose estimation, which could further improve fall detection performance in the future.

The various JS models are easy to see in the demo console log:

https://storage.googleapis.com/tfjs-models/demos/posenet/camera.html

(Screenshot: PoseNet demo console log showing the available model variants and their parameters.)

For example, a good ResNet model for single-pose use with balanced parameters for CPU (see the screenshot above for parameter details):
https://storage.googleapis.com/tfjs-models/savedmodel/posenet/resnet50/quant2/group1-shard11of12.bin

TF saved model checkpoints for PoseNet 2 are also listed here: https://github.com/tensorflow/tfjs-models/tree/master/posenet/src

My testing shows that the ResNet50 model is noticeably more accurate than the MobileNet model, although it's about 40% slower (6fps vs 10fps). Surprisingly, the multi-person variant performs about as fast as single-person. Also, single-person can get confused if there are multiple people in the image.

With these findings, I think it's more important to upgrade to a ResNet model (e.g. input 250, stride 32, quantized int or 2-byte float) and less important whether it's multi-pose or single-pose.

@bhavikapanara thoughts on this one?

I've done more testing between the current MobileNetV1 model and the single-person ResNet50 with the parameters from the previous comment (250x250, stride 32, 2-byte float quantization).

I find the ResNet50 model to have slightly slower inference time but much better performance in several important areas:

  • Correctly detects pose keypoints in more practical situations, such as:
    • When the person is facing away from the camera (and face keypoints are not there to be detected).
    • When there are obstacles in the way (chairs, tables) that prevent the model from seeing parts of the body.
    • Low ambient lighting. ResNet50 seems able to detect keypoints correctly in a dark image even when it's very hard to see the person with the naked eye.
  • Has fewer false positives. ResNet50 doesn't get tricked as easily as MobileNetV1 by paintings, pets and other objects that bear some resemblance to a human body.

I would like to know what your own experiments show.

If you are able to verify my findings on your data sets, I think upgrading to a ResNet model should be the next priority on the roadmap for improving fall detection.
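To make the fps numbers reproducible across machines, a rough timing harness like this could be used; a sketch, assuming a float32-input TFLite model (a quantized uint8 input would need a different dummy tensor):

```python
import time
import numpy as np
import tensorflow as tf

def benchmark_tflite(model_path: str, runs: int = 50) -> float:
    """Return an approximate frames/sec figure for a TFLite model."""
    interpreter = tf.lite.Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    dummy = np.random.uniform(-1, 1, size=inp['shape']).astype(np.float32)

    # Warm-up pass so one-time setup cost doesn't skew the average.
    interpreter.set_tensor(inp['index'], dummy)
    interpreter.invoke()

    start = time.perf_counter()
    for _ in range(runs):
        interpreter.set_tensor(inp['index'], dummy)
        interpreter.invoke()
    return runs / (time.perf_counter() - start)
```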

PoseNet 2.0 ResNet 50 testing video

https://youtu.be/6Dz12WtpWuM


BlazePose testing video

https://youtu.be/mpqsm1aXUVc

BlazePose model card: https://drive.google.com/file/d/1zhYyUXhQrb_Gp0lKUFv1ADT3OCxGEQHS/view?usp=drivesdk

TFLite models for pose detection (phase 1) and keypoint estimation (phase 2):

https://google.github.io/mediapipe/solutions/models.html#pose

An interesting detail worth investigating further is the fact that BlazePose estimates a body vector as part of the first phase, pose detection, before it runs the second phase for keypoint estimation.

Since for fall detection we are mainly interested in the spinal vector, this could mean even faster inference.

See this text from the blog:

http://ai.googleblog.com/2020/08/on-device-real-time-body-pose-tracking.html

“for the human pose tracking we explicitly predict two additional virtual keypoints that firmly describe the human body center, rotation and scale as a circle. Inspired by Leonardo’s Vitruvian man, we predict the midpoint of a person's hips, the radius of a circle circumscribing the whole person, and the incline angle of the line connecting the shoulder and hip midpoints. This results in consistent tracking even for very complicated cases, like specific yoga asanas.”
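To make the spinal vector idea concrete for fall detection, here is a hedged sketch of deriving the hip and shoulder midpoints and the incline angle from detected keypoints. The keypoint names and the (x, y) coordinate convention are assumptions that depend on the model in use:

```python
import math

def midpoint(a, b):
    return ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)

def spinal_incline_degrees(keypoints: dict) -> float:
    """Angle of the hip-midpoint-to-shoulder-midpoint line, measured from
    vertical: ~0 means upright, ~90 means horizontal (a potential fall).
    `keypoints` maps illustrative names to (x, y) image coordinates."""
    hips = midpoint(keypoints['left_hip'], keypoints['right_hip'])
    shoulders = midpoint(keypoints['left_shoulder'], keypoints['right_shoulder'])
    dx = shoulders[0] - hips[0]
    dy = hips[1] - shoulders[1]  # image y grows downward
    return abs(math.degrees(math.atan2(dx, dy)))
```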

More test data with the MobileNetV1 model showing situations where it is not able to detect a human pose on the ground, even though it's easy for a human eye to see.

@bhavikapanara Google AI MediaPipe just released a [3D update to BlazePose](https://google.github.io/mediapipe/solutions/pose), one step closer to 3D pose detection, with a Z-axis value for depth. This can be helpful for cases when a person falls along the Z axis and the X,Y change vector angle remains small, but it does not tell us the whole story.