Use a Neural Network for hand recognition & tracking
snavas opened this issue · 11 comments
Maybe try a lightweight neural network (like YOLO?) with the color image, depth image & fingertip features. Use the current approach to generate training data.
One easy out-of-the-box solution for hand feature detection is this.
However, it doesn't detect hands in gloves.
An implementation of this can be found in the branch issue/neuralnet.
Important side note:
MediaPipe can detect hands with a wide variety of skin tones (tattoos are problematic, though).
I managed to train the DeepLabV3 model from PyTorch to do semantic hand segmentation using a tutorial, but the result is much slower than hoped for :/
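For reference, the per-frame inference with torchvision's DeepLabV3 looks roughly like this (a minimal sketch; the backbone choice, the two-class setup, and the checkpoint name are my assumptions, and the training itself followed the tutorial):

```python
# Minimal sketch: per-frame inference with torchvision's DeepLabV3.
# Model variant, class count, and checkpoint name are assumptions.
import torch
import torchvision.transforms as T
from torchvision.models.segmentation import deeplabv3_resnet50

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = deeplabv3_resnet50(num_classes=2).to(device).eval()  # hand / background
# model.load_state_dict(torch.load("hand_deeplabv3.pth"))    # hypothetical checkpoint

preprocess = T.Compose([
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def segment(frame_rgb):
    """Return a per-pixel hand mask for one RGB frame (numpy HxWx3)."""
    x = preprocess(frame_rgb).unsqueeze(0).to(device)
    with torch.no_grad():
        out = model(x)["out"]              # (1, num_classes, H, W)
    return out.argmax(1)[0].cpu().numpy()  # 0 = background, 1 = hand
```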
Here is one gif of the detection at full image resolution:
Here the resolution was reduced by 30%:
Edit:
This was done on the GPU, not the CPU, so that's not the cause of the performance issue.
Edit:
I also tried cropping the image to the size of one hand, which increases performance a little (0.8 sec per frame instead of 1.2), but it does not scale well if several hands are detected (n * 0.8 sec per frame).
I am currently trying this tutorial, which has a well-documented repository on GitHub. It provides several different models as backbones, including very lightweight ones like MobileNet, so I am hoping for a good inference time.
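Based on that repository's README, a training run might look roughly like this (a sketch; the constructor name, dataset paths, and hyperparameters are assumptions from memory and may differ):

```python
# Sketch: train a lightweight MobileNet+SegNet model with the
# keras_segmentation package from the repository mentioned above.
# All paths and hyperparameters here are placeholders.
from keras_segmentation.models.segnet import mobilenet_segnet

model = mobilenet_segnet(n_classes=2, input_height=224, input_width=224)

model.train(
    train_images="dataset/images_prepped_train/",            # RGB frames
    train_annotations="dataset/annotations_prepped_train/",  # per-pixel labels
    checkpoints_path="/tmp/mobilenet_segnet_hands",
    epochs=25,
)

# Single-image inference
out = model.predict_segmentation(
    inp="dataset/images_prepped_test/frame_0001.png",
    out_fname="/tmp/out.png",
)
```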
Here is the result using FCN32 and MobileNet from the previously mentioned repository.
It is much faster but also very inaccurate. The inaccuracy might be due to the fact that I only used one third of the training data this time. I will try training with more data and maybe switch out the models.
Update:
Here is FCN8 and MobileNet with more training data (25 epochs):
And here SegNet and MobileNet (5 epochs):
Update 2:
I think no major improvements in accuracy while maintaining the speed can be expected now, at least not with the time available and my knowledge of the topic. So, in conclusion, I think semantic segmentation is not a viable option for this project.
Currently, depth and optical flow are not used for the segmentation. However, the depth values from the camera are very inaccurate, so there is not much that can be done with them.
Using optical flow to segment hands, and maybe even recognise gestures, should be investigated.
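As a starting point, dense optical flow in OpenCV can be thresholded into a motion mask (a minimal sketch; the Farneback parameters and the magnitude threshold are guesses):

```python
# Sketch: segment moving regions (e.g. hands) via dense optical flow.
import cv2
import numpy as np

cap = cv2.VideoCapture(0)
_, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Farneback dense flow between consecutive frames
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    motion_mask = (mag > 2.0).astype(np.uint8) * 255  # threshold is a guess
    cv2.imshow("motion", motion_mask)
    prev_gray = gray
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
```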
IMO, hand feature detection (e.g. with MediaPipe) is much more feasible for this project. It has the downside of having to animate the detection.
OpenCV provides a built-in class for background removal:
link
We should try it out.
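Wiring up the MOG2 subtractor is only a few lines (a minimal sketch; the parameters are OpenCV's defaults and the cleanup step is a guess):

```python
# Sketch: OpenCV's built-in MOG2 background subtractor on a video stream.
import cv2

cap = cv2.VideoCapture(0)
subtractor = cv2.createBackgroundSubtractorMOG2(history=500,
                                                varThreshold=16,
                                                detectShadows=True)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg_mask = subtractor.apply(frame)     # 255 = foreground, 127 = shadow
    fg_mask = cv2.medianBlur(fg_mask, 5)  # light cleanup
    cv2.imshow("foreground", fg_mask)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
```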
Edit:
Here is a quick implementation:
So this obviously doesn't work on its own, but I still think background removal should be investigated. If I try it out in PowerPoint, it looks like this:
Maybe this can be done with GrabCut.
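Seeded with a rough bounding box around the hand (e.g. from a detector), a GrabCut pass could look like this (a minimal sketch; where the rectangle comes from is an assumption):

```python
# Sketch: refine a rough hand bounding box into a segmentation with GrabCut.
import cv2
import numpy as np

def grabcut_hand(frame_bgr, rect):
    """rect = (x, y, w, h) around the hand, e.g. from a detector."""
    mask = np.zeros(frame_bgr.shape[:2], np.uint8)
    bgd_model = np.zeros((1, 65), np.float64)
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(frame_bgr, mask, rect, bgd_model, fgd_model,
                5, cv2.GC_INIT_WITH_RECT)
    # Definite/probable foreground pixels become the hand mask.
    fg = (mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD)
    return np.where(fg, 255, 0).astype(np.uint8)
```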
Attempt number 2 at MediaPipe. I am using it for color calibration now. I think it has potential.
Currently, I set the detection confidence very high (only a few, but accurate, detections), use all the hand feature points detected by MediaPipe to get the hand color (I average the color at every hand feature point), and then segment the entire image for this hand color. An alternative approach would be to lower the detection confidence (-> many, but at times inaccurate, detections) and only segment the detected hand areas with the hand color for that area.
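Sketched out, the high-confidence calibration step might look like this (a minimal sketch; mp.solutions.hands is the real MediaPipe API, but the sampling and averaging details are just my reading of the description above):

```python
# Sketch: use high-confidence MediaPipe detections to sample the hand color.
import cv2
import mediapipe as mp
import numpy as np

hands = mp.solutions.hands.Hands(min_detection_confidence=0.9)

def sample_hand_color(frame_bgr):
    """Average the pixel color at every detected hand landmark."""
    h, w = frame_bgr.shape[:2]
    results = hands.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if not results.multi_hand_landmarks:
        return None
    samples = []
    for hand in results.multi_hand_landmarks:
        for lm in hand.landmark:  # landmarks are normalized [0, 1] coords
            x, y = int(lm.x * w), int(lm.y * h)
            if 0 <= x < w and 0 <= y < h:
                samples.append(frame_bgr[y, x])
    return np.mean(samples, axis=0) if samples else None
```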
Edit:
This is how the alternative approach looks.
I actually think it looks really nice so far :)
Currently, the hand color is taken to be between mean - (2*std) and mean + (2*std). But there is probably a better way to remove outliers from the detections than using standard deviation. I will have to look into that.
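Applied per channel, that band amounts to something like this (a minimal sketch using cv2.inRange; function and variable names are illustrative):

```python
# Sketch: segment the image for the calibrated hand color,
# keeping pixels within mean +/- 2*std per channel.
import cv2
import numpy as np

def hand_color_mask(frame_bgr, samples):
    """samples: Nx3 array of BGR colors collected at hand landmarks."""
    samples = np.asarray(samples, dtype=np.float32)
    mean, std = samples.mean(axis=0), samples.std(axis=0)
    lower = np.clip(mean - 2 * std, 0, 255).astype(np.uint8)
    upper = np.clip(mean + 2 * std, 0, 255).astype(np.uint8)
    return cv2.inRange(frame_bgr, lower, upper)  # 255 where color matches
```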