


MocapNET Project


News


8-11-2021

MocapNET3 with hand pose estimation support has landed in this repository! The latest version, which has been accepted at BMVC 2021, is now committed in the mnet3 branch of this repository. However, considerable code polish is still missing, and the 2D joint estimator currently offered does not detect hands, so a transition to a 2D joint estimator like Mediapipe Holistic is needed for a better live webcam demo. MocapNET3 will appear in the 32nd British Machine Vision Conference, which will be held virtually and is free to attend this year!

An upgraded 2020 version of MocapNET has landed! It contains a long list of improvements carried out during 2020 over the original work, allowing higher accuracy, smoother BVH output and better occlusion robustness while maintaining real-time performance. MocapNET2 will appear in the 25th International Conference on Pattern Recognition.

If you are interested in the older MocapNET v1 release, you can find it in the mnet1 branch.

Visualization Example: With MocapNET2 an RGB video feed like this can be converted to BVH motion frames in real-time. The result can be easily used in your favourite 3D engine or application.

Sample run

Example Output:

Youtube Video | MocapNET Output | Editing on Blender
YouTube Link | BVH File | Blender Video

Ensemble of SNN Encoders for 3D Human Pose Estimation in RGB Images


We present MocapNET v2, a real-time method that estimates the 3D human pose directly in the popular Bio Vision Hierarchy (BVH) format, given estimations of the 2D body joints originating from monocular color images.

Our contributions include:

  • A novel and compact 2D pose NSRM representation.
  • A human body orientation classifier and an ensemble of orientation-tuned neural networks that regress the 3D human pose by also allowing for the decomposition of the body to an upper and lower kinematic hierarchy. This permits the recovery of the human pose even in the case of significant occlusions.
  • An efficient Inverse Kinematics solver that refines the neural-network-based solution providing 3D human pose estimations that are consistent with the limb sizes of a target person (if known).

All the above yield a 33% accuracy improvement on the Human 3.6 Million (H3.6M) dataset compared to the baseline method (MocapNET v1) while maintaining real-time performance (70 fps in CPU-only execution).


Youtube Videos


ICPR 2020 Supplementary Video | BMVC 2019 Supplementary Video
YouTube Link | YouTube Link
ICPR 2020 Poster Session | BMVC 2021 Supplementary Video
YouTube Link | YouTube Link

Citation


Please cite the following papers if this work helps your research :

@inproceedings{Qammaz2021,
  author = {Qammaz, Ammar and Argyros, Antonis A},
  title = {Towards Holistic Real-time Human 3D Pose Estimation using MocapNETs},
  booktitle = {British Machine Vision Conference (BMVC 2021)},
  publisher = {BMVA},
  year = {2021},
  month = {November},
  projects =  {I.C.HUMANS},
  videolink = {https://www.youtube.com/watch?v=aaLOSY_p6Zc}
}
@inproceedings{Qammaz2020,
  author = {Ammar Qammaz and Antonis A. Argyros},
  title = {Occlusion-tolerant and personalized 3D human pose estimation in RGB images},
  booktitle = {IEEE International Conference on Pattern Recognition (ICPR 2020), (to appear)},
  year = {2021},
  month = {January},
  url = {http://users.ics.forth.gr/argyros/res_mocapnet_II.html},
  projects =  {Co4Robots},
  pdflink = {http://users.ics.forth.gr/argyros/mypapers/2021_01_ICPR_Qammaz.pdf},
  videolink = {https://youtu.be/Jgz1MRq-I-k}
}
@inproceedings{Qammaz2019,
  author = {Qammaz, Ammar and Argyros, Antonis A},
  title = {MocapNET: Ensemble of SNN Encoders for 3D Human Pose Estimation in RGB Images},
  booktitle = {British Machine Vision Conference (BMVC 2019)},
  publisher = {BMVA},
  year = {2019},
  month = {September},
  address = {Cardiff, UK},
  url = {http://users.ics.forth.gr/argyros/res_mocapnet.html},
  projects =  {CO4ROBOTS,MINGEI},
  pdflink = {http://users.ics.forth.gr/argyros/mypapers/2019_09_BMVC_mocapnet.pdf},
  videolink = {https://youtu.be/fH5e-KMBvM0}
}

Overview, System Requirements and Dependencies


MocapNET is a high-performance 2D to 3D single-person pose estimator. This code base targets recent Linux (Ubuntu 18.04 - 20.04 +) machines and relies on the Tensorflow C-API and OpenCV. Windows 10 users can try the Windows Subsystem for Linux, which has also been reported to work.

Tensorflow is used as the Neural Network framework for our work and OpenCV is used to enable the acquisition of images from webcams or video files as well as to provide an easy visualization method.

We have provided an initialization script that automatically handles most dependencies and downloads all needed pretrained models. After running it, the application should be ready for use. To examine the provided neural network .pb files you can download and use Netron.

Any issues not automatically resolved by the script can be reported in the issues section of this repository.

This repository contains 2D joint estimators for the MocapNET2LiveWebcamDemo. By giving it the correct parameters you can switch between a cut-down version of OpenPose (--openpose), VNect (--vnect) or our own MobileNet-based 2D joint estimator (default). All of these are automatically downloaded by the initialize.sh script.

To achieve higher-accuracy estimations, however, you are advised to set up a full OpenPose instance and use it to acquire JSON files with 2D detections. These can subsequently be converted to CSV using convertOpenPoseJSONToCSV and then to 3D BVH files using the MocapNET2CSV binary. This provides superior accuracy compared to the bundled 2D joint detectors, which favor speed in the live demo because 2D estimation is the bottleneck of the application. Our live demo will try to run the 2D joint estimation on your GPU and the MocapNET 3D estimation on the system CPU to achieve a combined framerate of over 30 fps, which on most systems matches or surpasses the acquisition rate of web cameras.

Unfortunately, there are many GPU compatibility issues with Tensorflow C-API builds, since recent versions have dropped support for CUDA 9.0 as well as for compute capabilities that might be required by your system. You can edit the initialize.sh script and change the variable TENSORFLOW_VERSION according to your needs. If you want CUDA 9.0 you should set it to 1.12.0. If you want CUDA 9.0 and have a card with older compute capabilities (5.2), choose version 1.11.0. If all else fails you can always recompile the Tensorflow C-API to match your specific hardware configuration. You can also use this script that automates building Tensorflow r1.15, which might help you deal with the Bazel build system and all of its weirdness.

Release 1.15 is the final release of the Tensorflow 1.x tree and is compatible with MocapNET. Tensorflow 2.x is also supported; according to the Tensorflow site, version 2.3 is the first version of the 2.x tree to re-include C bindings. The initialize.sh script will ask you which version you want to use and will try to download it and set it up locally for your MocapNET installation.

If you are interested in generating BVH training data for your research, we have also provided the code that handles randomization and pose perturbation from the CMU dataset. After a successful compilation, dataset generation is accessible through the scripts scripts/createRandomizedDataset.sh and scripts/createTestDataset.sh. All BVH manipulation code is imported from a secondary GitHub project that is automatically downloaded, included and built by the initialize.sh script. The two dataset scripts populate the dataset/ directory with CSV files that contain valid training samples based on the CMU dataset. These files are trivial to load using Python and can then be used as training samples with a deep learning framework like Keras to learn the 2D to 3D BVH mapping.
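As a minimal sketch of the loading step, the snippet below reads one CSV file into a list of column names and a list of numeric rows using only the Python standard library. It assumes that the file starts with a single header row and that every other field is numeric; the helper name load_samples is hypothetical and not part of this repository, so check the layout of your generated files before relying on it.

```python
import csv


def load_samples(path):
    """Read a CSV file with one header row; return (column_names, rows_of_floats).

    Assumption: the first line holds column names and all other fields are numeric.
    Empty fields (for example after a trailing comma) are ignored.
    """
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = [name.strip() for name in next(reader) if name.strip()]
        rows = []
        for record in reader:
            values = [float(value) for value in record if value.strip()]
            if values:  # skip blank lines
                rows.append(values)
    return header, rows
```

Calling load_samples on a file from the dataset/ directory returns plain Python lists that can be converted to arrays for the framework of your choice.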

Building the library


To download and compile the library issue :

sudo apt-get install git build-essential cmake libopencv-dev libjpeg-dev libpng-dev libglew-dev libpthread-stubs0-dev

git clone https://github.com/FORTH-ModelBasedTracker/MocapNET

cd MocapNET

./initialize.sh

After performing changes to the source code, you do not need to rerun the initialization script. You can recompile the code by using :

cd build 
cmake .. 
make 
cd ..

Updating the library


The MocapNET library is under active development, and so are its dependencies.

In order to update all the relevant parts of the code you can use the update.sh script provided.

./update.sh

If you have made changes to the source code that you want to discard, you can revert to the master branch using the revert.sh script provided :

./revert.sh

Testing the library and performing benchmarks


To test your OpenCV installation as well as support of your webcam issue :

./OpenCVTest --from /dev/video0 

To test OpenCV support of your video files issue :

./OpenCVTest --from /path/to/yourfile.mp4

These tests only use OpenCV (without Tensorflow or any other dependencies) and are intended as a quick way to identify and debug configuration problems on your system. If you have problems playing back video files or using your webcam, you might want to consider compiling OpenCV yourself. The scripts/getOpenCV.sh script has been included to automatically fetch and build OpenCV for your convenience. The CMake file provided will automatically try to set the OpenCV_DIR variable to target the locally built version made using the script. If you are having trouble switching between the system version and the downloaded version, consider using the cmake-gui utility or removing the build directory and making a fresh one, once again following the Building instructions. The new build directory should reset all paths, automatically detect the local OpenCV version built by scripts/getOpenCV.sh and use it by default.

Live Demo


Assuming that the OpenCVTest executable described previously is working correctly with your input source, to do a live test of the MocapNET library using a webcam issue :

./MocapNET2LiveWebcamDemo --from /dev/video0 --live

To dump 5000 frames from the webcam to out.bvh instead of running live, issue :

./MocapNET2LiveWebcamDemo --from /dev/video0 --frames 5000

To control the resolution of your webcam you can use the --size width height parameter. Make sure that the resolution you provide is supported by your webcam model; you can list the supported sensor sizes and rates with the v4l2-ctl tool. By issuing --forth you can use our FORTH-developed 2D joint estimator, which performs faster but offers lower accuracy.

v4l2-ctl --list-formats-ext
./MocapNET2LiveWebcamDemo --from /dev/video0 --live --forth --size 800 600

Testing the library using a pre-recorded video file (i.e. not live input) means you can use a slower but more precise 2D Joint estimation algorithm like the included OpenPose implementation. You should keep in mind that this OpenPose implementation does not use PAFs and so it is still not as precise as the official OpenPose implementation. To run the demo with a prerecorded file issue :

./MocapNET2LiveWebcamDemo --from /path/to/yourfile.mp4 --openpose

We have included a video file that should be automatically downloaded by the initialize.sh script. Issuing the following command should run it and produce an out.bvh file even if you don't have a webcam or any other video files available :

./MocapNET2LiveWebcamDemo --from shuffle.webm --openpose --frames 375

Since high-framerate output is hard to examine, you can use the --delay flag to add a programmable delay between frames if you need more time to inspect the output. Issuing the following will add 1 second of delay after each processed frame :

./MocapNET2LiveWebcamDemo --from shuffle.webm --openpose --frames 375 --delay 1000

If your target is a headless environment, consider deactivating the visualization by passing the runtime argument --novisualization. This prevents any windows from opening and so avoids display-related errors.

BVH output is stored in the "out.bvh" file by default. If you want it to be stored in a different path, use the -o option. BVH files can be easily viewed using a variety of compatible applications. We suggest Blender, which is a very powerful open-source 3D editing and animation suite, or BVHacker, which is freeware and compatible with Wine.
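Before opening a result in a 3D application, it can be handy to check how many frames were actually written. The sketch below reads the standard MOTION header of a BVH file ("Frames:" and "Frame Time:" lines) and returns both values. The function name read_bvh_motion_info is hypothetical and not part of this repository; it relies only on the generic BVH layout, not on anything specific to MocapNET.

```python
def read_bvh_motion_info(path):
    """Return (frame_count, frame_time_seconds) from the MOTION section of a BVH file."""
    frame_count = None
    frame_time = None
    in_motion = False
    with open(path) as handle:
        for line in handle:
            text = line.strip()
            if text == "MOTION":
                in_motion = True
            elif in_motion and text.startswith("Frames:"):
                frame_count = int(text.split(":", 1)[1])
            elif in_motion and text.startswith("Frame Time:"):
                frame_time = float(text.split(":", 1)[1])
                break  # the per-frame channel values follow; they are not needed here
    if frame_count is None or frame_time is None:
        raise ValueError("no complete MOTION header found in " + path)
    return frame_count, frame_time
```

For example, read_bvh_motion_info("out.bvh") gives the number of recorded frames and the time step between them, whose product is the clip duration in seconds.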

MocapNETLiveWebcamDemo default visualization

./MocapNET2LiveWebcamDemo --from shuffle.webm --openpose --show 0 --frames 375

MocapNETLiveWebcamDemo all-in-one visualization

./MocapNET2LiveWebcamDemo --from shuffle.webm --openpose --show 3 --frames 375

MocapNETLiveWebcamDemo rotation per joint visualization

./MocapNET2LiveWebcamDemo --from shuffle.webm --openpose --show 1 --frames 375

By using the --show option you can alternate between different visualizations. A particularly useful visualization is the "--show 1" one that plots the joint rotations as seen above.

MocapNETLiveWebcamDemo OpenGL visualization

./MocapNET2LiveWebcamDemo --from shuffle.webm --openpose --show 0 --opengl --frames 375

By executing "sudo apt-get install freeglut3-dev" to get the required libraries, then enabling the ENABLE_OPENGL CMake configuration flag during compilation and using the --opengl flag when running the MocapNET2LiveWebcamDemo you can also see the experimental OpenGL visualization illustrated above, rendering a skinned mesh that was generated using makehuman. The BVH file armature used corresponds to the CMU+Face armature of makehuman.

./MocapNET2LiveWebcamDemo --from shuffle.webm --openpose --gestures --frames 375

By starting the live demo using the --gestures argument you can enable an experimental, simple form of gesture detection as seen in the illustration above. Gestures are stored as BVH files and controlled through the gestureRecognition.hpp file. A client application can register a callback, as seen in the demo. The gesture detection code is experimental and has been included as a proof of concept: thanks to our high-level output, gestures can be detected simply by comparing subsequent BVH frames, as seen in the code. That said, gestures were not part of the original MocapNET papers.
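To illustrate the idea of comparing subsequent BVH frames, here is a small Python sketch that reports which channels changed by more than a threshold between two frames. It is not the method implemented in gestureRecognition.hpp. The function names and the 5 degree default threshold are hypothetical, and the sketch assumes that every compared channel is a rotation in degrees (positional channels would need separate handling).

```python
def angle_difference(a, b):
    """Smallest signed difference a - b between two angles in degrees."""
    return (a - b + 180.0) % 360.0 - 180.0


def moving_channels(previous_frame, current_frame, threshold_degrees=5.0):
    """Indices of channels whose rotation changed by more than the threshold.

    Both frames are sequences of rotation values in degrees, in the same channel order.
    """
    return [
        index
        for index, (before, after) in enumerate(zip(previous_frame, current_frame))
        if abs(angle_difference(after, before)) > threshold_degrees
    ]
```

A gesture detector could then look for a characteristic sequence of moving channels over several frames; the wrap-around handling keeps a step from 179 to -179 degrees from being reported as a large motion.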

ROS (Robot Operating System) node


mocapnet_rosnode screenshot with rviz

If you are interested in ROS development and looking for a 3D pose estimator for your robot, you are in luck: MocapNET has a ROS node! You can get it here!

Tuning Hierarchical Coordinate Descent for accuracy/performance gains


As described in the paper, the Hierarchical Coordinate Descent Inverse Kinematics algorithm has various hyper-parameters that have been set to default values after experiments. Depending on your deployment scenario you might want to sacrifice some performance for better accuracy. You can do this by altering the IK tuning parameters using the --ik switch.

A default run without the --ik switch uses a learning rate of 0.01, 5 iterations and 30 epochs. The iterations variable has the biggest impact on performance.

A run without the --ik flag is therefore equivalent to :

./MocapNET2LiveWebcamDemo --from shuffle.webm --ik 0.01 5 30

If you want a very high accuracy run and don't care as much about framerate, consider :

./MocapNET2LiveWebcamDemo --from shuffle.webm --ik 0.01 15 40

The IK module supports tailoring the model used for pose estimation to your liking using "--changeJointDimensions neckLength torsoLength chestWidth shoulderToElbowLength elbowToHandLength waistWidth hipToKneeLength kneeToFootLength shoeLength", as well as setting the focal length of your specific camera using "--focalLength fx fy". The following example will try to track the shuffle.webm sample with, following the parameter order above, the waist width and both leg segments (hip to knee, knee to foot) scaled to 150% of the normal size, and a focal length of 600 on x and y :

./MocapNET2LiveWebcamDemo --from shuffle.webm --ik 0.01 25 40 --changeJointDimensions 1.0 1.0 1.0 1.0 1.0 1.5 1.5 1.5 1.0 --focalLength 600 600
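Because the nine scale factors are positional, it is easy to put a value in the wrong slot. The following Python sketch builds the argument list from named values, using the parameter order listed in the text above. The constant and function names are hypothetical helpers for scripting and are not part of this repository; if the order expected by your build differs, adjust the list accordingly.

```python
# Order taken from the --changeJointDimensions description in this document.
JOINT_DIMENSION_ORDER = [
    "neckLength", "torsoLength", "chestWidth", "shoulderToElbowLength",
    "elbowToHandLength", "waistWidth", "hipToKneeLength",
    "kneeToFootLength", "shoeLength",
]


def joint_dimension_arguments(scales=None):
    """Build the --changeJointDimensions argument list; unspecified entries stay at 1.0."""
    scales = dict(scales or {})
    unknown = set(scales) - set(JOINT_DIMENSION_ORDER)
    if unknown:
        raise KeyError("unknown joint dimensions: " + ", ".join(sorted(unknown)))
    values = [str(float(scales.get(name, 1.0))) for name in JOINT_DIMENSION_ORDER]
    return ["--changeJointDimensions"] + values
```

Passing {"waistWidth": 1.5, "hipToKneeLength": 1.5, "kneeToFootLength": 1.5} reproduces the nine values used in the command above.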

If you don't care about fine results and just want a rough pose estimation extracted really fast, you can switch the IK module off completely using :

./MocapNET2LiveWebcamDemo --from shuffle.webm --noik

Headless deployment


When deploying the code on headless environments like Google Colab, where there is no display available, you might experience errors like :

(3D Points Output:xxxx): Gtk-WARNING **:  cannot open display: 

To overcome these errors, use the --novisualization switch to disable visualization windows.
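For batch processing on such a machine, a small wrapper can assemble the command line. The sketch below is one possible way to do it from Python, using only flags mentioned in this document (--from, --novisualization, -o, --frames and --ik). The function names are hypothetical, and run_demo assumes that it is called from the repository root, where the MocapNET2LiveWebcamDemo binary lives.

```python
import subprocess


def build_demo_command(source, output="out.bvh", frames=None, ik=None):
    """Assemble a headless MocapNET2LiveWebcamDemo command as an argument list.

    ik, if given, is a (learning_rate, iterations, epochs) tuple for the --ik switch.
    """
    command = ["./MocapNET2LiveWebcamDemo", "--from", source,
               "--novisualization", "-o", output]
    if frames is not None:
        command += ["--frames", str(frames)]
    if ik is not None:
        learning_rate, iterations, epochs = ik
        command += ["--ik", str(learning_rate), str(iterations), str(epochs)]
    return command


def run_demo(command):
    """Run the assembled command; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(command, check=True)
```

For instance, run_demo(build_demo_command("shuffle.webm", frames=375, ik=(0.01, 5, 30))) would process the bundled sample without opening any window.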

Higher accuracy using OpenPose JSON files


To get higher accuracy output than the performance-oriented live demo, you can use OpenPose and the 2D output JSON files it produces. The convertOpenPoseJSONToCSV application converts them to a CSV file, which the MocapNET2CSV binary can then convert to a BVH file. After downloading and building OpenPose, you can use it to acquire 2D JSON body pose data by running :

build/examples/openpose/openpose.bin -number_people_max 1 --hand --write_json /path/to/outputJSONDirectory/ -video /path/to/yourVideoFile.mp4

This will create files named in the following fashion: /path/to/outputJSONDirectory/yourVideoFile_XXXXXXXXXXXX_keypoints.json. Notice that the generated filenames encode the frame serial number padded to 12 characters (marked as X). You provide this length to our executable using the --seriallength command-line option.
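To inspect these files before conversion, the naming scheme can be reproduced in a few lines of Python. The sketch below builds the path for a given frame index and reads the 2D body keypoints of the first detected person. The function names are hypothetical and not part of this repository. It assumes zero padding of the serial number and the "people" / "pose_keypoints_2d" JSON layout written by recent OpenPose releases; older releases may use different key names.

```python
import json
import os


def keypoint_file_path(directory, label, frame_index, serial_length=12):
    """Path of the OpenPose JSON file for one frame: label_<padded index>_keypoints.json."""
    serial = str(frame_index).zfill(serial_length)
    return os.path.join(directory, "{}_{}_keypoints.json".format(label, serial))


def first_person_keypoints(path):
    """Return [(x, y, confidence), ...] for the first detected person, or [] if none."""
    with open(path) as handle:
        data = json.load(handle)
    people = data.get("people", [])
    if not people:
        return []
    flat = people[0].get("pose_keypoints_2d", [])
    # Keypoints are stored as a flat list of x, y, confidence triples.
    return [tuple(flat[i:i + 3]) for i in range(0, len(flat) - 2, 3)]
```

Here label corresponds to the video file name without its extension (yourVideoFile in the example above) and serial_length to the value passed with --seriallength.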

The dump_and_process_video.sh script is also included; it can fully process a video file by running OpenPose and then MocapNET, or serve as a guide for this procedure.

A utility has been included that converts the JSON files to a single CSV file. Run it by issuing :

./convertOpenPoseJSONToCSV --from /path/to/outputJSONDirectory/ --label yourVideoFile --seriallength 12 --size 1920 1080 -o .

For more information on how to use the conversion utility, please see the documentation inside the utility.

A sample CSV file has been included; you can run it by issuing :

./MocapNET2CSV --from dataset/sample.csv --visualize --delay 30

The delay is added to every frame so that there is enough time for the user to see the results. The visualization only contains the armature, since the CSV file does not include the input images.

Check out this guide contributed by a project user for more info.

Experimental utilities


The repository contains experimental utilities used for the development of the papers.

If you chose to download the CMU-BVH dataset using the ./initialize.sh script, the CSV cluster plot utility will allow you to perform the clustering experiments described in the papers.

CSV cluster plot utility

./CSVClusterPlot

BVHGUI2 is a very minimal utility you can use to become more familiar with the BVH armature used by the project. You can animate the armature using simple sliders, and the utility has minimal source code.

BVH GUI utility

./BVHGUI2 --opengl

License


This library is provided under the FORTH license.