fan23j/yolov5-vitpose-video-annotator

Support for Selective Person Tracking in Multi-Person Videos

DavidTu21 opened this issue · 2 comments

Hi there!

Firstly, I'd like to extend my heartfelt gratitude for the incredible work on this project. The functionality and performance have been outstanding, and it's been instrumental in my current work involving pose estimation.

I'm currently using the script to generate AlphaPose-style outputs from ViTPose, which I then feed into MotionBERT for 3D pose estimation. The results have been promising; however, I've encountered a specific challenge that I'd like to discuss.

In my current setup, I'm dealing with videos that feature multiple people. ViTPose efficiently identifies all individuals, but for my purposes, I need to track a specific person through the video. This is crucial because, as noted in the MotionBERT documentation, they currently support single-person analysis only: "Note: Currently we only support single person. If your video contains multiple person, you may need to use the Pose Tracking Module for AlphaPose and set --focus to specify the target person id."

Given this requirement, I'm curious whether ViTPose has similar functionality, or whether there's a workaround that can be implemented. Specifically, I'm looking for a feature that would let me select and track a single person in a video containing multiple individuals, akin to the `--focus` option in AlphaPose.

Thank you once again for your amazing work and support.

fan23j commented

Hello,

Currently, the implementation for inference with ViTPose on video simply extracts the frames from the input video and runs inference on each frame independently. Therefore, there are no detection IDs associated across frames that would be directly useful for your application.

Depending on your requirements, I think there are two straightforward options for you to try:

  1. If you know beforehand which individual you want to track, you can manually preprocess the first couple of frames of the video. For example, if you only want to extract poses for the lead dancer in a music video, you can manually crop out that individual in the first frame and fill the rest of the frame with black pixels. The ViTPose model will then only output predictions for the one individual left in the frame. You would then use these predictions to preprocess subsequent frames (i.e., use the predicted bounding box to crop the next frame); see the first sketch after this list. This approach won't handle scale changes well and will break if your target overlaps with another individual in any frame.
  2. (My recommendation) You can try out the https://github.com/open-mmlab/mmtracking repository. You should be able to replace the ViTPose annotation step with the pose-tracking options provided by mmtracking; a rough sketch follows the list. They also just recently added MotionBERT to their library, so they may continue expanding the project in ways that cover your needs.
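For option 1, here is a minimal sketch of the crop-and-propagate loop using OpenCV. The initial box, the padding, and `run_vitpose()` are all hypothetical placeholders; you would adapt them to this repo's per-frame inference code and to your own video.

```python
# Minimal sketch of option 1 (assumptions: the initial box is picked by hand,
# and run_vitpose() is a hypothetical wrapper around this repo's per-frame
# ViTPose inference that returns a single-person prediction with a 'bbox').
import cv2
import numpy as np

def mask_outside_box(frame, box, pad=20):
    """Black out everything outside the (padded) target box."""
    x1, y1, x2, y2 = box
    h, w = frame.shape[:2]
    x1, y1 = max(0, x1 - pad), max(0, y1 - pad)
    x2, y2 = min(w, x2 + pad), min(h, y2 + pad)
    masked = np.zeros_like(frame)
    masked[y1:y2, x1:x2] = frame[y1:y2, x1:x2]
    return masked

cap = cv2.VideoCapture("video.mp4")
box = (400, 100, 700, 600)  # hand-picked box around the target in frame 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    masked = mask_outside_box(frame, box)
    pose = run_vitpose(masked)  # hypothetical: prediction for the one visible person
    box = tuple(int(v) for v in pose["bbox"][:4])  # propagate the box forward
cap.release()
```

As noted above, this breaks down when the target's scale changes quickly or when another person enters the box.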
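For option 2, a rough sketch of how mmtracking's MOT inference could replace the detection step, assuming the `init_model`/`inference_mot` API from `mmtrack.apis` and the `track_bboxes` result layout (`[track_id, x1, y1, x2, y2, score]` per row) as I recall them. The config/checkpoint paths and `TARGET_ID` are placeholders, with `TARGET_ID` playing the same role as AlphaPose's `--focus`.

```python
# Rough sketch of option 2 using mmtracking's MOT API (paths and the
# result layout are assumptions; check the mmtracking docs/model zoo).
import cv2
from mmtrack.apis import init_model, inference_mot

mot = init_model(
    "configs/mot/bytetrack/bytetrack_yolox_x_crowdhuman_mot17-private-half.py",
    "bytetrack_checkpoint.pth",  # placeholder checkpoint path
    device="cuda:0",
)

TARGET_ID = 0  # track id of the person to keep, analogous to AlphaPose's --focus

cap = cv2.VideoCapture("video.mp4")
frame_id = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    result = inference_mot(mot, frame, frame_id=frame_id)
    # result['track_bboxes'] holds one array per class; each row is
    # [track_id, x1, y1, x2, y2, score]
    for tid, x1, y1, x2, y2, score in result["track_bboxes"][0]:
        if int(tid) == TARGET_ID:
            person = frame[int(y1):int(y2), int(x1):int(x2)]
            # run ViTPose on `person` (or on the full frame with this box)
    frame_id += 1
cap.release()
```

The crop (or box) for the chosen track id can then be fed into the existing ViTPose annotation step, so the AlphaPose-style output only ever contains the focused person.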

I hope this helps.

DavidTu21 commented

Hello,

Thank you very much for your detailed explanation and your thoughts on my requirements; I will take a close look at the options you mentioned. Thank you again for your time and the amazing work!