vt-vl-lab/iCAN

Demo on video

sophia-wright-blue opened this issue · 11 comments

Hello,

Thank you for releasing the code. On the home page, you have a demo on a video (two people talking, from The Big Bang Theory). In the README, you have instructions for Demo/Test on your own images.

Could you guide me through the process of running the demo on my own video? What are the steps to obtain the results on an mp4 file?

Thank you,

Hi,

In order to test on your own images, you have to follow all steps in the Demo/Test on your own images section. Those are:

1. Clone and set up the tf-faster-rcnn repository.
2. Convert your video to a PNG sequence and put the frames in the demo/ folder.
3. Detect all objects.
4. Detect all HOIs.
Step 3 will save Object_Detection.pkl in the demo folder, and step 4 will save HOI_Detection.pkl there as well. You can then use tools/Demo.ipynb to visualize all the HOI detections.

Hope this helps.

Thank you for the quick reply, @gaochen315. To convert the video to PNG sequences, would I have to use something like ffmpeg?

As for the result, will tools/Demo.ipynb give the output as the HOI interactions overlaid on the video?

I'd like to be able to input an mp4 file and get the HOI interactions on the mp4 file, exactly as you have demonstrated on the home page and on the project page.

thank you

Yes. You need to use ffmpeg for the conversion.
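
For example, something like this should extract the frames (the input file name, the 25 fps rate, and the demo/ output path are assumptions; adjust them to your video and setup):
ffmpeg -i input.mp4 -r 25 demo/pic%03d.png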

tools/Demo.ipynb will plot the images with the interactions annotated. You can save the visualizations instead of plotting them, and then convert the PNGs back to a video.
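
For example, assuming the notebook uses matplotlib for plotting, you could replace the call that displays each figure with something like this (the output path and the frame_idx variable are hypothetical):
plt.savefig('demo/out_%03d.png' % frame_idx, bbox_inches='tight')  # instead of plt.show()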

Ah, thank you @gaochen315, I get it now! So I guess I should use ffmpeg to convert the PNGs back to video?

ffmpeg as well. The following command will do the job:
ffmpeg -r 25 -f image2 -i pic%03d.png -pix_fmt yuv420p HOI.mp4
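
Here, -r 25 reads the image sequence at 25 frames per second, -f image2 tells ffmpeg the input is an image sequence, pic%03d.png matches the zero-padded frame file names, and -pix_fmt yuv420p keeps the resulting HOI.mp4 playable in most players.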

Thank you so much for the super fast reply, @gaochen315! Looking forward to using the repo.

Apologies for reopening the issue, @gaochen315. A related question: if the HOI detection is done only on individual frames, does the temporal aspect of the video get considered? For example, 3D CNNs or CNN+RNN models consider both the spatial and temporal aspects of videos. I hope my question is clear.

Currently, we only care about frame-level interaction detection. However, jointly considering temporal information is definitely a promising direction and worth exploring.

Got it. If it's not too much trouble, could you point me to a paper or two that consider the temporal information as well?

@sophia-wright-blue
Hi, I just read your conversation. Have you successfully gotten video detection working?