hoi_vid

What is it?

  • Input: Hand-object interaction (HOI) videos with hand and object bounding boxes in the first frame.

  • Output: per-frame hand-object masks, hand boxes, and (optionally) 3D hand poses.

Installation

We rely on STCN, a great video object segmentation system:

cd ..
git clone https://github.com/hkchengrex/STCN.git
cd - 
cp scripts/run_doh.py ../STCN/

Data Formats

Prepare your own sequence into the following format:

$seq1/
    images/%04d.png
    bbox.json # {'hand': [x1, y1, x2, y2], 'obj': [x1, y1, x2, y2]}
$seq2/
    images/%04d.png
    bbox.json 
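
As a minimal sketch of how a `bbox.json` in the layout above could be produced: the helper name `write_bbox` is hypothetical and not part of this repo, and it assumes the file is read back as standard JSON with pixel coordinates in `[x1, y1, x2, y2]` order.

```python
import json
from pathlib import Path


def write_bbox(seq_dir, hand, obj):
    """Write bbox.json for one sequence (hypothetical helper, not part of the repo).

    Both boxes are [x1, y1, x2, y2]; the top-left corner must come first.
    """
    for name, box in (("hand", hand), ("obj", obj)):
        x1, y1, x2, y2 = box  # raises ValueError unless there are exactly four values
        if not (x1 < x2 and y1 < y2):
            raise ValueError(f"{name} box must satisfy x1 < x2 and y1 < y2: {box}")
    seq_dir = Path(seq_dir)
    seq_dir.mkdir(parents=True, exist_ok=True)
    path = seq_dir / "bbox.json"
    with path.open("w") as f:
        json.dump({"hand": list(hand), "obj": list(obj)}, f)
    return path
```

For example, `write_bbox("seq1", [10, 20, 110, 220], [50, 60, 300, 400])` would create `seq1/bbox.json` next to the frame folder.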
For 100DOH dataset

The script downloads some videos from 100DOH and extracts short clips around their key frames. Use download_videos() and extract_key_frames() in extract_100doh.py to download the videos and extract some clips: python extract_100doh.py. The clips will be saved to

output/100doh_clips/
    diy_xcvdtw_frame234325/
        # user provided bbox.json: 
        {'obj': [x1, y1, x2, y2], 'hand': [x1, y1, x2, y2]}
        clip.mp4
        key_frame.jpg
        frames/
            01.jpg - xx.jpg

Process Sequence

Given the ground-truth bounding boxes of the hand and object in the first frame, vos.sh tracks both the hand and the object, extracts their masks, reconstructs the hand, and finds correspondences between multiple hands and objects.

  1. One-click preprocess: after changing DET_DIR, DATA_DIR, and RAWDIR in vos.sh, run
sh vos.sh $seq
  • extract masks in the first frame (hoi_det.py)
  • convert to STCN format
  • track with STCN
  • evaluate tracking quality (by mask IoU between the first frame and the result of tracking forward and backward) and visualize masks
  • find the corresponding hand (and reconstruct the hand with FrankMocap, if you uncomment preprocess/hand_det_w_gt.sh:L13-22)
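
The mask IoU used in the quality check can be sketched as follows. This is an illustration only, not the repo's implementation: `mask_iou` is a hypothetical name, masks are assumed to be equally sized 2D lists of 0/1 values (real code would more likely use NumPy arrays), and treating two empty masks as a perfect match is an assumed convention.

```python
def mask_iou(mask_a, mask_b):
    """Intersection over union of two binary masks (hypothetical helper).

    Masks are equally sized 2D lists of 0/1 values.
    Two empty masks are treated as a perfect match (IoU 1.0) by assumption.
    """
    inter = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            if a and b:
                inter += 1
            if a or b:
                union += 1
    return inter / union if union else 1.0
```

Under this reading, the first-frame mask would be compared against the mask obtained for that same frame after tracking forward through the clip and back again; a low IoU suggests the tracker drifted.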

Batchify script: list all clips and feed them through vos.sh one by one

python batch_vos.py
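
A minimal sketch of what such a batch loop could look like is given below. It is not the contents of batch_vos.py: the function names `list_clips` and `run_all` are hypothetical, and it assumes each clip is a sub-directory of one root folder and that vos.sh takes the clip name as its only argument, as in `sh vos.sh $seq`.

```python
import subprocess
from pathlib import Path


def list_clips(root):
    """Return the sorted names of clip directories directly under root."""
    return sorted(p.name for p in Path(root).iterdir() if p.is_dir())


def run_all(root, script="vos.sh", dry_run=False):
    """Run `sh <script> <clip>` for every clip and return the commands.

    With dry_run=True the commands are only built, not executed.
    """
    commands = [["sh", script, clip] for clip in list_clips(root)]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)  # stop at the first failing clip
    return commands
```

Calling `run_all("output/100doh_clips", dry_run=True)` first is a cheap way to check which clips would be processed before starting the actual tracking.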