aharley/pips2

test_on_tap.py results don't match expected results.

AssafSinger94 opened this issue · 4 comments

Hello,
When running test_on_tap.py, I get different results than reported in the testing section.
The mean d_avg of all 30 videos (output is added below) is 72.376, compared to d_avg 70.6; survival_16 89.3; median_l2 6.9 reported.
I downloaded the reference model using sh get_reference_model.sh, and I tested on tapvid_davis.pkl, which I downloaded and unzipped from https://storage.googleapis.com/dm-tapnet/tapvid_davis.zip.

I would really appreciate any assistance and clarifications on the matter!
Assaf

Attached below is the output:

model_name 1_128_i16_tap01_132907
loading TAPVID-DAVIS dataset...
found 30 videos in ./datasets/tapvid_davis
+--------------------------------------------------------+------------+
|                        Modules                         | Parameters |
+--------------------------------------------------------+------------+
|           module.fnet.layer3.0.conv1.weight            |   110592   |
|           module.fnet.layer3.0.conv2.weight            |   147456   |
|           module.fnet.layer3.1.conv1.weight            |   147456   |
|           module.fnet.layer3.1.conv2.weight            |   147456   |
|           module.fnet.layer4.0.conv1.weight            |   147456   |
|           module.fnet.layer4.0.conv2.weight            |   147456   |
|           module.fnet.layer4.1.conv1.weight            |   147456   |
|           module.fnet.layer4.1.conv2.weight            |   147456   |
|                module.fnet.conv2.weight                |   958464   |
|    module.delta_block.first_block_conv.conv.weight     |   275712   |
| module.delta_block.basicblock_list.2.conv2.conv.weight |   196608   |
| module.delta_block.basicblock_list.3.conv1.conv.weight |   196608   |
| module.delta_block.basicblock_list.3.conv2.conv.weight |   196608   |
| module.delta_block.basicblock_list.4.conv1.conv.weight |   393216   |
| module.delta_block.basicblock_list.4.conv2.conv.weight |   786432   |
| module.delta_block.basicblock_list.5.conv1.conv.weight |   786432   |
| module.delta_block.basicblock_list.5.conv2.conv.weight |   786432   |
| module.delta_block.basicblock_list.6.conv1.conv.weight |  1572864   |
| module.delta_block.basicblock_list.6.conv2.conv.weight |  3145728   |
| module.delta_block.basicblock_list.7.conv1.conv.weight |  3145728   |
| module.delta_block.basicblock_list.7.conv2.conv.weight |  3145728   |
+--------------------------------------------------------+------------+
total params: 17.57 M
reading ckpt from ./reference_model
...found checkpoint ./reference_model/model-000200000.pth
1_128_i16_tap01_132907; step 000001/30; rtime 0.01; itime 1.09; d_x 74.9; sur_x 100.0; med_x 1.8
1_128_i16_tap01_132907; step 000002/30; rtime 0.03; itime 0.91; d_x 71.1; sur_x 85.2; med_x 2.7
1_128_i16_tap01_132907; step 000003/30; rtime 0.04; itime 0.58; d_x 69.1; sur_x 87.8; med_x 2.9
1_128_i16_tap01_132907; step 000004/30; rtime 0.03; itime 0.97; d_x 72.6; sur_x 88.2; med_x 3.4
1_128_i16_tap01_132907; step 000005/30; rtime 0.03; itime 0.68; d_x 75.9; sur_x 90.1; med_x 2.8
1_128_i16_tap01_132907; step 000006/30; rtime 0.02; itime 0.71; d_x 77.7; sur_x 87.4; med_x 2.4
1_128_i16_tap01_132907; step 000007/30; rtime 0.02; itime 0.51; d_x 76.8; sur_x 88.6; med_x 3.7
1_128_i16_tap01_132907; step 000008/30; rtime 0.02; itime 1.16; d_x 75.5; sur_x 88.4; med_x 4.0
1_128_i16_tap01_132907; step 000009/30; rtime 0.04; itime 0.87; d_x 75.4; sur_x 88.4; med_x 4.0
1_128_i16_tap01_132907; step 000010/30; rtime 0.04; itime 1.15; d_x 70.9; sur_x 83.3; med_x 9.3
1_128_i16_tap01_132907; step 000011/30; rtime 0.04; itime 1.05; d_x 71.6; sur_x 84.4; med_x 8.7
1_128_i16_tap01_132907; step 000012/30; rtime 0.04; itime 0.93; d_x 71.3; sur_x 85.4; med_x 8.2
1_128_i16_tap01_132907; step 000013/30; rtime 0.03; itime 1.15; d_x 72.8; sur_x 86.5; med_x 7.6
1_128_i16_tap01_132907; step 000014/30; rtime 0.04; itime 0.97; d_x 72.4; sur_x 86.9; med_x 7.3
1_128_i16_tap01_132907; step 000015/30; rtime 0.03; itime 0.95; d_x 71.5; sur_x 86.8; med_x 8.1
1_128_i16_tap01_132907; step 000016/30; rtime 0.03; itime 1.00; d_x 71.7; sur_x 87.6; med_x 7.7
1_128_i16_tap01_132907; step 000017/30; rtime 0.04; itime 0.69; d_x 73.4; sur_x 88.3; med_x 7.3
1_128_i16_tap01_132907; step 000018/30; rtime 0.03; itime 0.75; d_x 73.1; sur_x 89.0; med_x 7.0
1_128_i16_tap01_132907; step 000019/30; rtime 0.03; itime 0.84; d_x 72.2; sur_x 89.0; med_x 6.9
1_128_i16_tap01_132907; step 000020/30; rtime 0.03; itime 0.60; d_x 71.6; sur_x 88.7; med_x 6.8
1_128_i16_tap01_132907; step 000021/30; rtime 0.03; itime 0.69; d_x 71.1; sur_x 88.8; med_x 6.6
1_128_i16_tap01_132907; step 000022/30; rtime 0.02; itime 0.73; d_x 72.0; sur_x 89.3; med_x 6.4
1_128_i16_tap01_132907; step 000023/30; rtime 0.03; itime 0.92; d_x 71.6; sur_x 89.3; med_x 6.2
1_128_i16_tap01_132907; step 000024/30; rtime 0.04; itime 1.07; d_x 71.1; sur_x 88.5; med_x 6.6
1_128_i16_tap01_132907; step 000025/30; rtime 0.04; itime 0.60; d_x 71.5; sur_x 88.6; med_x 6.8
1_128_i16_tap01_132907; step 000026/30; rtime 0.02; itime 0.65; d_x 70.8; sur_x 88.3; med_x 7.4
1_128_i16_tap01_132907; step 000027/30; rtime 0.02; itime 0.61; d_x 70.4; sur_x 88.5; med_x 7.4
1_128_i16_tap01_132907; step 000028/30; rtime 0.03; itime 0.95; d_x 70.4; sur_x 88.7; med_x 7.3
1_128_i16_tap01_132907; step 000029/30; rtime 0.05; itime 0.64; d_x 70.4; sur_x 89.0; med_x 7.1
1_128_i16_tap01_132907; step 000030/30; rtime 0.02; itime 0.60; d_x 70.5; sur_x 89.2; med_x 7.0

In addition, I have two questions about how this script tests on TAP-Vid DAVIS:

  1. During data loading in datasets.tapviddataset_fullseq.TapVidDavis, I noticed that the "raw videos" loaded from the pickle file are at 480x854 resolution, not the 256x256 resolution described in the paper.
  2. Right before model inference, the video and query points are resized to image_size (see test_on_tap.test_on_fullseq), which is set to (512,896) in test_on_tap.main. Could you please elaborate on this resizing? I couldn't find any mention of it when going over the paper.

Thanks for the messages.

For d_avg: How did you compute 72.376? It looks like it's showing 70.5 in the snippet you posted. The d_x shown in each row is the running average, so 70.5 is the average across the 30 videos.
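
To make the running-average point concrete, here is a minimal sketch using the d_x column copied from the log above. Averaging the 30 logged rows gives roughly 72.38, which is consistent with the 72.376 quoted in the question, while the last row (70.5) is already the mean over all 30 videos. The helper `per_video_from_running` is a hypothetical name, not part of the repository, and it assumes each row is a plain cumulative mean; because the log is rounded to one decimal, the recovered per-video values are only approximate.

```python
# d_x column copied from the 30 log rows above.
# Each entry is the running average over the videos seen so far.
running_d_x = [
    74.9, 71.1, 69.1, 72.6, 75.9, 77.7, 76.8, 75.5, 75.4, 70.9,
    71.6, 71.3, 72.8, 72.4, 71.5, 71.7, 73.4, 73.1, 72.2, 71.6,
    71.1, 72.0, 71.6, 71.1, 71.5, 70.8, 70.4, 70.4, 70.4, 70.5,
]


def per_video_from_running(running):
    """Approximately invert a cumulative mean: x_n = n*r_n - (n-1)*r_(n-1).

    Hypothetical helper; assumes a plain cumulative mean and inherits
    the one-decimal rounding error of the logged values.
    """
    values = []
    previous_sum = 0.0
    for n, r in enumerate(running, start=1):
        current_sum = n * r
        values.append(current_sum - previous_sum)
        previous_sum = current_sum
    return values


# Averaging the rows averages the running averages, which over-weights early videos.
mean_of_rows = sum(running_d_x) / len(running_d_x)

# The last row is already the mean over all videos.
final_running = running_d_x[-1]

# Sanity check: the mean of the recovered per-video scores equals the last row.
per_video = per_video_from_running(running_d_x)
mean_of_videos = sum(per_video) / len(per_video)

print(f"mean of logged rows:      {mean_of_rows:.3f}")
print(f"last logged row:          {final_running:.1f}")
print(f"mean of per-video scores: {mean_of_videos:.1f}")
```

The gap between the two numbers comes from the first few videos scoring well: their values are counted again in every later row, so a mean over rows is pulled upward.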

For resolution: I think some papers use 256x256 at test time, but we find that higher-resolution input helps performance, if you can afford it. The stats are still computed at 256x256 though.
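
As a rough illustration of running inference at one resolution and scoring at another, the sketch below rescales point coordinates between frame sizes. It is not the repository's code: `rescale_points` is a hypothetical helper, the query point is made up, and it assumes coordinates are stored as (x, y) pixels in the source frame. The three sizes are the ones mentioned in this thread (480x854 raw, 512x896 for inference, 256x256 for the stats).

```python
def rescale_points(points, src_hw, dst_hw):
    """Scale (x, y) pixel coordinates from a src (H, W) frame to a dst (H, W) frame.

    Hypothetical helper for illustration only.
    """
    scale_y = dst_hw[0] / src_hw[0]
    scale_x = dst_hw[1] / src_hw[1]
    return [(x * scale_x, y * scale_y) for x, y in points]


RAW_HW = (480, 854)    # raw DAVIS frames in the pickle, per this thread
INFER_HW = (512, 896)  # image_size set in test_on_tap.main, per this thread
EVAL_HW = (256, 256)   # resolution at which the stats are computed, per this thread

# Made-up query point at the center of the raw frame.
query_raw = [(427.0, 240.0)]

# Move the query into the inference resolution before running the tracker.
query_infer = rescale_points(query_raw, RAW_HW, INFER_HW)

# Stand-in for the tracker output: here the point simply stays where it is.
pred_infer = query_infer

# Map predictions to the evaluation resolution before thresholding distances.
pred_eval = rescale_points(pred_infer, INFER_HW, EVAL_HW)

print(query_infer)
print(pred_eval)
```

Under these assumptions, pixel-distance thresholds are applied to coordinates expressed on the 256x256 grid, so the numbers remain comparable with methods that also run inference at 256x256, even though the model saw larger frames.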

@aharley I'm getting similar results (~70 d_x). The PointOdyssey paper reported ~63 on this metric. I'm also getting ~7 on the MTE metric, while the paper reported ~4. My survival results are in line with the paper. Has the reference model been changed or otherwise improved since the paper, or is there a bug somewhere?