yanqinJiang/Consistent4D

Evaluation results

ymxie97 opened this issue · 3 comments

Hi, thanks for this excellent work.

I was evaluating the results you provided in test_dataset with the evaluation scripts in this repo. Here are the results I got:

Image level:

{
   "clip": {
      "trump": 0.9106602878309786,
      "aurorus": 0.869874659460038,
      "monster": 0.9317637719213963,
      "skull": 0.9398875301703811,
      "pistol": 0.9096568799577653,
      "guppie": 0.9233718882314861,
      "crocodile": 0.8745720302686095,
      "average": 0.9085410068343792
   },
   "lpips": {
      "trump": 0.17138301441445947,
      "aurorus": 0.12868298264220357,
      "monster": 0.16232836386188865,
      "skull": 0.1474510586122051,
      "pistol": 0.08828477596398443,
      "guppie": 0.11274099512957036,
      "crocodile": 0.11016011907486245,
      "average": 0.1315759013855963
   },
   "psnr": {
      "trump": 14.756038427352905,
      "aurorus": 14.504847288131714,
      "monster": 19.112009048461914,
      "skull": 20.408036708831787,
      "pistol": 20.574809551239014,
      "guppie": 15.90876841545105,
      "crocodile": 15.48453688621521,
      "average": 17.24986376081194
   },
   "ssim": {
      "trump": 0.8515339195728302,
      "aurorus": 0.9023219794034958,
      "monster": 0.8700040280818939,
      "skull": 0.894202470779419,
      "pistol": 0.9272156208753586,
      "guppie": 0.8855190426111221,
      "crocodile": 0.8963496088981628,
      "average": 0.889592381460326
   }
}

Video level:

{
   "trump": 1049.999203902612,
   "aurorus": 1939.7497665968854,
   "monster": 1141.1691800864905,
   "skull": 2028.6507870953083,
   "pistol": 1663.0212500789262,
   "guppie": 1131.7669306778407,
   "crocodile": 1675.7212055845519,
   "average": 1518.5826177175165
}

The numbers are different from (better than) the results shown in the paper. Could you please provide some insight into what might be causing this?

BTW, I also tested the STAG4D results using these scripts, but the results (both image-level and video-level) also differ from the numbers presented in the STAG4D paper (especially FVD: 1543 (mine) vs. 992.21 (paper)).

Hi, I guess you might have forgotten to convert the RGBA GT images to RGB images with a white background. All prediction results have a white background, but the RGBA GT images have a black background. I've mentioned this tip in README.md.

After converting the background, you should see results similar to those reported in the papers.
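
For reference, a minimal sketch of that conversion, assuming PIL is available; the directory names here are placeholders, not paths from this repo:

```python
# Composite RGBA GT images over a white background before evaluation.
# src/dst directory names below are hypothetical examples.
import os
from PIL import Image

def rgba_to_rgb_white(path_in: str, path_out: str) -> None:
    """Composite an RGBA image over white and save it as RGB."""
    rgba = Image.open(path_in).convert("RGBA")
    white = Image.new("RGBA", rgba.size, (255, 255, 255, 255))
    Image.alpha_composite(white, rgba).convert("RGB").save(path_out)

if __name__ == "__main__":
    src_dir, dst_dir = "gt_rgba", "gt_rgb_white"  # hypothetical directories
    os.makedirs(dst_dir, exist_ok=True)
    for name in os.listdir(src_dir):
        if name.endswith(".png"):
            rgba_to_rgb_white(os.path.join(src_dir, name),
                              os.path.join(dst_dir, name))
```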

Thank you for your prompt response! I will try adding a white background.

Hi Yanqin, after adding a white background to the GT images, the FVD looks normal. All metrics are better than the numbers reported in the paper:
LPIPS: 0.122 (reproduced) vs. 0.16 (paper); CLIP: 0.91 (reproduced) vs. 0.87 (paper); FVD: 992.92 (reproduced) vs. 1133.93 (from the STAG4D paper).
Did you perhaps improve the results after the paper submission?

Anyway, thank you so much for your reply. I will close this issue for now.