Stanford-TML/EDGE

Clarification on Reproducing PFC Score from EDGE Paper

AkideLiu opened this issue · 2 comments

Dear EDGE Authors,

I am fascinated by your paper and would like to better understand your methodology, particularly the process for replicating the evaluation metrics reported in your publication.

In your work, you mention that for automatic evaluations such as PFC, beat alignment, Dist_k, and Dist_g, 5-second clips were obtained from each model using slices from the test music set with a 2.5-second stride. However, it is unclear to me how these 5-second clips are derived from the AIST++ dataset: as far as I understand, the test set comprises 20 musical pieces, each 8 to 15 seconds long.
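If I understand the protocol correctly, each test piece would be windowed roughly like this. A minimal sketch, assuming the 5 s / 2.5 s values from the paper; the function name, sample rate, and the rule of dropping incomplete trailing windows are my own assumptions, not the authors' implementation:

```python
import numpy as np

def slice_clips(audio: np.ndarray, sr: int,
                clip_len: float = 5.0, stride: float = 2.5):
    """Cut a waveform into fixed-length clips with a fixed stride.

    clip_len and stride are in seconds; only full-length clips are kept
    (assumption: trailing partial windows are discarded).
    """
    win = int(clip_len * sr)
    hop = int(stride * sr)
    clips = []
    start = 0
    while start + win <= len(audio):
        clips.append(audio[start:start + win])
        start += hop
    return clips

# e.g. a 12-second piece at 16 kHz yields 3 full clips:
# [0, 5), [2.5, 7.5), [5, 10)
clips = slice_clips(np.zeros(12 * 16000), sr=16000)
print(len(clips))  # 3
```

Under this scheme an 8-second piece would yield 2 clips and a 15-second piece 5 clips, so the total clip count depends on how partial trailing windows are handled.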

I have attempted to replicate the PFC metrics using the following approaches:

  1. I used the raw input music from the AIST++ test set, selecting the initial 5 seconds of each piece, which resulted in a PFC of 1.6836428132824115.
  2. I used the pre-split slices in the test set (186 samples, each 5 seconds long), which gave a PFC of 1.2385500567535723.
  3. I ran the original implementation in test.py with its output-length setting (which I interpret as selecting a random slice of the music for motion generation); two generation runs yielded PFCs of 1.5676957425076252 and 1.7647031391283114.

I am trying to replicate the PFC score (1.5363) reported in your paper, and I would greatly appreciate your guidance in this matter. Please let me know if there are any misconceptions in my understanding.

Looking forward to your kind assistance.

@AkideLiu I tried approach 1, same as you, and got a result of 1.628. However, approach 2 seems more reasonable.

Have you ever tried computing the PFC metric on the ground-truth motion? I use smpl.forward(q, pos) to compute the full_pose for evaluation, but the result is a very low score (~0.3).
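For sanity-checking the ground-truth number, it may help to inspect the per-frame quantities PFC is built from (center-of-mass acceleration and foot velocities) after running forward kinematics. Below is a rough finite-difference sketch over joint positions; the function names, the 30 FPS rate, and the toy data are my assumptions, not the authors' exact implementation:

```python
import numpy as np

FPS = 30  # assumption: AIST++ motion sampled at 30 FPS

def finite_diff_velocity(positions: np.ndarray, fps: int = FPS) -> np.ndarray:
    """Per-frame velocity, shape (T-1, J, 3), from joint positions (T, J, 3)."""
    return np.diff(positions, axis=0) * fps

def finite_diff_acceleration(positions: np.ndarray, fps: int = FPS) -> np.ndarray:
    """Per-frame acceleration, shape (T-2, J, 3), via second differences."""
    return np.diff(positions, n=2, axis=0) * fps * fps

# Toy check: a single joint moving at a constant 1 m/s along x
# should show velocity ~1.0 and acceleration ~0.0.
T = 10
pos = np.zeros((T, 1, 3))
pos[:, 0, 0] = np.arange(T) / FPS
vel = finite_diff_velocity(pos)
acc = finite_diff_acceleration(pos)
print(vel[0, 0, 0])           # 1.0
print(np.abs(acc).max())      # 0.0
```

If the ground-truth feet are nearly static during high-acceleration frames, a very low PFC (lower is better in the paper) would be expected, so ~0.3 for ground truth may not be unreasonable; it would be good to confirm the frame rate and joint indices against the repository's own evaluation code.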