how to compute TL-ID and TG-ID?
laolongboy opened this issue · 5 comments
Thanks.
@laolongboy
Hi,
I plan to upload the metrics code to the repository soon.
In the meantime, I uploaded the implementation of the metrics here: https://gist.github.com/rotemtzaban/d2f0a72e790a60d5390553048809e3d5
The function measure_metrics calculates the metrics for a single pair of (source, edit).
To compute on multiple videos we simply averaged the metric on all such pairs.
The code depends on the insightface library.
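The averaging step described above could be sketched roughly as follows. This is not the code from the gist: `average_over_pairs` is a hypothetical helper, and it assumes the per-pair function returns a `(tl_id, tg_id)` tuple, which may not match the actual return format of `measure_metrics`.

```python
from statistics import mean
from typing import Callable, Iterable, Tuple, TypeVar

Video = TypeVar("Video")


def average_over_pairs(
    pairs: Iterable[Tuple[Video, Video]],
    measure_fn: Callable[[Video, Video], Tuple[float, float]],
) -> Tuple[float, float]:
    """Average per-pair (TL-ID, TG-ID) scores over (source, edit) pairs.

    `measure_fn` stands in for a per-pair metric function; its tuple
    return type is an assumption made for this sketch.
    """
    scores = [measure_fn(source, edit) for source, edit in pairs]
    if not scores:
        raise ValueError("at least one (source, edit) pair is required")
    # Each metric is averaged independently across all pairs.
    tl_id = mean(score[0] for score in scores)
    tg_id = mean(score[1] for score in scores)
    return tl_id, tg_id
```

Passing the metric function as an argument keeps the sketch independent of how the videos are loaded or how faces are embedded.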
There is something odd about the similarity metrics: the values can exceed 1.0. When I calculated them on source and edited videos using another StyleGAN editing approach, I got values of 1.04 and 1.09 for the local and global metrics. I looked at your code, and I think it is all sensible, since depending on head motion in the real videos, identity can be "better" controlled in the edit. It makes me wonder how useful these two metrics are in their current form. Even the reduced form of comparing identity within each video alone (i.e., source and edited) doesn't necessarily say much, since the edited video can have higher ID preservation.
Thanks for sharing!
@yaseryacoob
Hi,
I'll preface by noting that identity based metrics all suffer from an issue where you can 'fool' the metric with 'worse' performance on the target task. For example, in simple editing methods, if you fail to change anything about the image, you'll also naturally get a high identity similarity score. In spite of this limitation, identity metrics are widely used in editing comparisons since identity is one of the few things we can at least attempt to quantify.
Our metric is not free of such flaws, of course. As you noted, it can also produce scores that are higher than 1.0. For example, if your editing method simply outputs the original frame over and over again, or if it otherwise overly smooths the video, it will lead to better identity preservation than the original video and a score higher than 1.0.
In a sense, we would argue that this 'overshooting' is actually a desirable property of the metric, since it means that it's harder to fool it by simply minimizing change, as these situations will give scores greater than 1.0. Perhaps a good way to look at preservation here is not by asking "what method gives a higher score?", but rather: "what method gives a score closer to 1.0?".
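To make the overshoot concrete, here is a minimal sketch. It assumes the metrics are a ratio: the mean identity similarity between frame pairs of the edited video divided by the same quantity for the source video, over adjacent pairs for the local variant and over all pairs for the global one. That reading, the function names, and the use of precomputed per-frame embeddings (for example from a face recognition model) are assumptions for illustration; the gist linked above is the reference implementation.

```python
import numpy as np


def _unit_rows(embeddings) -> np.ndarray:
    """Return the per-frame embeddings scaled to unit length."""
    emb = np.asarray(embeddings, dtype=np.float64)
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)


def local_similarity(embeddings) -> float:
    """Mean cosine similarity between adjacent frames."""
    e = _unit_rows(embeddings)
    return float(np.mean(np.sum(e[:-1] * e[1:], axis=1)))


def global_similarity(embeddings) -> float:
    """Mean cosine similarity over all distinct frame pairs."""
    e = _unit_rows(embeddings)
    sim = e @ e.T
    upper = np.triu_indices(len(e), k=1)  # each unordered pair once
    return float(np.mean(sim[upper]))


def tl_id(source_emb, edited_emb) -> float:
    """Assumed local score: edited consistency relative to the source."""
    return local_similarity(edited_emb) / local_similarity(source_emb)


def tg_id(source_emb, edited_emb) -> float:
    """Assumed global score: edited consistency relative to the source."""
    return global_similarity(edited_emb) / global_similarity(source_emb)
```

Under this reading, an "edit" that repeats a single frame has a similarity of 1.0 for every pair, while a real source video with head motion scores below 1.0, so the ratio ends up above 1.0. Comparing a video with itself gives exactly 1.0.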
These metrics can of course always be supplemented by other relevant evaluations, such as measuring the identity similarity against the original video (frame by frame) and ensuring that editing was meaningful by measuring the extent of attribute change with some attribute classifier / regressor.
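A frame-by-frame identity comparison against the original video might look like the sketch below. It assumes per-frame identity embeddings have already been extracted for both videos and that frames are aligned one to one; `per_frame_identity_similarity` is a hypothetical name, not part of the released code.

```python
import numpy as np


def per_frame_identity_similarity(source_emb, edited_emb) -> np.ndarray:
    """Cosine similarity between each source frame and its edited frame."""
    src = np.asarray(source_emb, dtype=np.float64)
    edt = np.asarray(edited_emb, dtype=np.float64)
    if src.shape != edt.shape:
        raise ValueError("source and edited embeddings must be frame-aligned")
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    edt = edt / np.linalg.norm(edt, axis=1, keepdims=True)
    # Row-wise dot product of unit vectors gives one score per frame.
    return np.sum(src * edt, axis=1)
```

The mean of the returned array gives a single preservation score, while the per-frame values show where identity drifts. An edit that changes nothing scores 1.0 on every frame, which is why this kind of check would need to be paired with a measure of attribute change.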
However, our work uses existing, unchanged editing methods so we concerned ourselves with evaluating temporal consistency and not with the actual frame-by-frame quality of editing.
A method which outputs the same frame over and over again is perfectly temporally consistent, and in this sense our metric will capture this fact very well. Our metrics aim to measure consistency and match it against the original video, not provide a score for the quality of editing itself.
Thank you for your comments!
@rotemtzaban
I completely agree that the ID metrics by and large are NOT delivering what we actually want, so I don't mean to be critical of TL/TG-ID. I was intrigued when I saw the table in the paper, and at first glance it didn't occur to me that the metric can stray above 1.0. I am not sure that aiming for 1.0 solves the puzzle of how to compare the IDs, especially in your paper, as the edits include age and gender. It makes me wonder if ID is even a relevant metric for these edits. Even with face expression edits, ID may get erratic, as it should, since the actual face is different, despite us mentally calling it the same ID.
Add to that the fact that the ID networks by and large are, in my opinion, not suitable for the objective. I don't think they capture the subtleties of the problem, but they are all we have.
Anyway, to avoid a philosophical discussion, I wonder whether replacing the divide operation with cross correlation is a safer choice, at least for face expression edits? Alternatively, the "ideal" criteria for performance can be posed as (1) the variance in identity over the entire edited video is zero, and (2) the closest real-frame ID distance to a given edited frame is 1.0 (for non age/gender examples), i.e., the edited frame is indistinguishable from a real one. I wish I could come up with better metrics; the editing task is likely ill-defined.
Cheers
Hi @yaseryacoob,
TL/TG-ID metrics were deliberately designed to measure temporal coherence and not the quality of identity preservation in the individual frame edits / inversions themselves.
Your proposed minimal variance metric can also target the same goal, but please note that it suffers from the same flaw and actually doesn't let you identify cases where the editing method destroys part of the original content. For example, consider the case of a network that outputs the exact same image for all frames. This network will score perfectly on the minimal variance metric. This may be intended, as such a video is indeed perfectly consistent, but it wouldn't be as clear that there's an over-smoothing effect.
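As a rough illustration of this point, one possible reading of the minimal variance criterion is the mean squared distance of the unit-normalized per-frame embeddings from their average. This definition and the name `identity_variance` are assumptions made for the sketch, not something specified in this thread.

```python
import numpy as np


def identity_variance(embeddings) -> float:
    """Mean squared distance of unit-normalized embeddings from their mean."""
    emb = np.asarray(embeddings, dtype=np.float64)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    centered = emb - emb.mean(axis=0, keepdims=True)
    return float(np.mean(np.sum(centered ** 2, axis=1)))
```

A video that repeats one frame has identical embeddings, so this value is zero (up to floating point error), which is the best possible score. Nothing in the number itself indicates that the result is smoother than the source video, whereas a score normalized by the source makes that visible as a value above 1.0.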
In other words - synthesized videos can absolutely reach better temporal coherence than real ones, and our metric does capture this. But it also gives you a way to identify this scenario if it's not desirable (which it won't be in most cases).
Your concern is of course justified. Temporal coherence alone is definitely not enough: we might want to better measure the overall realism of videos, including not only temporal consistency but also per-frame realism, and maybe even ensuring that the motion in the video makes sense. To my knowledge, a single metric covering all of those components is still an open question.