Low res result for single video inference (especially for the teeth area).
Opened this issue · 6 comments
Hello!
I'm having an issue with the single video inference.
Here is the input video (With the target audio):
video.mp4
Here is the output video:
output.mp4
I believe the low resolution can be fixed with a face enhancer; however, the result around the teeth is noticeably worse than in the demo in the README.
Any idea what would cause this?
I am not sure what the issue might be in this specific case. If I had to guess, it could either be that the video's resolution is higher than what the model was trained on, or the odd rasterization in the video.
What is the optimal resolution to have better results with the base model?
Can we fine-tune the model on a specific face for better results? I'm wondering if we could smooth the result one way or another, because the lip movement is otherwise pretty good.
So, we trained on the VoxCeleb2 dataset, which is one of the standard datasets but is not HD quality. The faces are cropped from each frame (forehead to chin, cheek to cheek) and used at 128x128 resolution, so videos sharper than that may suffer.
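For reference, matching the model's 128x128 face-crop resolution before inference might look like the sketch below. This is a minimal numpy illustration using block averaging; the function name and approach are my own assumptions, not the repo's actual preprocessing code:

```python
import numpy as np

def resize_to_model_res(face_crop: np.ndarray, out_size: int = 128) -> np.ndarray:
    """Downsample a square face crop to out_size x out_size by block averaging.

    Assumes the crop side is an integer multiple of out_size (e.g. 512 -> 128).
    A real pipeline would use a proper resampler (e.g. cv2.resize with
    INTER_AREA), but this shows the idea with no extra dependencies.
    """
    h, w = face_crop.shape[:2]
    assert h == w and h % out_size == 0, "expected a square crop, side divisible by out_size"
    f = h // out_size
    # Split each axis into (out_size, f) blocks and average within each block.
    return face_crop.reshape(out_size, f, out_size, f, -1).mean(axis=(1, 3))

crop = np.random.rand(512, 512, 3)   # stand-in for a 512x512 RGB face crop
small = resize_to_model_res(crop)
print(small.shape)                   # (128, 128, 3)
```

Running inference on crops already at (or near) this resolution avoids feeding the model sharper detail than it ever saw during training.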
I guess one could easily fine-tune the model for a specific video (or even use LoRA-style fine-tuning). We have not tried it. I would guess that it would be able to get sharper features than before.
(It would probably be even better to train/finetune the model on sharper datasets like AVSpeech or something better.)
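To illustrate the LoRA-style idea mentioned above: freeze the pretrained weights and train only a low-rank update on top of them. This is a hypothetical numpy sketch; the class, rank, and scaling are illustrative assumptions, not code from this repo:

```python
import numpy as np

class LoRALinear:
    """LoRA-style adapter: effective weight is W + (alpha / r) * B @ A.

    W stays frozen; only the small matrices A (r x d_in) and
    B (d_out x r) would be trained during fine-tuning.
    """
    def __init__(self, w: np.ndarray, r: int = 4, alpha: float = 8.0):
        d_out, d_in = w.shape
        self.w = w                                # frozen pretrained weight
        self.a = 0.01 * np.random.randn(r, d_in)  # trainable down-projection
        self.b = np.zeros((d_out, r))             # trainable up-projection, zero-init
        self.scale = alpha / r

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # x: (batch, d_in) -> (batch, d_out); low-rank update added to W.
        return x @ (self.w + self.scale * self.b @ self.a).T
```

Because B is zero-initialized, the adapted layer starts out identical to the frozen base layer, so fine-tuning on a specific face begins exactly from the pretrained behavior and only learns a small per-identity correction.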
We tried with a lower resolution and the result is similar...
How should we proceed to train the model? Can we do it with images of the face only, or should we also provide the corresponding initial audio to map the lip movement?
Hi, please let me know if you find a fix for this; I'm having a similar issue.
Same issue