Project page | Paper | Video | Arxiv
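- Maximum output resolution: 256x256
- Maximum duration: unlimited
- Output video frame rate: 22 fps
- Lip-sync accuracy is poor for Chinese speech input, presumably because of the dataset used to train the model
- Distortion is easy to notice when the background is complex
- Frontal faces give the best results, presumably because of the training dataset and overlapping facial landmarks
- Frontal face, with easily recognizable facial features (especially the mouth shape) and a simple background
- Talking scenarios such as news broadcasts, online teaching, speeches, and interviews, where the person is shown only from the shoulders up and does not move much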
- OS: Ubuntu 20.04
- GPU: NVIDIA GeForce® RTX 2070 SUPER
- GPU Memory: 8 GB
- Memory: 32 GB
- CUDA Driver: 470.82.00
- CUDA: 11.4
- cuDNN: 8.x.x
- Python version (local): 3.8.10
- Python version (docker): 3.6.9
- Model size (GPU): 986.6 MB
- GPU memory usage: about 5 GB (4877 MB)
- Inference time

No. | Input audio length | Time taken |
---|---|---|
1 | 4 s | 11.7 s |
2 | 8 s | 17.6 s |
3 | 9 s | 17.9 s |
4 | 17 s | 27.8 s |
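
The four timings above are enough for a rough back-of-the-envelope model of inference cost. The sketch below fits a straight line to them with ordinary least squares, which suggests a fixed overhead of roughly 7 s plus about 1.2 s of processing per second of input audio on the hardware listed. The helper names `fit_line` and `estimate_seconds` are illustrative and not part of this repository, and four data points only support a coarse estimate.

```python
# Timings copied from the measurements above: (audio length in s, time taken in s).
measurements = [(4, 11.7), (8, 17.6), (9, 17.9), (17, 27.8)]


def fit_line(points):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


intercept, slope = fit_line(measurements)


def estimate_seconds(audio_seconds):
    """Rough estimate only; values outside the measured 4-17 s range are a guess."""
    return intercept + slope * audio_seconds


print(f"fixed overhead ~ {intercept:.1f} s, ~ {slope:.2f} s per second of audio")
```

The estimate applies only to the RTX 2070 SUPER setup described here; other GPUs will give different constants.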
Download the following pre-trained models to the examples/ckpt folder to test your own animation.
Model | Link to the model |
---|---|
Voice Conversion | Link |
Speech Content Module | Link |
Speaker-aware Module | Link |
Image2Image Translation Module | Link |
Non-photorealistic Warping (.exe) | Link |
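- Install Docker and the NVIDIA Container Toolkit on the host to use GPU-enabled Docker.
- If you want to use CUDA and the GPU, check whether the host's CUDA driver supports CUDA 11.1. If it does not, use a CUDA image with a lower version and change the installed torch version to match.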
- Build the image with
docker build -t makeittalk:latest .
- Run the container with
docker run -it --gpus all makeittalk:latest bash
- Go into `./src`.
- Put the image (256*256) in `src/examples`.
- Put the audio file in `src/examples` (make sure `src/examples` contains only one .wav file).
- Run `python main_end2end.py --jpg <portrait_file.jpg>`.
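
The steps above require exactly one .wav file in the examples folder, which is easy to get wrong after several runs. As a minimal sketch, the following pre-flight check verifies that condition and assembles the command from the last step. The helpers `find_single_wav` and `build_command` are hypothetical and not part of this repository.

```python
from pathlib import Path


def find_single_wav(examples_dir):
    """Return the only .wav file in examples_dir, or raise if there is not exactly one."""
    wavs = sorted(Path(examples_dir).glob("*.wav"))
    if len(wavs) != 1:
        raise ValueError(
            f"expected exactly one .wav file in {examples_dir}, found {len(wavs)}"
        )
    return wavs[0]


def build_command(portrait_file):
    """Build the command from the steps above; it is meant to be run from inside src/."""
    return ["python", "main_end2end.py", "--jpg", portrait_file]
```

From inside `src/`, one would call `find_single_wav("examples")` first and then pass the result of `build_command(...)` to `subprocess.run(..., check=True)`, so a stray audio file is reported before the model is loaded.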
- Create a Cloudinary account before using the API service.
- Install packages

      $ cd src
      $ pip install -r requirement.txt

- Set Cloudinary environment variables

      $ export CLOUDINARY_API_KEY='123451234512345'
      $ export CLOUDINARY_API_SECRET='AsdfghjklAsdfghjklAsdfghjkl'
      $ export CLOUDINARY_CLOUD_NAME='mycloud123'

- Run API service locally

      $ uvicorn main:app --host 0.0.0.0 --port 8080

- Use the image `makeittalk:latest` built in the previous part.

      $ docker run --gpus all -p 127.0.0.1:8080:80 -it makeittalk:latest bash
      root@9b3825bb8d5d:/work/src$ export CLOUDINARY_API_KEY='123451234512345'
      root@9b3825bb8d5d:/work/src$ export CLOUDINARY_API_SECRET='AsdfghjklAsdfghjklAsdfghjkl'
      root@9b3825bb8d5d:/work/src$ export CLOUDINARY_CLOUD_NAME='mycloud123'
      root@9b3825bb8d5d:/work/src$ uvicorn main:app --host 0.0.0.0 --port 80

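
Both the local and the Docker setup depend on three Cloudinary environment variables being exported before `uvicorn` starts. A small check like the one below can catch a missing variable early. It is only a sketch: `missing_cloudinary_vars` is a hypothetical helper, and how the service itself reads these variables is not shown here.

```python
import os

# Variable names taken from the export commands above.
REQUIRED_VARS = (
    "CLOUDINARY_API_KEY",
    "CLOUDINARY_API_SECRET",
    "CLOUDINARY_CLOUD_NAME",
)


def missing_cloudinary_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_cloudinary_vars()` with no arguments inspects the current process environment; an empty list means all three variables are set.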
URL | HTTP Method | Request | Response |
---|---|---|---|
/audio2vid | POST | { "audio": String, "image": String } | { "output_url": String } |
// sample input
{
"audio": "https://your.audio/audio.wav",
"image": "https://your.image/image.jpg"
}
// sample output
{
"output_url": "http://res.cloudinary.com/mycloud123/video/upload/v0123456789/makeittalk-outputs/rhuifh83hf4xnf8944j3.mp4"
}
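
To illustrate the request and response shapes above, here is a minimal client sketch that uses only the Python standard library. It assumes the service accepts a JSON request body and is reachable at `http://127.0.0.1:8080`, as in the port mapping shown earlier. The function names and the 300-second timeout are illustrative choices, not part of this repository.

```python
import json
import urllib.request


def build_audio2vid_request(base_url, audio_url, image_url):
    """Build a POST request for /audio2vid with a JSON body (assumed encoding)."""
    payload = json.dumps({"audio": audio_url, "image": image_url}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/audio2vid",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def request_video(base_url, audio_url, image_url, timeout=300):
    """Send the request and return the "output_url" field of the JSON response."""
    req = build_audio2vid_request(base_url, audio_url, image_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["output_url"]
```

For example, `request_video("http://127.0.0.1:8080", "https://your.audio/audio.wav", "https://your.image/image.jpg")` would return the Cloudinary URL of the generated video, provided the service is running and both input URLs are reachable. The timeout is generous because inference alone takes tens of seconds.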
We would like to thank Timothy Langlois for the narration, and Kaizhi Qian for the help with the voice conversion module. We thank Jakub Fiser for implementing the real-time GPU version of the triangle morphing algorithm. We thank Daichi Ito for sharing the caricature image and Dave Werner for Wilk, the gruff but ultimately lovable puppet.
This research is partially funded by NSF (EAGER-1942069) and a gift from Adobe. Our experiments were performed in the UMass GPU cluster obtained under the Collaborative Fund managed by the MassTech Collaborative.