Demo: MakeItTalk

Project page | Paper | Video | Arxiv

Output Examples

Limitations

Technical Limitations

Maximum output resolution: 256x256
Maximum duration: unlimited
Output video frame rate: 22 fps

Quality

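Lip-sync accuracy is noticeably worse for Chinese audio input, presumably because of the dataset used to train the model.
Distortion is easy to spot when the background is complex.
Frontal faces give better results, presumably related to the training dataset and to overlapping facial landmarks.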

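Suitable Use Cases

Frontal faces with easily recognizable facial features (especially the mouth shape) and a simple background.
Speaking scenarios such as news broadcasts, online teaching, speeches, and interviews, where the person is shown only from the shoulders up and does not move much.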

Test Hardware Specifications

OS: Ubuntu 20.04
GPU: NVIDIA GeForce® RTX 2070 SUPER
GPU Memory: 8 GB
Memory: 32 GB
CUDA Driver: 470.82.00
CUDA: 11.4
cuDNN: 8.x.x
Python version (local): 3.8.10
Python version (docker): 3.6.9

Actual Resource Usage

Model size (GPU): 986.6 MB
GPU memory usage: approximately 5 GB (4877 MB)

Inference Time

#	Input audio length	Time taken
1	4s	11.7s
2	8s	17.6s
3	9s	17.9s
4	17s	27.8s
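
The four measurements above look like a fixed startup cost plus a cost that grows with the audio length. As a rough sketch only (an ordinary least-squares line through just these four points, so the numbers are indicative rather than a benchmark), the relationship can be estimated as follows. The helper name `fit_line` is made up for illustration.

```python
# Measurements copied from the table above: (audio length in s, time taken in s).
samples = [(4, 11.7), (8, 17.6), (9, 17.9), (17, 27.8)]

def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(samples)
print(f"about {intercept:.1f}s fixed overhead + {slope:.2f}s per second of audio")
```

On these four points the fit gives roughly 7.1 s of fixed overhead plus 1.22 s of processing per second of audio, on the test hardware listed above.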

Pre-trained Models

Download the following pre-trained models to the examples/ckpt folder to test your own animation.

Model	Link to the model
Voice Conversion	Link
Speech Content Module	Link
Speaker-aware Module	Link
Image2Image Translation Module	Link
Non-photorealistic Warping (.exe)	Link

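Installation (Docker)

Install Docker and the NVIDIA Container Toolkit on the host to use GPU-enabled Docker.
To use CUDA and the GPU, check that the host's CUDA driver supports CUDA 11.1. If it does not, use a CUDA image with a lower version and change the installed torch version accordingly.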
Build the image with docker build -t makeittalk:latest .
Run the container with docker run -it --gpus all makeittalk:latest bash

Test Model

Go to ./src.
Put the image (256x256) into src/examples.
Put the audio file into src/examples (make sure src/examples contains only one .wav file).
Run python main_end2end.py --jpg <portrait_file.jpg>
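
The steps above require exactly one .wav file in src/examples. A minimal pre-flight check, run from ./src before starting main_end2end.py, could look like the sketch below. The function name `find_single_wav` is hypothetical and not part of the repository.

```python
from pathlib import Path

def find_single_wav(examples_dir):
    """Return the only .wav file in examples_dir, or raise if there is not exactly one."""
    wavs = sorted(
        p for p in Path(examples_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".wav"
    )
    if len(wavs) != 1:
        raise ValueError(
            f"expected exactly one .wav file in {examples_dir}, found {len(wavs)}"
        )
    return wavs[0]

if __name__ == "__main__":
    # "examples" is the folder named in the steps above, relative to ./src.
    try:
        print("audio input:", find_single_wav("examples"))
    except (OSError, ValueError) as err:
        print("check failed:", err)
```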

MakeItTalk API Service

Run MakeItTalk API Service Locally

Create a Cloudinary account before using the API service.

Install packages

$ cd src

$ pip install -r requirement.txt

Set Cloudinary environment variables

$ export CLOUDINARY_API_KEY='123451234512345'
$ export CLOUDINARY_API_SECRET='AsdfghjklAsdfghjklAsdfghjkl'
$ export CLOUDINARY_CLOUD_NAME='mycloud123'

Run API service locally

$ uvicorn main:app --host 0.0.0.0 --port 8080

Run MakeItTalk API Service in Docker

Use the makeittalk:latest image built in the previous section.

$ docker run --gpus all -p 127.0.0.1:8080:80 -it makeittalk:latest bash

root@9b3825bb8d5d:/work/src$ export CLOUDINARY_API_KEY='123451234512345'
root@9b3825bb8d5d:/work/src$ export CLOUDINARY_API_SECRET='AsdfghjklAsdfghjklAsdfghjkl'
root@9b3825bb8d5d:/work/src$ export CLOUDINARY_CLOUD_NAME='mycloud123'
root@9b3825bb8d5d:/work/src$ uvicorn main:app --host 0.0.0.0 --port 80
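
Both the local and the Docker setup export the same three Cloudinary variables before starting uvicorn. A small check like the sketch below can confirm they are set in the current shell. It only inspects the environment and makes no assumption about how the service itself reads them. The function name `missing_cloudinary_vars` is hypothetical.

```python
import os

# Variable names taken from the export commands above.
REQUIRED_VARS = (
    "CLOUDINARY_API_KEY",
    "CLOUDINARY_API_SECRET",
    "CLOUDINARY_CLOUD_NAME",
)

def missing_cloudinary_vars(env=None):
    """Return the required variable names that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]

if __name__ == "__main__":
    missing = missing_cloudinary_vars()
    if missing:
        print("missing environment variables:", ", ".join(missing))
    else:
        print("all Cloudinary variables are set")
```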

API Usage

URL	HTTP Method	Request	Response
/audio2vid	POST	{ "audio": String, "image": String }	{ "output_url": String }

API Usage Example

// sample input
{
  "audio": "https://your.audio/audio.wav",
  "image": "https://your.image/image.jpg"
}
// sample output
{
  "output_url": "http://res.cloudinary.com/mycloud123/video/upload/v0123456789/makeittalk-outputs/rhuifh83hf4xnf8944j3.mp4"
}
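
As an illustration of the request and response shapes above, here is a minimal client sketch using only the Python standard library. It assumes the endpoint accepts a JSON body with a Content-Type of application/json, which the table suggests but does not state. The function names `build_request` and `audio2vid` and the 300-second timeout are made up for this example.

```python
import json
import urllib.request

def build_request(base_url, audio_url, image_url):
    """Build a POST request for /audio2vid with a JSON body."""
    body = json.dumps({"audio": audio_url, "image": image_url}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/audio2vid",
        data=body,
        headers={"Content-Type": "application/json"},  # assumed content type
        method="POST",
    )

def audio2vid(base_url, audio_url, image_url, timeout=300):
    """Send the request and return the output_url field of the JSON response."""
    req = build_request(base_url, audio_url, image_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["output_url"]
```

With the service listening on port 8080 as in the sections above, calling `audio2vid("http://127.0.0.1:8080", "https://your.audio/audio.wav", "https://your.image/image.jpg")` would return the Cloudinary URL of the generated video.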

License

Acknowledgement

We would like to thank Timothy Langlois for the narration, and Kaizhi Qian for the help with the voice conversion module. We thank Jakub Fiser for implementing the real-time GPU version of the triangle morphing algorithm. We thank Daichi Ito for sharing the caricature image and Dave Werner for Wilk, the gruff but ultimately lovable puppet.

This research is partially funded by NSF (EAGER-1942069) and a gift from Adobe. Our experiments were performed in the UMass GPU cluster obtained under the Collaborative Fund managed by the MassTech Collaborative.

livingbio/MakeItTalk