MakeItTalk

Demo: MakeItTalk

Project page | Paper | Video | Arxiv

Output Example

Limitations

Technical Limitation

  • Maximum output resolution: 256x256
  • Maximum duration: unlimited
  • Output video frame rate: 22 fps

Quality

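  • Lip-sync accuracy is noticeably worse for Chinese speech input, presumably because of the dataset used to train the model.
  • Warping artifacts are easy to spot when the background is complex.
  • Frontal faces give the best results, presumably related to the training dataset and to overlapping facial landmarks.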

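Suitable Use Cases

  • Frontal faces with easily recognizable facial features (especially the mouth shape) and a simple background
  • Speaking scenarios such as news broadcasts, online teaching, speeches, and interviews, where the person is shown only from the shoulders up and does not move much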

Test Hardware Specifications

  • OS: Ubuntu 20.04
  • GPU: NVIDIA GeForce® RTX 2070 SUPER
  • GPU Memory: 8 GB
  • Memory: 32 GB
  • CUDA Driver: 470.82.00
  • CUDA: 11.4
  • cuDNN: 8.x.x
  • Python version (local): 3.8.10
  • Python version (docker): 3.6.9

Actual Resource Usage

  • Model size (GPU): 986.6 MB
  • GPU memory usage: about 5 GB (4877 MB)
  • Inference time
    Run   Input audio length   Time taken
    1     4 s                  11.7 s
    2     8 s                  17.6 s
    3     9 s                  17.9 s
    4     17 s                 27.8 s
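
To get a feel for how inference time scales with audio length, the four timings above can be fitted with a straight line. The following is only a rough sketch: it assumes the relationship is approximately linear, it uses nothing but the four measurements listed here, and the helper names (`fit_line`, `estimate_inference_seconds`) are illustrative and not part of the project.

```python
# Measurements copied from the timings above: (audio length in s, inference time in s).
measurements = [(4, 11.7), (8, 17.6), (9, 17.9), (17, 27.8)]


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(measurements)


def estimate_inference_seconds(audio_seconds):
    """Rough estimate only: extrapolates from four data points on one machine."""
    return slope * audio_seconds + intercept


print(f"about {slope:.2f} s per second of audio, plus {intercept:.1f} s fixed cost")
print(f"30 s of audio: roughly {estimate_inference_seconds(30):.0f} s")
```

On these four points the fit suggests roughly 1.2 s of processing per second of audio on top of about 7 s of fixed overhead. The numbers apply only to the test hardware listed above and should not be read as a general benchmark.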

Pre-trained Models

Download the following pre-trained models to the examples/ckpt folder to test your own animation.

Model Link to the model
Voice Conversion Link
Speech Content Module Link
Speaker-aware Module Link
Image2Image Translation Module Link
Non-photorealistic Warping (.exe) Link

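Installation (Docker)

  • Install Docker and the NVIDIA Container Toolkit on the host to run GPU-enabled Docker containers.
  • To use CUDA and the GPU, check that the host's CUDA driver supports CUDA 11.1. If it does not, use an image with a lower CUDA version and change the installed torch version accordingly.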
  • Build the image with docker build -t makeittalk:latest .
  • Run the container with docker run -it --gpus all makeittalk:latest bash
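
The driver check in the steps above can be automated. The sketch below assumes the CUDA version is read from the header that `nvidia-smi` prints (its `CUDA Version: X.Y` field) and compares it with the 11.1 requirement mentioned above. The function names are hypothetical, and the sample line simply mirrors the driver and CUDA versions listed in the hardware section.

```python
import re

REQUIRED_CUDA = (11, 1)  # CUDA version mentioned in the installation notes above


def parse_cuda_version(nvidia_smi_output):
    """Extract the 'CUDA Version: X.Y' field from nvidia-smi's header text."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def driver_supports(nvidia_smi_output, required=REQUIRED_CUDA):
    """True if the reported CUDA version is at least the required one."""
    version = parse_cuda_version(nvidia_smi_output)
    return version is not None and version >= required


# Sample header line using the driver and CUDA versions from the hardware section.
sample = "| NVIDIA-SMI 470.82.00    Driver Version: 470.82.00    CUDA Version: 11.4     |"
print(driver_supports(sample))
```

On a real host, the input text could come from `subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout`.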

Test Model

  • Go to ./src
  • Put the image (256x256) in src/examples
  • Put the audio file in src/examples (make sure src/examples contains only one .wav file)
  • Run python main_end2end.py --jpg <portrait_file.jpg>
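
Because the steps above depend on src/examples containing the portrait and exactly one .wav file, a quick pre-flight check can catch mistakes before the model runs. This is a minimal sketch: `check_examples_dir` is a hypothetical helper and not part of the repository, and it assumes the portrait is passed by file name relative to src/examples.

```python
from pathlib import Path


def check_examples_dir(examples_dir, portrait_name):
    """Return a list of problems found; an empty list means the folder looks ready."""
    folder = Path(examples_dir)
    if not folder.is_dir():
        return [f"{folder} is not a directory"]
    problems = []
    if not (folder / portrait_name).is_file():
        problems.append(f"portrait {portrait_name} not found in {folder}")
    wav_files = sorted(p.name for p in folder.glob("*.wav"))
    if len(wav_files) != 1:
        problems.append(
            f"expected exactly one .wav file, found {len(wav_files)}: {wav_files}"
        )
    return problems


# Example call, assuming the current directory is the repository root.
for problem in check_examples_dir("src/examples", "portrait_file.jpg"):
    print("warning:", problem)
```

The check only looks at file names. It does not verify that the image is actually 256x256, which would need an image library.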

MakeItTalk API Service

Run MakeItTalk API Service Locally

  • Create a Cloudinary account before using the API service.

  • Install packages

    $ cd src
    
    $ pip install -r requirement.txt
  • Set Cloudinary environment variables

    $ export CLOUDINARY_API_KEY='123451234512345'
    $ export CLOUDINARY_API_SECRET='AsdfghjklAsdfghjklAsdfghjkl'
    $ export CLOUDINARY_CLOUD_NAME='mycloud123'
  • Run API service locally

    $ uvicorn main:app --host 0.0.0.0 --port 8080
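
A missing Cloudinary variable may only surface later, for example when the service tries to upload its output. The sketch below checks the three variables exported above before the service is started. `missing_cloudinary_vars` is a hypothetical helper, and how the service itself reads these variables is not shown in this document.

```python
import os

REQUIRED_VARS = (
    "CLOUDINARY_API_KEY",
    "CLOUDINARY_API_SECRET",
    "CLOUDINARY_CLOUD_NAME",
)


def missing_cloudinary_vars(environ=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]


missing = missing_cloudinary_vars()
if missing:
    print("Set these before starting the service:", ", ".join(missing))
else:
    print("Cloudinary environment looks complete.")
```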

Run MakeItTalk API Service in Docker

  • Use the makeittalk:latest image built in the previous section.

    $ docker run --gpus all -p 127.0.0.1:8080:80 -it makeittalk:latest bash
    
    root@9b3825bb8d5d:/work/src$ export CLOUDINARY_API_KEY='123451234512345'
    root@9b3825bb8d5d:/work/src$ export CLOUDINARY_API_SECRET='AsdfghjklAsdfghjklAsdfghjkl'
    root@9b3825bb8d5d:/work/src$ export CLOUDINARY_CLOUD_NAME='mycloud123'
    root@9b3825bb8d5d:/work/src$ uvicorn main:app --host 0.0.0.0 --port 80

API Usage

URL HTTP Method Request Response
/audio2vid POST { "audio": String, "image": String } { "output_url": String }

API Usage Example

// sample input
{
  "audio": "https://your.audio/audio.wav",
  "image": "https://your.image/image.jpg"
}
// sample output
{
  "output_url": "http://res.cloudinary.com/mycloud123/video/upload/v0123456789/makeittalk-outputs/rhuifh83hf4xnf8944j3.mp4"
}
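
A client call against the endpoint described above might look like the following sketch. It relies on several assumptions: the service listens on http://127.0.0.1:8080 (the port used in the run commands above), the request body is sent as JSON, and the response is a JSON object with an `output_url` field as in the sample. The function names are illustrative only.

```python
import json
import urllib.request

# Assumption: the service is reachable on the port used in the run commands above.
BASE_URL = "http://127.0.0.1:8080"


def build_audio2vid_request(audio_url, image_url, base_url=BASE_URL):
    """Build a POST request for /audio2vid with a JSON body (assumed encoding)."""
    body = json.dumps({"audio": audio_url, "image": image_url}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/audio2vid",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def request_video(audio_url, image_url, base_url=BASE_URL, timeout=300):
    """Send the request and return the 'output_url' field of the JSON response."""
    request = build_audio2vid_request(audio_url, image_url, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["output_url"]
```

With the service running, `request_video("https://your.audio/audio.wav", "https://your.image/image.jpg")` would return the URL of the generated video. The generous timeout reflects the inference times reported earlier, which grow with audio length.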

Acknowledgement

We would like to thank Timothy Langlois for the narration, and Kaizhi Qian for the help with the voice conversion module. We thank Jakub Fiser for implementing the real-time GPU version of the triangle morphing algorithm. We thank Daichi Ito for sharing the caricature image and Dave Werner for Wilk, the gruff but ultimately lovable puppet.

This research is partially funded by NSF (EAGER-1942069) and a gift from Adobe. Our experiments were performed in the UMass GPU cluster obtained under the Collaborative Fund managed by the MassTech Collaborative.