FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds. An AI Foley artist that adds vivid, synchronized sound effects to your silent videos 😝

FoleyCrafter

Sound effects are the unsung heroes of cinema and gaming, enhancing realism, impact, and emotional depth for an immersive audiovisual experience. FoleyCrafter is a video-to-audio generation framework that produces realistic sound effects that are semantically relevant to, and synchronized with, the input video.

Your star is our fuel! We're revving up the engines with it!

FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds

Yiming Zhang, Yicheng Gu, Yanhong Zeng†, Zhening Xing, Yuancheng Wang, Zhizheng Wu, Kai Chen†

(†Corresponding Author)

What's New

  • A more powerful one 😝.
  • Release training code.
  • 2024/07/01 Release the model and code of FoleyCrafter.

Setup

Prepare Environment

Use the following commands to install the dependencies:

# install conda environment
conda env create -f requirements/environment.yaml
conda activate foleycrafter

# install Git LFS for checkpoint downloads
conda install git-lfs
git lfs install

Download Checkpoints

The checkpoints will be downloaded automatically by running inference.py.

You can also download them manually with the following commands.

  • Download the text-to-audio base model. We use Auffusion:
    git clone https://huggingface.co/auffusion/auffusion-full-no-adapter checkpoints/auffusion
  • Download FoleyCrafter:
    git clone https://huggingface.co/ymzhang319/FoleyCrafter checkpoints/

    Arrange the checkpoints as follows:

    └── checkpoints
        ├── semantic
        │   └── semantic_adapter.bin
        ├── vocoder
        │   ├── vocoder.pt
        │   └── config.json
        ├── temporal_adapter.ckpt
        └── timestamp_detector.pth.tar
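Before running inference, it can help to confirm that the folder matches the layout above. The following is a minimal sketch, not part of the FoleyCrafter code base: the helper name `missing_checkpoints` is hypothetical, and the file list is copied from the tree above (it does not cover the separate `checkpoints/auffusion` clone).

```python
from pathlib import Path

# Relative paths copied from the checkpoint layout above.
EXPECTED_FILES = [
    "semantic/semantic_adapter.bin",
    "vocoder/vocoder.pt",
    "vocoder/config.json",
    "temporal_adapter.ckpt",
    "timestamp_detector.pth.tar",
]


def missing_checkpoints(root="checkpoints"):
    """Return the expected files that are absent under `root` (hypothetical helper)."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoint files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected checkpoint files are present.")
```

Run it from the repository root. An empty result only means the files exist, not that the downloads are complete or uncorrupted.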
    

    Gradio demo

    You can launch the Gradio interface for FoleyCrafter by running the following command:

    python app.py --share

    Inference

    Video To Audio Generation

    python inference.py --save_dir=output/sora/

    Results:

    Input Video    Generated Audio
    0.mp4          0.mp4
    1.mp4          1.mp4
    2.mp4          2.mp4
    3.mp4          3.mp4

    • Temporal Alignment with Visual Cues
    python inference.py \
    --temporal_align \
    --input=input/avsync \
    --save_dir=output/avsync/

    Results:

    Ground Truth    Generated Audio
    0.mp4           0.mp4
    1.mp4           1.mp4
    2.mp4           2.mp4

    Text-based Video to Audio Generation

    • Using Prompt
    # case1
    python inference.py \
    --input=input/PromptControl/case1/ \
    --seed=10201304011203481429 \
    --save_dir=output/PromptControl/case1/
    
    python inference.py \
    --input=input/PromptControl/case1/ \
    --seed=10201304011203481429 \
    --prompt='noisy, people talking' \
    --save_dir=output/PromptControl/case1_prompt/
    
    # case2
    python inference.py \
    --input=input/PromptControl/case2/ \
    --seed=10021049243103289113 \
    --save_dir=output/PromptControl/case2/
    
    python inference.py \
    --input=input/PromptControl/case2/ \
    --seed=10021049243103289113 \
    --prompt='seagulls' \
    --save_dir=output/PromptControl/case2_prompt/

    Results (generated audio):

    Without Prompt    Prompt: noisy, people talking
    0.mp4             0.mp4

    Without Prompt    Prompt: seagulls
    0.mp4             0.mp4

    • Using Negative Prompt
    # case3
    python inference.py \
    --input=input/PromptControl/case3/ \
    --seed=10041042941301238011 \
    --save_dir=output/PromptControl/case3/
    
    python inference.py \
    --input=input/PromptControl/case3/ \
    --seed=10041042941301238011 \
    --nprompt='river flows' \
    --save_dir=output/PromptControl/case3_nprompt/
    
    # case4
    python inference.py \
    --input=input/PromptControl/case4/ \
    --seed=10014024412012338096 \
    --save_dir=output/PromptControl/case4/
    
    python inference.py \
    --input=input/PromptControl/case4/ \
    --seed=10014024412012338096 \
    --nprompt='noisy, wind noise' \
    --save_dir=output/PromptControl/case4_nprompt/
    

    Results (generated audio):

    Without Prompt    Negative Prompt: river flows
    0.mp4             0.mp4

    Without Prompt    Negative Prompt: noisy, wind noise
    0.mp4             0.mp4
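Each case above runs the same input and seed twice, once without and once with a prompt or negative prompt, so the outputs are meant to be compared side by side. The following is a rough sketch for lining up the two result folders. It assumes both runs write files with matching names (as the `0.mp4` entries above suggest). The helper name `pair_outputs` and the `*.mp4` pattern are illustrative and not part of FoleyCrafter.

```python
from pathlib import Path


def pair_outputs(baseline_dir, variant_dir, pattern="*.mp4"):
    """Pair files that share a name in two output folders (hypothetical helper)."""
    baseline = {p.name: p for p in Path(baseline_dir).glob(pattern)}
    variant = {p.name: p for p in Path(variant_dir).glob(pattern)}
    # Keep only the names present in both runs, in a stable order.
    shared = sorted(baseline.keys() & variant.keys())
    return [(baseline[name], variant[name]) for name in shared]


if __name__ == "__main__":
    pairs = pair_outputs(
        "output/PromptControl/case1/",
        "output/PromptControl/case1_prompt/",
    )
    for without_prompt, with_prompt in pairs:
        print(f"{without_prompt}  <->  {with_prompt}")
```

Files that exist in only one of the two folders are skipped, so a missing pair usually means one of the runs did not finish.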

    Commandline Usage Parameters

    options:
      -h, --help            show this help message and exit
      --prompt PROMPT       prompt for audio generation
      --nprompt NPROMPT     negative prompt for audio generation
      --seed SEED           random seed
      --temporal_align TEMPORAL_ALIGN
                            use temporal adapter or not
      --temporal_scale TEMPORAL_SCALE
                            temporal align scale
      --semantic_scale SEMANTIC_SCALE
                            visual content scale
      --input INPUT         input video folder path
      --ckpt CKPT           checkpoints folder path
      --save_dir SAVE_DIR   generation result save path
      --pretrain PRETRAIN   generator checkpoint path
      --device DEVICE
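When scripting many runs, the options above can be assembled programmatically instead of by hand. The following is a minimal sketch, assuming `inference.py` is invoked from the repository root. The helper name `build_command` is hypothetical, and the set of accepted names is copied from the options list above. Passing `True` emits a bare switch, which follows the `--temporal_align` usage shown in the examples (the help text lists a value for that flag, so check your local `--help` output).

```python
import shlex

# Option names copied from the help output above; anything else is rejected.
KNOWN_OPTIONS = {
    "prompt", "nprompt", "seed", "temporal_align", "temporal_scale",
    "semantic_scale", "input", "ckpt", "save_dir", "pretrain", "device",
}


def build_command(**options):
    """Build an argument list for inference.py (hypothetical helper)."""
    unknown = sorted(set(options) - KNOWN_OPTIONS)
    if unknown:
        raise ValueError(f"unknown options: {unknown}")
    argv = ["python", "inference.py"]
    for name, value in options.items():
        if value is None or value is False:
            continue  # option not requested
        if value is True:
            argv.append(f"--{name}")  # bare switch, as in the examples
        else:
            argv.append(f"--{name}={value}")
    return argv


if __name__ == "__main__":
    cmd = build_command(
        input="input/PromptControl/case2/",
        seed=10021049243103289113,
        prompt="seagulls",
        save_dir="output/PromptControl/case2_prompt/",
    )
    # Print a shell-quoted form for inspection.
    print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues with prompts that contain spaces or commas.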

    BibTeX

    @misc{zhang2024pia,
      title={FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds},
      author={Yiming Zhang and Yicheng Gu and Yanhong Zeng and Zhening Xing and Yuancheng Wang and Zhizheng Wu and Kai Chen},
      year={2024},
      eprint={2407.01494},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
    }
    

    Contact Us

    Yiming Zhang: zhangyiming@pjlab.org.cn

    Yicheng Gu: yichenggu@link.cuhk.edu.cn

    Yanhong Zeng: zengyanhong@pjlab.org.cn

    LICENSE

    Please see the Apache-2.0 license for details.

    Acknowledgements

    The code is built upon Auffusion, CondFoleyGen and SpecVQGAN.

    We also recommend Amphion, a toolkit for audio, music, and speech generation 💝.