FoleyCrafter

Sound effects are the unsung heroes of cinema and gaming, enhancing realism, impact, and emotional depth for an immersive audiovisual experience. FoleyCrafter is a video-to-audio generation framework that produces realistic sound effects that are semantically relevant to, and synchronized with, the input video.

Your star is our fuel! We're revving up the engines with it!

FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds

Yiming Zhang, Yicheng Gu, Yanhong Zeng†, Zhening Xing, Yuancheng Wang, Zhizheng Wu, Kai Chen†

(†Corresponding Author)

What's New

A more powerful one 😝.
Release training code.
2024/07/01 Release the model and code of FoleyCrafter.

Setup

Prepare Environment

Use the following commands to install the dependencies:

# install conda environment
conda env create -f requirements/environment.yaml
conda activate foleycrafter

# install Git LFS to download the checkpoints
conda install git-lfs
git lfs install

Download Checkpoints

The checkpoints will be downloaded automatically by running inference.py.

You can also download them manually using the following commands.

Download the text-to-audio base model. We use Auffusion.

git clone https://huggingface.co/auffusion/auffusion-full-no-adapter checkpoints/auffusion

Download FoleyCrafter

git clone https://huggingface.co/ymzhang319/FoleyCrafter checkpoints/

Arrange the checkpoints as follows:

└── checkpoints
    ├── semantic
    │   └── semantic_adapter.bin
    ├── vocoder
    │   ├── vocoder.pt
    │   └── config.json
    ├── temporal_adapter.ckpt
    └── timestamp_detector.pth.tar
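
After a manual download, it can help to confirm that the files ended up where the layout above expects them. The following is a minimal sketch, not part of the FoleyCrafter code base: the helper name `missing_checkpoints` and the `EXPECTED_FILES` list are hypothetical, and the paths are copied from the layout above (the Auffusion base model directory is not checked because the layout does not list its contents).

```python
from pathlib import Path

# Relative paths copied from the checkpoint layout above.
# This list is an assumption of this sketch, not something the project exports.
EXPECTED_FILES = [
    "semantic/semantic_adapter.bin",
    "vocoder/vocoder.pt",
    "vocoder/config.json",
    "temporal_adapter.ckpt",
    "timestamp_detector.pth.tar",
]


def missing_checkpoints(root="checkpoints"):
    """Return the expected files that are absent under `root` (hypothetical helper)."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoint files:")
        for rel in missing:
            print(f"  checkpoints/{rel}")
    else:
        print("All expected checkpoint files are present.")
```

Run it from the repository root. If your checkpoints live elsewhere, pass that folder to `missing_checkpoints`.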

Gradio demo

You can launch the Gradio interface for FoleyCrafter by running the following command:

python app.py --share

Inference

Video To Audio Generation

python inference.py --save_dir=output/sora/

Results:

Input Video	Generated Audio
0.mp4	0.mp4
1.mp4	1.mp4
2.mp4	2.mp4
3.mp4	3.mp4

Temporal Alignment with Visual Cues

python inference.py \
--temporal_align \
--input=input/avsync \
--save_dir=output/avsync/

Results:

Ground Truth	Generated Audio
0.mp4	0.mp4
1.mp4	1.mp4
2.mp4	2.mp4

Text-based Video to Audio Generation

Using Prompt

# case1
python inference.py \
--input=input/PromptControl/case1/ \
--seed=10201304011203481429 \
--save_dir=output/PromptControl/case1/

python inference.py \
--input=input/PromptControl/case1/ \
--seed=10201304011203481429 \
--prompt='noisy, people talking' \
--save_dir=output/PromptControl/case1_prompt/

# case2
python inference.py \
--input=input/PromptControl/case2/ \
--seed=10021049243103289113 \
--save_dir=output/PromptControl/case2/

python inference.py \
--input=input/PromptControl/case2/ \
--seed=10021049243103289113 \
--prompt='seagulls' \
--save_dir=output/PromptControl/case2_prompt/

Results:

Generated Audio	Generated Audio
Without Prompt	Prompt: noisy, people talking
0.mp4	0.mp4
Without Prompt	Prompt: seagulls
0.mp4	0.mp4

Using Negative Prompt

# case3
python inference.py \
--input=input/PromptControl/case3/ \
--seed=10041042941301238011 \
--save_dir=output/PromptControl/case3/

python inference.py \
--input=input/PromptControl/case3/ \
--seed=10041042941301238011 \
--nprompt='river flows' \
--save_dir=output/PromptControl/case3_nprompt/

# case4
python inference.py \
--input=input/PromptControl/case4/ \
--seed=10014024412012338096 \
--save_dir=output/PromptControl/case4/

python inference.py \
--input=input/PromptControl/case4/ \
--seed=10014024412012338096 \
--nprompt='noisy, wind noise' \
--save_dir=output/PromptControl/case4_nprompt/

Results:

Generated Audio	Generated Audio
Without Prompt	Negative Prompt: river flows
0.mp4	0.mp4
Without Prompt	Negative Prompt: noisy, wind noise
0.mp4	0.mp4
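
Each case above runs the same input folder and seed twice, once as a baseline and once with a prompt or negative prompt, presumably so that any difference in the output can be attributed to the text condition. The sketch below shows one way to script such pairs. The helper names `build_command` and `paired_commands` are hypothetical; only the flags `--input`, `--seed`, `--save_dir`, `--prompt`, and `--nprompt` come from the commands above.

```python
import shlex


def build_command(input_dir, seed, save_dir, prompt=None, nprompt=None):
    """Assemble an inference.py command line (hypothetical helper)."""
    cmd = [
        "python", "inference.py",
        f"--input={input_dir}",
        f"--seed={seed}",
        f"--save_dir={save_dir}",
    ]
    # Text conditions are optional; leave both out for the baseline run.
    if prompt is not None:
        cmd.append(f"--prompt={prompt}")
    if nprompt is not None:
        cmd.append(f"--nprompt={nprompt}")
    return cmd


def paired_commands(input_dir, seed, baseline_dir, guided_dir, **text):
    """Return a baseline and a text-guided command sharing input and seed."""
    return [
        build_command(input_dir, seed, baseline_dir),
        build_command(input_dir, seed, guided_dir, **text),
    ]


if __name__ == "__main__":
    pairs = paired_commands(
        "input/PromptControl/case1/",
        10201304011203481429,
        "output/PromptControl/case1/",
        "output/PromptControl/case1_prompt/",
        prompt="noisy, people talking",
    )
    for cmd in pairs:
        # Print a shell-safe version of each command for inspection.
        # To execute instead, call subprocess.run(cmd, check=True).
        print(shlex.join(cmd))
```

Building the command as a list keeps prompts that contain spaces or commas intact without manual quoting.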

Commandline Usage Parameters

options:
  -h, --help            show this help message and exit
  --prompt PROMPT       prompt for audio generation
  --nprompt NPROMPT     negative prompt for audio generation
  --seed SEED           random seed
  --temporal_align TEMPORAL_ALIGN
                        use temporal adapter or not
  --temporal_scale TEMPORAL_SCALE
                        temporal align scale
  --semantic_scale SEMANTIC_SCALE
                        visual content scale
  --input INPUT         input video folder path
  --ckpt CKPT           checkpoints folder path
  --save_dir SAVE_DIR   generation result save path
  --pretrain PRETRAIN   generator checkpoint path
  --device DEVICE

BibTeX

@misc{zhang2024pia,
  title={FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds},
  author={Yiming Zhang and Yicheng Gu and Yanhong Zeng and Zhening Xing and Yuancheng Wang and Zhizheng Wu and Kai Chen},
  year={2024},
  eprint={2407.01494},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

Contact Us

Yiming Zhang: zhangyiming@pjlab.org.cn

Yicheng Gu: yichenggu@link.cuhk.edu.cn

Yanhong Zeng: zengyanhong@pjlab.org.cn

LICENSE

Please see LICENSE for the terms that apply to FoleyCrafter. If you are using it for commercial purposes, please also check the license of Auffusion.

Acknowledgements

The code is built upon Auffusion, CondFoleyGen and SpecVQGAN.

We also recommend Amphion 💝, a toolkit for audio, music, and speech generation.
