Dynamic Typography: Bringing Text to Life via Video Diffusion Prior

Zichen Liu*, Yihao Meng*, Hao Ouyang, Yue Yu, Bolin Zhao, Daniel Cohen-Or, Huamin Qu

* Denotes equal contribution

We present an automated text animation scheme, termed "Dynamic Typography," which combines two challenging tasks: deforming letters to convey semantic meaning and infusing them with vibrant movements based on user prompts.

We strongly recommend visiting our demo page.

Setup

git clone https://github.com/zliucz/animate-your-word.git
cd animate-your-word

Environment

To set up the environment on Linux, please run:

conda env create -f environment.yml

Next, you need to install diffvg:

conda activate dTypo
git clone https://github.com/BachiLi/diffvg.git
cd diffvg
git submodule update --init --recursive
python setup.py install
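
Before moving on, it can help to confirm that the environment can see the packages installed above. The following is a minimal sketch, not part of the repository: it assumes diffvg exposes its Python bindings under the module name pydiffvg and that the pipeline depends on PyTorch (torch). It only looks the modules up on the import path and does not import them.

```python
import importlib.util

# Module names are assumptions: diffvg is expected to install its Python
# bindings as "pydiffvg", and the pipeline is assumed to rely on "torch".
REQUIRED_MODULES = ("torch", "pydiffvg")


def missing_modules(names=REQUIRED_MODULES):
    """Return the names in `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

Run it inside the activated dTypo environment; an empty result suggests the diffvg installation step succeeded.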

Generate Your Animation!

To animate a letter within a word, run the following command:

CUDA_VISIBLE_DEVICES=0 python dynamicTypography.py \
        --word "<The Word>" \
        --optimized_letter "<The letter to be animated>" \
        --caption "<The prompt that describes the animation>" \
        --use_xformer --canonical --anneal \
        --use_perceptual_loss --use_conformal_loss  \
        --use_transition_loss

For example:

CUDA_VISIBLE_DEVICES=0 python dynamicTypography.py \
        --word "father" --optimized_letter "h" \
        --caption "A tall father walks along the road, holding his little son with his hand" \
        --use_xformer --canonical --anneal \
        --use_perceptual_loss --use_conformal_loss \
        --use_transition_loss

CUDA_VISIBLE_DEVICES=0 python dynamicTypography.py \
        --word "PASSION" --optimized_letter "N" \
        --caption "Two people kiss each other, one holding the others chin with his hand" \
        --use_xformer --canonical --anneal \
        --use_perceptual_loss --use_conformal_loss  \
        --use_transition_loss --schedule_rate 5.0

The output animation will be saved to "videos".
The output includes the network's weights, the SVG frame logs, and their rendered .mp4 files (under svg_logs and mp4_logs, respectively).
Both the in-context animation (the letter within the word) and the animation of the letter alone are saved.
At the end of training, a high-quality GIF render of the last iteration is written to HG_gif.gif.
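
To collect the final renders after several runs, a short helper along the following lines may be useful. It is a sketch under stated assumptions: the exact directory nesting inside "videos" is not described here, so the helper searches recursively for any file named HG_gif.gif. The function name find_final_gifs is hypothetical and not part of the repository.

```python
from pathlib import Path


def find_final_gifs(output_root="videos", gif_name="HG_gif.gif"):
    """Return every file named `gif_name` found under `output_root`, sorted.

    The search is recursive because the layout of the run folders inside
    the output directory is an assumption, not something this README states.
    """
    root = Path(output_root)
    if not root.is_dir():
        return []
    return sorted(root.rglob(gif_name))


if __name__ == "__main__":
    for gif_path in find_final_gifs():
        print(gif_path)
```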

We provide many example run scripts in scripts; the expected resulting GIFs are in example_gifs. More results can be found on our project page.

By default, a 24-frame video will be generated, requiring about 28GB of VRAM. If there is not enough VRAM available, the number of frames can be reduced by using the --num_frames parameter.
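
As an illustration of passing --num_frames, the sketch below assembles the command from the examples above and appends a reduced frame count. The helper build_command is hypothetical, and the value 12 is an arbitrary placeholder rather than a recommended setting. The script only prints the command; it does not launch training.

```python
import shlex

# Flags shared by every example command in this README.
COMMON_FLAGS = [
    "--use_xformer", "--canonical", "--anneal",
    "--use_perceptual_loss", "--use_conformal_loss",
    "--use_transition_loss",
]


def build_command(word, letter, caption, num_frames=None):
    """Build the argument list for dynamicTypography.py.

    `num_frames` is appended only when given, so the script's default
    frame count applies otherwise.
    """
    cmd = [
        "python", "dynamicTypography.py",
        "--word", word,
        "--optimized_letter", letter,
        "--caption", caption,
        *COMMON_FLAGS,
    ]
    if num_frames is not None:
        cmd += ["--num_frames", str(num_frames)]
    return cmd


if __name__ == "__main__":
    # 12 frames is a placeholder value chosen only for illustration.
    cmd = build_command(
        "father", "h",
        "A tall father walks along the road, holding his little son with his hand",
        num_frames=12,
    )
    print("CUDA_VISIBLE_DEVICES=0 " + shlex.join(cmd))
```

The printed line can be pasted into a shell as-is, since shlex.join quotes the caption.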

Tips:

If the animation stays too close to the original letter's shape, set a lower --perceptual_weight; if it deviates too much from the original shape, set a higher one.

If you want the animation to be less or more geometrically similar to the original letter, set a lower or higher --angles_w, respectively.

If you want to further enforce appearance consistency between frames, set a higher --transition_weight. Note that this will reduce the motion amplitude.

Small visual artifacts can often be fixed by changing the --seed.
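
Trying several seeds, as suggested in the last tip, can be scripted. The sketch below generates one command per seed from the "father" example above. It assumes --seed takes an integer; the seed values 0, 1 and 2 are arbitrary placeholders, and the helper seed_sweep is hypothetical. The commands are printed rather than executed.

```python
import shlex

# Base command copied from the "father" example in this README.
BASE_COMMAND = [
    "python", "dynamicTypography.py",
    "--word", "father", "--optimized_letter", "h",
    "--caption",
    "A tall father walks along the road, holding his little son with his hand",
    "--use_xformer", "--canonical", "--anneal",
    "--use_perceptual_loss", "--use_conformal_loss",
    "--use_transition_loss",
]


def seed_sweep(base, seeds):
    """Return one command per seed, leaving `base` unmodified."""
    return [list(base) + ["--seed", str(seed)] for seed in seeds]


if __name__ == "__main__":
    # Seed values are arbitrary placeholders, assumed to be integers.
    for cmd in seed_sweep(BASE_COMMAND, [0, 1, 2]):
        print("CUDA_VISIBLE_DEVICES=0 " + shlex.join(cmd))
```

The same pattern could be used to vary --perceptual_weight, --angles_w, or --transition_weight, although suitable values depend on the script's defaults, which are not listed here.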

Citation:

Don't forget to cite this source if it proves useful in your research!

@article{liu2024dynamic, 
	title={Dynamic Typography: Bringing Text to Life via Video Diffusion Prior}, 
	author={Zichen Liu and Yihao Meng and Hao Ouyang and Yue Yu and Bolin Zhao and Daniel Cohen-Or and Huamin Qu}, 
	year={2024}, 
	eprint={2404.11614}, 
	archivePrefix={arXiv}, 
	primaryClass={cs.CV}}

Acknowledgment:

Our implementation is based on word-as-image and live-sketch. We thank the authors for their remarkable contributions and for releasing their code.