
This is the official code for the paper Tailor3D


Tailor3D: Customized 3D Assets Editing and Generation with Dual-Side Images

Zhangyang Qi1,5*, Yunhan Yang1*, Mengchen Zhang2,5, Long Xing3,5, Xiaoyang Wu1, Tong Wu4,5, Dahua Lin4,5, Xihui Liu1, Jiaqi Wang5✉, Hengshuang Zhao1

* Equal Contribution, ✉ Corresponding Author

1 The University of Hong Kong, 2 Zhejiang University, 3 University of Science and Technology of China,
4 The Chinese University of Hong Kong, 5 Shanghai AI Laboratory,

🔥 News

🏠 Overview (Pipeline and Video)

Pipeline


We propose Tailor3D, a novel pipeline that creates customized 3D assets from editable dual-side images using feed-forward reconstruction. The approach mimics how a tailor makes local changes to an object and transfers styles:

  1. Use image editing methods to edit the front-view image. The front-view image can be provided or generated from text.
  2. Use multi-view diffusion techniques (e.g., Zero-1-to-3) to generate the back view of the object.
  3. Use image editing methods to edit the back-view image.
  4. Use our proposed Dual-sided LRM (Large Reconstruction Model) to seamlessly fuse the front and back images into the customized 3D asset.

Each step takes only a few seconds, allowing users to interactively obtain the 3D objects they desire. Experimental results show Tailor3D's effectiveness in 3D generative fill and style transfer, providing an efficient solution for 3D asset editing.
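To make the flow concrete, here is a purely illustrative Python sketch of the four steps. Every function in it is a hypothetical placeholder standing in for the tools named above; none of them is an API exposed by this repository.

# Illustrative sketch of the four-step workflow. All functions are
# hypothetical placeholders, not APIs from this repository.

def edit_image(image, prompt):
    # Steps 1 and 3: any off-the-shelf 2D image-editing method.
    raise NotImplementedError("plug in your preferred image editor")

def generate_back_view(front_image):
    # Step 2: a multi-view diffusion model such as Zero-1-to-3.
    raise NotImplementedError("plug in a multi-view diffusion model")

def dual_sided_lrm(front_image, back_image):
    # Step 4: this repository's Dual-sided LRM (see Inference below).
    raise NotImplementedError("run the inference script in this README")

def tailor3d(front_image, front_prompt, back_prompt):
    front = edit_image(front_image, front_prompt)   # step 1: edit front view
    back = generate_back_view(front)                # step 2: hallucinate back view
    back = edit_image(back, back_prompt)            # step 3: edit back view
    return dual_sided_lrm(front, back)              # step 4: fuse into 3D asset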

Tutorial Videos

🔧 Setup

Installation

conda create -n tailor3d python=3.11
conda activate tailor3d

conda install pytorch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 pytorch-cuda=11.8 -c pytorch -c nvidia

pip install -r requirements.txt
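After installation, a quick sanity check (a minimal, Tailor3D-agnostic sketch) confirms that PyTorch matches the pinned versions and can see your GPU:

# Sanity-check the freshly created environment.
import torch
import torchvision

print("torch:", torch.__version__)               # expect 2.2.2
print("torchvision:", torchvision.__version__)   # expect 0.17.2
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))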

🚀 Inference

1. Pretrained Models

Model weights are released on Hugging Face and will be downloaded automatically the first time you run the inference script.

| Model | Pretrained Model | Layers | Feat. Dim. | Trip. Dim. | Input Res. | Image Encoder |
|---|---|---|---|---|---|---|
| tailor3d-small-1.0 | openlrm-mix-small-1.1 | 12 | 512 | 32 | 224 | dinov2_vits14_reg |
| tailor3d-base-1.0 | openlrm-mix-base-1.1 | 12 | 768 | 48 | 336 | dinov2_vitb14_reg |
| tailor3d-large-1.0 | openlrm-mix-large-1.1 | 16 | 1024 | 80 | 448 | dinov2_vitb14_reg |
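The automatic download can also be done ahead of time. Below is a sketch using huggingface_hub (assumed to be available in the environment); the repo ids are taken from the table above.

# Optional: pre-fetch weights instead of downloading on first inference.
# Assumes huggingface_hub is available in the environment.
from huggingface_hub import snapshot_download

for repo_id in ["alexzyqi/tailor3d-large-1.0", "zxhezexin/openlrm-mix-large-1.1"]:
    local_dir = snapshot_download(repo_id)   # cached under ~/.cache/huggingface
    print(repo_id, "->", local_dir)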

2. Prepare Images

  • We provide some sample inputs under assets/sample_input/demo. Note that the folders front and back should contain the edited images of the object's front view and back view, respectively.
  • Prepare RGBA images, or RGB images with a white background (using a background-removal tool such as Rembg or Clipdrop); a minimal sketch follows below.
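As a concrete example of the background-removal step, the following minimal sketch strips the background with Rembg and composites the object onto white. It assumes rembg and Pillow are installed (neither is confirmed by this README), and the file names are placeholders.

# Minimal sketch: remove the background with Rembg and composite the
# object onto a white canvas (rembg and Pillow assumed installed).
from rembg import remove
from PIL import Image

rgba = remove(Image.open("my_object.png")).convert("RGBA")  # placeholder input
white = Image.new("RGBA", rgba.size, (255, 255, 255, 255))
white.alpha_composite(rgba)                       # paste the object over white
white.convert("RGB").save("front/my_object.png")  # placeholder output path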

3. Inference

An example usage of tailor3d-large is as follows:

# Example usage
EXPORT_VIDEO=true
EXPORT_MESH=true
DOUBLE_SIDED=true
HUGGING_FACE=true

INFER_CONFIG="./configs/all-large-2sides.yaml"
MODEL_NAME="alexzyqi/tailor3d-large-1.0"
PRETRAIN_MODEL_HF="zxhezexin/openlrm-mix-large-1.1"
IMAGE_INPUT="./assets/sample_input/demo"

python -m openlrm.launch infer.lrm --infer $INFER_CONFIG model_name=$MODEL_NAME pretrain_model_hf=$PRETRAIN_MODEL_HF image_input=$IMAGE_INPUT export_video=$EXPORT_VIDEO export_mesh=$EXPORT_MESH double_sided=$DOUBLE_SIDED inferrer.hugging_face=$HUGGING_FACE
  • You may specify which form of output to generate by setting the flags EXPORT_VIDEO=true and EXPORT_MESH=true.
  • When DOUBLE_SIDED=true, Tailor3D uses both front and back view images as input; when set to false, it degenerates to LRM.
  • When HUGGING_FACE=true, the released Hugging Face model is used for inference; when set to false, inference instead runs from a local checkpoint produced by the normal training process.
  • INFER_CONFIG is the config used for both training and inference; you can switch it to the small or base variants.
  • MODEL_NAME corresponds to the model version specified in INFER_CONFIG. PRETRAIN_MODEL_HF corresponds to the pretrained model used in OpenLRM.
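To run the same command over several input folders without retyping the flags, a thin Python wrapper such as the sketch below works; all paths and options mirror the example above.

# Sketch: batch the inference command over multiple input folders.
import subprocess

for image_input in ["./assets/sample_input/demo"]:   # add your own folders here
    subprocess.run([
        "python", "-m", "openlrm.launch", "infer.lrm",
        "--infer", "./configs/all-large-2sides.yaml",
        "model_name=alexzyqi/tailor3d-large-1.0",
        "pretrain_model_hf=zxhezexin/openlrm-mix-large-1.1",
        f"image_input={image_input}",
        "export_video=true", "export_mesh=true",
        "double_sided=true", "inferrer.hugging_face=true",
    ], check=True)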

🏗️ Training

1. Data Preparation

We use the Gobjaverse-LVIS dataset. You can request access through its Hugging Face dataset card.
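Once access is granted, the data can be fetched with huggingface_hub. The repo id below is a placeholder; substitute the id shown on the actual Gobjaverse-LVIS dataset card.

# Sketch: download the training data after access has been granted.
# DATASET_REPO is a placeholder; take the real id from the dataset card.
from huggingface_hub import snapshot_download

DATASET_REPO = "<gobjaverse-lvis-dataset-id>"
local_path = snapshot_download(DATASET_REPO, repo_type="dataset")
print("dataset downloaded to", local_path)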

2. Accelerate Training Configuration

  • configs/accelerate-train.yaml: a sample accelerate config file to use 8 GPUs with bf16 mixed precision.
  • configs/accelerate-train-4gpus.yaml: a sample accelerate config file to use 4 GPUs with bf16 mixed precision.

3. Run Training

An example training usage of tailor3d-large is as follows:

# Example usage
ACC_CONFIG="./configs/accelerate-train.yaml"
TRAIN_CONFIG="./configs/all-large-2sides.yaml"

accelerate launch --config_file $ACC_CONFIG -m openlrm.launch train.lrm --config $TRAIN_CONFIG
  • A sample training config file is provided as TRAIN_CONFIG above; the training and inference settings live in the same config YAML file.

➕ Optional: Inference on Your Custom Trained Models (From OpenLRM)

1. Convert your trained model to the Hugging Face release format.

  • The inference pipeline is compatible with Hugging Face utilities for convenience.

  • Convert the training checkpoint into an inference model by running the following script.

    python scripts/convert_hf.py --config configs/all-large-2sides.yaml
    
  • The converted model will be saved under exps/releases by default and can be used for inference following the instructions in the Inference section above.

  • Note: the model.safetensors produced this way contains the full set of parameters.

| Model | Model Size | Model Size (with pretrained model) |
|---|---|---|
| tailor3d-small-1.0 | 17.4 MB | 436 MB |
| tailor3d-base-1.0 | 26.8 MB | 1.0 GB |
| tailor3d-large-1.0 | 45 MB | 1.8 GB |
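To double-check that a converted checkpoint really contains the full parameter set, you can list its tensors with the safetensors library. This is only a sketch; the path reuses the example checkpoint from the upload command below.

# Sketch: inspect a converted checkpoint with the safetensors library.
from safetensors import safe_open

ckpt = "exps/releases/gobjaverse-2sides-large/0428_conv_e10/step_013340/model.safetensors"
with safe_open(ckpt, framework="pt") as f:
    keys = list(f.keys())
    print(len(keys), "tensors stored, e.g.", keys[:3])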

2. Upload the release-format model to the Hugging Face model hub.

python scripts/upload_hub.py --model_type lrm --local_ckpt exps/releases/gobjaverse-2sides-large/0428_conv_e10/step_013340 --repo_id alexzyqi/Tailor3D-Large-1.0

Note that you should change --local_ckpt and --repo_id to your own values.
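If the upload script does not fit your workflow, the same effect can be achieved directly with huggingface_hub. A sketch follows; authenticate with huggingface-cli login first, and replace the placeholder repo id with your own.

# Sketch: upload the converted checkpoint folder directly via huggingface_hub.
from huggingface_hub import HfApi

api = HfApi()
api.create_repo("your-username/your-tailor3d-model", exist_ok=True)  # placeholder id
api.upload_folder(
    repo_id="your-username/your-tailor3d-model",
    folder_path="exps/releases/gobjaverse-2sides-large/0428_conv_e10/step_013340",
)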

📚 Related Work

3D Object Dataset

  • Gobjaverse: Rendered by making two complete rotations around each object in Objaverse (28K 3D objects).
  • LVIS: The Large Vocabulary Instance Segmentation (LVIS) dataset, which focuses on objects. Gobjaverse-LVIS contains 21,445 objects in total.

Text-to-Image Tools

Image Editing Tools

Multi-view Diffusion Tools

  • Stable Zero123: A 3D-aware multi-view diffusion method that generates the back-view image from the front image.

3D Feed-Forward Reconstruction Methods

🔗 Citation

If you find this work useful for your research, please consider citing:

@misc{qi2024tailor3dcustomized3dassets,
      title={Tailor3D: Customized 3D Assets Editing and Generation with Dual-Side Images}, 
      author={Zhangyang Qi and Yunhan Yang and Mengchen Zhang and Long Xing and Xiaoyang Wu and Tong Wu and Dahua Lin and Xihui Liu and Jiaqi Wang and Hengshuang Zhao},
      year={2024},
      eprint={2407.06191},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2407.06191}, 
}