This repo is modified from glid-3-xl. Aesthetic CLIP embeds are provided by aesthetic-predictor.
Please ensure the following dependencies are installed prior to building this repo:
- software-properties-common
- build-essential
- libopenmpi-dev
- liblzma-dev
- libnss3-dev
- zlib1g-dev
- libgdbm-dev
- libncurses5-dev
- libssl-dev
- libffi-dev
- libbz2-dev
It's a good idea to use a virtual environment or a conda environment.
python3 -m venv venv
source venv/bin/activate
(venv) $
Before installing, you should install PyTorch manually by following the instructions at pytorch.org.
In my case, I needed the following for CUDA 11.3:
(venv) $ pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu113
To check your CUDA version, run nvidia-smi.
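Note that nvidia-smi reports the highest CUDA version the installed driver supports, which can differ from the CUDA version your PyTorch wheel was built against. As a quick sanity check after installing PyTorch, a small script along these lines prints the build's CUDA version and whether a GPU is visible. The function name `describe_torch_cuda` is made up for this example; `torch.version.cuda` and `torch.cuda.is_available()` are standard PyTorch attributes.

```python
import importlib.util


def describe_torch_cuda():
    """Return a one-line summary of the installed PyTorch build, or a hint if it is missing."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed in this environment"
    import torch

    # torch.version.cuda is the CUDA version the wheel was built against
    # (None for CPU-only builds); is_available() checks for a usable GPU.
    return (
        f"torch {torch.__version__}, built for CUDA {torch.version.cuda}, "
        f"GPU available: {torch.cuda.is_available()}"
    )


print(describe_torch_cuda())
```

If the reported build version is None or the GPU is not available, the wheel index used in the pip command above probably does not match your driver.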
You can now install this repo by running pip install -e . in the project directory.
(venv) $ git clone https://github.com/laion-ai/ldm-finetune.git
(venv) $ cd ldm-finetune
(venv) $ pip install -e .
wget https://dall-3.com/models/glid-3-xl/bert.pt
wget https://dall-3.com/models/glid-3-xl/kl-f8.pt
There are several stage 2 checkpoints to choose from:
The original checkpoint from CompVis trained on LAION-400M.
wget https://dall-3.com/models/glid-3-xl/diffusion.pt
The first finetune from jack000's glid-3-xl. Modified to accept a CLIP text embed and finetuned on curated data to help with watermarks. Doesn't support inpainting.
wget https://dall-3.com/models/glid-3-xl/finetune.pt
This second finetune adds support for inpainting and can also be used for unconditional output by setting the inpaint image_embed to zeros.
wget https://dall-3.com/models/glid-3-xl/inpaint.pt
erlich is inpaint.pt finetuned on a dataset collected from LAION-5B named Large Logo Dataset. It consists of roughly 100K images of logos with captions generated via BLIP using aggressive re-ranking and filtering.
wget -O erlich.pt https://huggingface.co/laion/erlich/resolve/main/model/ema_0.9999_120000.pt
ONGO is inpaint.pt finetuned on the Wikiart dataset consisting of about 100K paintings with captions generated via BLIP using aggressive re-ranking and filtering. We also make use of the original captions which contain the author name and the painting title.
wget https://huggingface.co/laion/ongo/resolve/main/ongo.pt
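Before sampling, it can help to confirm that the downloads landed where you expect. The sketch below is a hypothetical helper, not part of this repo: it assumes bert.pt and kl-f8.pt are always needed and that at least one stage 2 checkpoint is needed, using the file names from the wget commands above.

```python
from pathlib import Path

# File names taken from the wget commands above. Treating the first two as
# always required is an assumption made for this example.
REQUIRED = ["bert.pt", "kl-f8.pt"]
STAGE_2 = ["diffusion.pt", "finetune.pt", "inpaint.pt", "erlich.pt", "ongo.pt"]


def missing_checkpoints(directory="."):
    """Return a list describing which checkpoint files are absent from `directory`."""
    root = Path(directory)
    missing = [name for name in REQUIRED if not (root / name).is_file()]
    # Only one stage 2 checkpoint is needed, so report them as a group.
    if not any((root / name).is_file() for name in STAGE_2):
        missing.append("one of: " + ", ".join(STAGE_2))
    return missing


print(missing_checkpoints() or "all checkpoints found")
```

Run it from the directory you downloaded the files into, or pass that directory as the argument.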
# fast PLMS sampling
(venv) $ python sample.py --model_path erlich.pt --batch_size 6 --num_batches 6 --text "a cyberpunk girl with a scifi neuralink device on her head"
# classifier free guidance + CLIP guidance (better adherence to prompt, much slower)
(venv) $ python sample.py --clip_guidance --model_path finetune.pt --batch_size 1 --num_batches 12 --text "a cyberpunk girl with a scifi neuralink device on her head | trending on artstation"
# sample with an init image
(venv) $ python sample.py --init_image picture.jpg --skip_timesteps 10 --model_path ongo.pt --batch_size 6 --num_batches 6 --text "a cyberpunk girl with a scifi neuralink device on her head"
Inpainting is also known as human-guided diffusion. You can use inpainting to generate more complex prompts by progressively editing the image.
Note: you can use images larger than 256px, but the model only sees 256x256 at a time, so ensure the inpaint area is smaller than that.
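One way to check the size constraint above is to measure the bounding box of the painted region before running the sampler. The following is a minimal sketch on a toy mask held as nested lists. The helper names are hypothetical, and treating truthy cells as "painted" is an assumption, so check which color marks the edit area in your own mask file.

```python
MODEL_WINDOW = 256  # pixels per side the model sees at once, per the note above


def mask_bounding_box(mask):
    """Return (x, y, width, height) of the painted cells, or None for an empty mask.

    `mask` is a list of rows; any truthy value counts as painted.
    """
    painted_rows = [y for y, row in enumerate(mask) if any(row)]
    if not painted_rows:
        return None
    painted_cols = [x for row in mask for x, value in enumerate(row) if value]
    x0, x1 = min(painted_cols), max(painted_cols)
    y0, y1 = painted_rows[0], painted_rows[-1]
    return x0, y0, x1 - x0 + 1, y1 - y0 + 1


def fits_model_window(mask, window=MODEL_WINDOW):
    """True if the painted area is strictly smaller than the window on both sides."""
    box = mask_bounding_box(mask)
    if box is None:
        return True  # nothing to inpaint
    _, _, width, height = box
    return width < window and height < window


# Toy 4x6 mask with a 3-wide, 2-tall painted patch.
toy_mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
print(mask_bounding_box(toy_mask))  # (1, 1, 3, 2)
print(fits_model_window(toy_mask))  # True
```

For a real mask you would first load the image into rows of pixel values with an imaging library of your choice.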
# install PyQt5 if you want to use a GUI, otherwise supply a mask file
(venv) $ pip install PyQt5
# this will pop up a window, use your mouse to paint
# use the generated npy files instead of png for best quality
(venv) $ python sample.py --model_path inpaint.pt --edit output_npy/00000.npy --batch_size 6 --num_batches 6 --text "your prompt"
# after painting, the mask is saved for re-use
(venv) $ python sample.py --mask mask.png --model_path inpaint.pt --edit output_npy/00000.npy --batch_size 6 --num_batches 6 --text "your prompt"
# additional arguments for uncropping
(venv) $ python sample.py --edit_x 64 --edit_y 64 --edit_width 128 --edit_height 128 --model_path inpaint.pt --edit output_npy/00000.npy --batch_size 6 --num_batches 6 --text "your prompt"
## Autoedit
# autoedit uses the inpaint model to give the ldm an image prompting function (that works differently from --init_image)
# it continuously edits random parts of the image to maximize clip score for the text prompt
(venv) $ python autoedit.py --edit image.png --model_path inpaint.pt --batch_size 6 --text "your prompt"
# batch size > 1 required
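The idea described above (edit random regions, keep what raises the score) can be illustrated with a toy search loop. This is not the code in autoedit.py. The acceptance rule, the per-step batch of candidates, and every name below are assumptions made for illustration. In the real script the candidates would come from the inpaint model and the score from CLIP.

```python
import random


def autoedit_sketch(image, score, propose_edit, steps=50, batch_size=2, seed=0):
    """Greedy search that keeps the best-scoring candidate seen so far.

    score(image) -> float              stands in for the CLIP score against the prompt
    propose_edit(image, rng) -> image  stands in for inpainting one random region
    """
    rng = random.Random(seed)
    best, best_score = image, score(image)
    for _ in range(steps):
        # Generate a small batch of candidates from the current best image.
        candidates = [propose_edit(best, rng) for _ in range(batch_size)]
        top = max(candidates, key=score)
        top_score = score(top)
        if top_score > best_score:  # accept only improvements
            best, best_score = top, top_score
    return best, best_score


# Toy stand-ins: the "image" is a list of numbers and the score rewards values near 1.0.
def toy_score(image):
    return -sum((value - 1.0) ** 2 for value in image)


def toy_edit(image, rng):
    edited = list(image)
    edited[rng.randrange(len(edited))] = rng.random()  # overwrite one random "region"
    return edited


start = [0.0] * 8
result, result_score = autoedit_sketch(start, toy_score, toy_edit)
print(toy_score(start), "->", result_score)
```

Because only improvements are accepted, the score never decreases from one step to the next.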
MODEL_FLAGS="--dropout 0.1 --ema_rate 0.9999 --attention_resolutions 32,16,8 --class_cond False --diffusion_steps 1000 --image_size 32 --learn_sigma False --noise_schedule linear --num_channels 320 --num_heads 8 --num_res_blocks 2 --resblock_updown False --use_fp16 True --use_scale_shift_norm False"
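# --lr needs a value: set LR to your learning rate first (the assignment fails with a message if LR is unset)
TRAIN_FLAGS="--lr ${LR:?set LR to a learning rate} --batch_size 64 --microbatch 1 --log_interval 1 --save_interval 5000 --kl_model kl-f8.pt --bert_model bert.pt --resume_checkpoint diffusion.pt"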
export OPENAI_LOGDIR=./logs/
export TOKENIZERS_PARALLELISM=false
python scripts/image_train_inpaint.py --data_dir /path/to/data $MODEL_FLAGS $TRAIN_FLAGS
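Two of the flags above are easier to read with the arithmetic spelled out. The sketch below rests on two assumptions: that kl-f8 denotes an autoencoder with a downsampling factor of 8 (the usual latent diffusion naming), and that --microbatch splits each batch into chunks whose gradients are accumulated before one optimizer step, as in guided-diffusion-style training loops. Both function names are hypothetical.

```python
def latent_size(pixels, downsample=8):
    """Side length of the latent for a square image, assuming an f8 autoencoder."""
    if pixels % downsample:
        raise ValueError(f"{pixels} is not a multiple of {downsample}")
    return pixels // downsample


def microbatches_per_step(batch_size, microbatch):
    """Number of forward/backward passes accumulated before one optimizer step."""
    return -(-batch_size // microbatch)  # ceiling division


print(latent_size(256))              # 32, matching --image_size 32
print(microbatches_per_step(64, 1))  # 64
```

Under these assumptions, raising --microbatch reduces the number of passes per step at the cost of more GPU memory per pass.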