Inference of Stable Diffusion in pure C/C++
Features:

- Plain C/C++ implementation based on ggml, working in the same way as llama.cpp
- Super lightweight and without external dependencies
- SD1.x, SD2.x and SDXL support
  - Note: the SDXL VAE produces NaNs under FP16, but ggml_conv_2d only operates under FP16. You therefore need to pass a VAE that has been fixed for FP16 via the `--vae` parameter. You can find it here: SDXL VAE FP16 Fix.
- SD-Turbo and SDXL-Turbo support
- 16-bit, 32-bit float support
- 4-bit, 5-bit and 8-bit integer quantization support
- Accelerated memory-efficient CPU inference
  - Only requires ~2.3GB when using txt2img with fp16 precision to generate a 512x512 image; enabling Flash Attention brings this down to ~1.8GB.
- AVX, AVX2 and AVX512 support for x86 architectures
- Full CUDA and Metal backends for GPU acceleration
- Can load ckpt, safetensors and diffusers models/checkpoints, as well as standalone VAE models
  - No need to convert to `.ggml` or `.gguf` anymore!
- Flash Attention for memory usage optimization (CPU only for now)
- Original `txt2img` and `img2img` modes
- Negative prompt
- stable-diffusion-webui style tokenizer (not all the features, only token weighting for now)
- LoRA support, same as stable-diffusion-webui
- Latent Consistency Models support (LCM/LCM-LoRA)
- Faster and memory-efficient latent decoding with TAESD
- Upscale generated images with ESRGAN
- VAE tiling to reduce memory usage
- Sampling methods
  - Euler A
  - Euler
  - Heun
  - DPM2
  - DPM++ 2M
  - DPM++ 2M v2
  - DPM++ 2S a
  - LCM
- Cross-platform reproducibility (`--rng cuda`, consistent with the stable-diffusion-webui GPU RNG)
- Embeds generation parameters into the PNG output as a webui-compatible text string
- Supported platforms
  - Linux
  - Mac OS
  - Windows
  - Android (via Termux)
TODO:

- More sampling methods
- Make inference faster
  - The current implementation of ggml_conv_2d is slow and has high memory usage
- Implement Winograd Convolution 2D for 3x3 kernel filtering
- Continue reducing memory usage (quantize the weights of ggml_conv_2d)
- Implement Textual Inversion (embeddings)
- Implement Inpainting support
- k-quants support

Get the code:

```sh
git clone --recursive https://github.com/leejet/stable-diffusion.cpp
cd stable-diffusion.cpp
```
- If you have already cloned the repository, you can use the following commands to update it to the latest code:
```sh
cd stable-diffusion.cpp
git pull origin master
git submodule init
git submodule update
```
- Download the original weights (.ckpt or .safetensors). For example:
- Stable Diffusion v1.4 from https://huggingface.co/CompVis/stable-diffusion-v-1-4-original
- Stable Diffusion v1.5 from https://huggingface.co/runwayml/stable-diffusion-v1-5
- Stable Diffusion v2.1 from https://huggingface.co/stabilityai/stable-diffusion-2-1
```sh
curl -L -O https://huggingface.co/CompVis/stable-diffusion-v-1-4-original/resolve/main/sd-v1-4.ckpt
# curl -L -O https://huggingface.co/runwayml/stable-diffusion-v1-5/resolve/main/v1-5-pruned-emaonly.safetensors
# curl -L -O https://huggingface.co/stabilityai/stable-diffusion-2-1/resolve/main/v2-1_768-nonema-pruned.safetensors
```
Build the project using CMake:

```sh
mkdir build
cd build
cmake ..
cmake --build . --config Release
```
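If the build succeeds, the `sd` binary is placed under `bin/` inside the build directory (the examples below assume you run it from there); a quick sanity check:

```sh
# Print the help text to confirm the binary was built correctly
./bin/sd -h
```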
To build with OpenBLAS:

```sh
cmake .. -DGGML_OPENBLAS=ON
cmake --build . --config Release
```
Building with cuBLAS provides BLAS acceleration using the CUDA cores of your Nvidia GPU. Make sure to have the CUDA toolkit installed. You can download it from your Linux distro's package manager (e.g. `apt install nvidia-cuda-toolkit`) or from here: CUDA Toolkit. It is recommended to have at least 4 GB of VRAM.
```sh
cmake .. -DSD_CUBLAS=ON
cmake --build . --config Release
```
Using Metal makes the computation run on the GPU. Currently, there are some issues with Metal when performing operations on very large matrices, making it highly inefficient at the moment. Performance improvements are expected in the near future.
```sh
cmake .. -DSD_METAL=ON
cmake --build . --config Release
```
Enabling flash attention reduces memory usage by at least 400 MB. At the moment, it is not supported when CUBLAS is enabled because the kernel implementation is missing.
```sh
cmake .. -DSD_FLASH_ATTN=ON
cmake --build . --config Release
```
```
usage: ./bin/sd [arguments]
arguments:
-h, --help show this help message and exit
-M, --mode [txt2img or img2img] generation mode (default: txt2img)
-t, --threads N number of threads to use during computation (default: -1).
If threads <= 0, then threads will be set to the number of CPU physical cores
-m, --model [MODEL] path to model
--vae [VAE] path to vae
--taesd [TAESD_PATH] path to taesd. Using Tiny AutoEncoder for fast decoding (low quality)
--upscale-model [ESRGAN_PATH] path to esrgan model. Upscale images after generate, just RealESRGAN_x4plus_anime_6B supported by now.
--type [TYPE] weight type (f32, f16, q4_0, q4_1, q5_0, q5_1, q8_0)
If not specified, the default is the type of the weight file.
--lora-model-dir [DIR] lora model directory
-i, --init-img [IMAGE] path to the input image, required by img2img
-o, --output OUTPUT path to write result image to (default: ./output.png)
-p, --prompt [PROMPT] the prompt to render
-n, --negative-prompt PROMPT the negative prompt (default: "")
--cfg-scale SCALE unconditional guidance scale: (default: 7.0)
--strength STRENGTH strength for noising/unnoising (default: 0.75)
1.0 corresponds to full destruction of information in init image
-H, --height H image height, in pixel space (default: 512)
-W, --width W image width, in pixel space (default: 512)
--sampling-method {euler, euler_a, heun, dpm2, dpm++2s_a, dpm++2m, dpm++2mv2, lcm}
sampling method (default: "euler_a")
--steps STEPS number of sample steps (default: 20)
--rng {std_default, cuda} RNG (default: cuda)
-s SEED, --seed SEED RNG seed (default: 42, use random seed for < 0)
-b, --batch-count COUNT number of images to generate.
--schedule {discrete, karras} Denoiser sigma schedule (default: discrete)
--clip-skip N number of layers to skip of clip model (default: 0)
--vae-tiling process vae in tiles to reduce memory usage
-v, --verbose print extra info
```
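As an illustration of how these arguments combine (all paths and values here are placeholders; omit any flag to fall back to its default listed above):

```sh
./bin/sd -M txt2img \
  -m ../models/sd-v1-4.ckpt \
  -p "a lovely cat" \
  -n "blurry, low quality" \
  -H 512 -W 512 \
  --cfg-scale 7.0 --steps 20 \
  --sampling-method euler_a \
  -s 42 \
  -o ./output.png
```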
You can specify the model weight type using the `--type` parameter. The weights are automatically converted when loading the model.

- `f16` for 16-bit floating-point
- `f32` for 32-bit floating-point
- `q8_0` for 8-bit integer quantization
- `q5_0` or `q5_1` for 5-bit integer quantization
- `q4_0` or `q4_1` for 4-bit integer quantization
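For example (a minimal illustration, using the SD 1.4 checkpoint path from the run examples below), quantizing the weights to 8-bit integers while loading:

```sh
# Load the checkpoint and convert its weights to q8_0 on the fly
./bin/sd -m ../models/sd-v1-4.ckpt -p "a lovely cat" --type q8_0
```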
```sh
./bin/sd -m ../models/sd-v1-4.ckpt -p "a lovely cat"
# ./bin/sd -m ../models/v1-5-pruned-emaonly.safetensors -p "a lovely cat"
# ./bin/sd -m ../models/sd_xl_base_1.0.safetensors --vae ../models/sdxl_vae-fp16-fix.safetensors -H 1024 -W 1024 -p "a lovely cat" -v
```
Using formats of different precisions will yield results of varying quality.
(Comparison images omitted for f32, f16, q8_0, q5_0, q5_1, q4_0 and q4_1.)
`./output.png` is the image generated from the txt2img pipeline above.

```sh
./bin/sd --mode img2img -m ../models/sd-v1-4.ckpt -p "cat with blue eyes" -i ./output.png -o ./img2img_output.png --strength 0.4
```
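Since a strength of 1.0 corresponds to fully discarding the information in the init image (see the `--strength` description above), a higher value lets the prompt dominate more; an illustrative variant:

```sh
# Same command, but keep less of the original ./output.png
./bin/sd --mode img2img -m ../models/sd-v1-4.ckpt -p "cat with blue eyes" -i ./output.png -o ./img2img_output.png --strength 0.8
```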
- You can specify the directory where the LoRA weights are stored via `--lora-model-dir`. If not specified, the default is the current working directory.
- LoRA is specified via the prompt, just like in stable-diffusion-webui.

Here's a simple example:
```sh
./bin/sd -m ../models/v1-5-pruned-emaonly.safetensors -p "a lovely cat<lora:marblesh:1>" --lora-model-dir ../models
```
`../models/marblesh.safetensors` or `../models/marblesh.ckpt` will be applied to the model.
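In the webui-style syntax the number after the LoRA name is its weight; assuming stable-diffusion.cpp interprets it the same way, applying the LoRA at half strength would look like this:

```sh
# Illustrative only: apply the marblesh LoRA with weight 0.5 instead of 1
./bin/sd -m ../models/v1-5-pruned-emaonly.safetensors -p "a lovely cat<lora:marblesh:0.5>" --lora-model-dir ../models
```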
To use LCM/LCM-LoRA:

- Download LCM-LoRA from https://huggingface.co/latent-consistency/lcm-lora-sdv1-5
- Specify LCM-LoRA by adding `<lora:lcm-lora-sdv1-5:1>` to the prompt
- It's advisable to set `--cfg-scale` to `1.0` instead of the default `7.0`. For `--steps`, a range of 2-8 steps is recommended. For `--sampling-method`, `lcm`/`euler_a` is recommended.
Here's a simple example:
```sh
./bin/sd -m ../models/v1-5-pruned-emaonly.safetensors -p "a lovely cat<lora:lcm-lora-sdv1-5:1>" --steps 4 --lora-model-dir ../models -v --cfg-scale 1
```
(Comparison images omitted: without LCM-LoRA (--cfg-scale 7) vs. with LCM-LoRA (--cfg-scale 1).)
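The example above keeps the default sampler; a variant that also selects the LCM sampler explicitly, following the recommendations above (the exact step count is just a suggestion):

```sh
# LCM-LoRA with the lcm sampler, 4 steps and cfg-scale 1
./bin/sd -m ../models/v1-5-pruned-emaonly.safetensors -p "a lovely cat<lora:lcm-lora-sdv1-5:1>" --sampling-method lcm --steps 4 --cfg-scale 1 --lora-model-dir ../models -v
```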
You can use TAESD to accelerate the decoding of latent images by following these steps:
- Download the model weights from https://huggingface.co/madebyollin/taesd, or use curl:

```sh
curl -L -O https://huggingface.co/madebyollin/taesd/resolve/main/diffusion_pytorch_model.safetensors
```
- Specify the model path using the `--taesd PATH` parameter. Example:
```sh
sd -m ../models/v1-5-pruned-emaonly.safetensors -p "a lovely cat" --taesd ../models/diffusion_pytorch_model.safetensors
```
You can use ESRGAN to upscale the generated images. At the moment, only the RealESRGAN_x4plus_anime_6B.pth model is supported. Support for more models of this architecture will be added soon.
- Specify the model path using the `--upscale-model PATH` parameter. Example:
```sh
sd -m ../models/v1-5-pruned-emaonly.safetensors -p "a lovely cat" --upscale-model ../models/RealESRGAN_x4plus_anime_6B.pth
```
You can also build and run stable-diffusion.cpp with Docker. Build the image:

```sh
docker build -t sd .
```

Run it, mounting your model and output directories:

```sh
docker run -v /path/to/models:/models -v /path/to/output/:/output sd [args...]
# For example
# docker run -v ./models:/models -v ./build:/output sd -m /models/sd-v1-4.ckpt -p "a lovely cat" -v -o /output/output.png
```
precision | f32 | f16 | q8_0 | q5_0 | q5_1 | q4_0 | q4_1 |
---|---|---|---|---|---|---|---|
Memory (txt2img - 512 x 512) | ~2.8G | ~2.3G | ~2.1G | ~2.0G | ~2.0G | ~2.0G | ~2.0G |
Memory (txt2img - 512 x 512) with Flash Attention | ~2.4G | ~1.9G | ~1.6G | ~1.5G | ~1.5G | ~1.5G | ~1.5G |
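If memory is still tight, the `--vae-tiling` flag listed in the usage above processes the VAE in tiles to reduce usage further; an illustrative invocation:

```sh
# Same txt2img run, but decode the latent image through the VAE tile by tile
./bin/sd -m ../models/sd-v1-4.ckpt -p "a lovely cat" --vae-tiling
```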
Thank you to all the people who have already contributed to stable-diffusion.cpp!