ℹ️ No data is sent to any third parties except through the opt-in flags `report_to` and `push_to_hub`, or via webhooks, all of which must be manually configured.
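For example, external reporting only happens when you pass the relevant options yourself. A minimal sketch follows; `report_to` and `push_to_hub` are the opt-in options named above, while the `train.py` entry point and the `--webhook_config` flag are illustrative assumptions to verify against the options documentation:

```bash
# All reporting integrations are opt-in; nothing below is enabled by default.
# Entry point and --webhook_config are illustrative assumptions -- check OPTIONS.md.
python train.py \
  --report_to=wandb \
  --push_to_hub \
  --webhook_config=config/webhooks.json
```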
SimpleTuner is geared towards simplicity, with a focus on making the code easily understood. This codebase serves as a shared academic exercise, and contributions are welcome.
If you'd like to join our community, we can be found on Discord via Terminus Research Group. If you have any questions, please feel free to reach out to us there.
- Simplicity: Aiming to have good default settings for most use cases, so less tinkering is required.
- Versatility: Designed to handle a wide range of image quantities - from small datasets to extensive collections.
- Cutting-Edge Features: Only incorporates features that have proven efficacy, avoiding the addition of untested options.
Please fully explore this README before embarking on the tutorial, as it contains vital information that you might need to know first.
For a quick start without reading the full documentation, you can use the Quick Start guide.
For memory-constrained systems, see the DeepSpeed document, which explains how to use 🤗Accelerate to configure Microsoft's DeepSpeed for optimiser state offload.
For multi-node distributed training, this guide will help you adapt the configurations from the INSTALL and Quickstart guides for multi-node training and optimise for image datasets numbering in the billions of samples.
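As a rough illustration of both workflows, 🤗Accelerate's launcher can enable DeepSpeed optimiser-state offload and coordinate multiple nodes. The ZeRO stage, addresses, process counts, and the `train.py` entry point below are placeholder assumptions rather than a verified configuration; the linked documents give the authoritative settings:

```bash
# DeepSpeed ZeRO stage 2 with optimiser state offloaded to CPU (single node).
# Entry point and values are placeholders; see the DeepSpeed document for real settings.
accelerate launch --use_deepspeed --zero_stage 2 --offload_optimizer_device cpu train.py

# Two nodes with eight GPUs each; run once per node, setting --machine_rank to 0 and 1.
accelerate launch --multi_gpu --num_machines 2 --num_processes 16 \
  --machine_rank 0 --main_process_ip 10.0.0.1 --main_process_port 29500 train.py
```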
SimpleTuner provides comprehensive training support across multiple diffusion model architectures with consistent feature availability:
- Multi-GPU training - Distributed training across multiple GPUs with automatic optimization
- Advanced caching - Image, video, and caption embeddings cached to disk for faster training
- Aspect bucketing - Support for varied image/video sizes and aspect ratios
- Memory optimization - Most models trainable on 24G GPU, many on 16G with optimizations
- DeepSpeed integration - Train large models on smaller GPUs with gradient checkpointing and optimizer state offload
- S3 training - Train directly from cloud storage (Cloudflare R2, Wasabi S3); see the dataloader sketch after this list
- EMA support - Exponential moving average weights for improved stability and quality
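The caching, aspect-bucketing, and S3 features above are configured through the dataloader rather than the command line. Below is a hedged sketch of a `config/multidatabackend.json` with one S3-style backend plus a text-embed cache; the bucket, endpoint, paths, and several key values are illustrative assumptions, so verify the exact schema against the dataloader documentation:

```bash
# Hypothetical dataloader configuration -- bucket, endpoint, and paths are placeholders.
# Verify key names and values against the dataloader documentation before use.
cat > config/multidatabackend.json <<'EOF'
[
  {
    "id": "my-images",
    "type": "aws",
    "aws_bucket_name": "my-training-bucket",
    "aws_endpoint_url": "https://example.r2.cloudflarestorage.com",
    "caption_strategy": "textfile",
    "resolution": 1024,
    "resolution_type": "pixel",
    "crop": false,
    "cache_dir_vae": "cache/vae/my-images"
  },
  {
    "id": "text-embed-cache",
    "dataset_type": "text_embeds",
    "type": "local",
    "default": true,
    "cache_dir": "cache/text"
  }
]
EOF
```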
| Model | Parameters | PEFT LoRA | Lycoris | Full-Rank | ControlNet | Quantization | Flow Matching | Text Encoders |
|---|---|---|---|---|---|---|---|---|
| Stable Diffusion XL | 3.5B | ✓ | ✓ | ✓ | ✓ | int8/nf4 | ✗ | CLIP-L/G |
| Stable Diffusion 3 | 2B-8B | ✓ | ✓ | ✓* | ✓ | int8/fp8/nf4 | ✓ | CLIP-L/G + T5-XXL |
| Flux.1 | 12B | ✓ | ✓ | ✓* | ✓ | int8/fp8/nf4 | ✓ | CLIP-L + T5-XXL |
| Auraflow | 6.8B | ✓ | ✓ | ✓* | ✓ | int8/fp8/nf4 | ✓ | UMT5-XXL |
| PixArt Sigma | 0.6B-0.9B | ✓ | ✓ | ✓ | ✓ | int8 | ✗ | T5-XXL |
| Sana | 0.6B-4.8B | ✓ | ✓ | ✓ | ✗ | int8 | ✓ | Gemma2-2B |
| Lumina2 | 2B | ✓ | ✓ | ✓ | ✗ | int8 | ✓ | Gemma2 |
| Kwai Kolors | 5B | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | ChatGLM-6B |
| LTX Video | 5B | ✓ | ✓ | ✓ | ✗ | int8/fp8 | ✓ | T5-XXL |
| Wan Video | 1.3B-14B | ✓ | ✓ | ✓* | ✗ | int8 | ✓ | UMT5 |
| HiDream | 17B (8.5B MoE) | ✓ | ✓ | ✓* | ✓ | int8/fp8/nf4 | ✓ | CLIP-L + T5-XXL + Llama |
| Cosmos2 | 2B-14B | ✓ | ✓ | ✓ | ✗ | int8 | ✓ | T5-XXL |
| OmniGen | 3.8B | ✓ | ✓ | ✓ | ✗ | int8/fp8 | ✓ | T5-XXL |
| Qwen Image | 20B | ✓ | ✓ | ✓* | ✗ | int8/nf4 (req.) | ✓ | T5-XXL |
| SD 1.x/2.x (Legacy) | 0.9B | ✓ | ✓ | ✓ | ✓ | int8/nf4 | ✗ | CLIP-L |

✓ = Supported, ✗ = Not supported, * = Requires DeepSpeed for full-rank training
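In practice, the LoRA, Lycoris, and full-rank columns correspond to trainer options. A hedged sketch follows; the flag names reflect my reading of the options documentation and should be treated as assumptions to verify there:

```bash
# Flag names are assumptions -- confirm them in the options documentation.
# PEFT LoRA on Flux.1:
python train.py --model_family=flux --model_type=lora --lora_type=standard

# Lycoris (e.g. LoKr) instead of PEFT LoRA:
python train.py --model_family=flux --model_type=lora --lora_type=lycoris

# Full-rank tuning; models marked * above additionally require DeepSpeed:
python train.py --model_family=sdxl --model_type=full
```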
- TREAD - Token-wise dropout for transformer models, including Kontext training
- Masked loss training - Superior convergence with segmentation/depth guidance
- Prior regularization - Enhanced training stability for character consistency
- Gradient checkpointing - Configurable intervals for memory/speed optimization
- Loss functions - L2, Huber, Smooth L1 with scheduling support (see the example after this list)
- SNR weighting - Min-SNR gamma weighting for improved training dynamics
- Flux Kontext - Edit conditioning and image-to-image training for Flux models
- PixArt two-stage - eDiff training pipeline support for PixArt Sigma
- Flow matching models - Advanced scheduling with beta/uniform distributions
- HiDream MoE - Mixture of Experts gate loss augmentation
- T5 masked training - Enhanced fine details for Flux and compatible models
- QKV fusion - Memory and speed optimizations (Flux, Lumina2)
- TREAD integration - Selective token routing for Wan and Flux models
- Classifier-free guidance - Optional CFG reintroduction for distilled models
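Several of the items above, such as the loss functions, min-SNR weighting, and gradient-checkpointing intervals, are ordinary trainer options. A hedged sketch of how they might be combined (flag names and values are assumptions; confirm them in the options documentation):

```bash
# Assumed flag names -- confirm against the options documentation before use.
python train.py \
  --snr_gamma=5.0 \
  --loss_type=huber \
  --gradient_checkpointing \
  --gradient_checkpointing_interval=2 \
  --use_ema
```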
Detailed quickstart guides are available for all supported models:
- Flux.1 Guide - Includes Kontext editing support and QKV fusion
- Stable Diffusion 3 Guide - Full and LoRA training with ControlNet
- Stable Diffusion XL Guide - Complete SDXL training pipeline
- Auraflow Guide - Flow-matching model training
- PixArt Sigma Guide - DiT model with two-stage support
- Sana Guide - Lightweight flow-matching model
- Lumina2 Guide - 2B parameter flow-matching model
- Kwai Kolors Guide - SDXL-based with ChatGLM encoder
- LTX Video Guide - Video diffusion training
- Wan Video Guide - Video flow-matching with TREAD support
- HiDream Guide - MoE model with advanced features
- Cosmos2 Guide - Multi-modal image generation
- OmniGen Guide - Unified image generation model
- Qwen Image Guide - 20B parameter large-scale training
- NVIDIA: RTX 3080+ recommended (tested up to H200)
- AMD: 7900 XTX 24GB and MI300X verified (higher memory usage vs NVIDIA)
- Apple: M3 Max+ with 24GB+ unified memory for LoRA training
- Large models (12B+): A100-80G for full-rank, 24G+ for LoRA/Lycoris
- Medium models (2B-8B): 16G+ for LoRA, 40G+ for full-rank training
- Small models (<2B): 12G+ sufficient for most training types
Note: Quantization (int8/fp8/nf4) significantly reduces memory requirements. See individual quickstart guides for model-specific requirements.
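For instance, quantizing the base model is usually a single option; the value names below are assumptions that vary by backend, so take the exact spelling from your model's quickstart:

```bash
# Quantize the frozen base model to reduce VRAM (value names are assumptions).
python train.py --base_model_precision=int8-quanto   # int8 via Quanto
python train.py --base_model_precision=nf4-bnb       # nf4 via bitsandbytes
```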
SimpleTuner can be installed via pip for most users:
```bash
# Base installation (CPU-only PyTorch)
pip install simpletuner

# CUDA users (NVIDIA GPUs)
pip install simpletuner[cuda]

# ROCm users (AMD GPUs)
pip install simpletuner[rocm]

# Apple Silicon users (M1/M2/M3/M4 Macs)
pip install simpletuner[apple]
```
For manual installation or development setup, see the installation documentation.
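As a rough sketch, a development setup might look like the following; the installation documentation is authoritative, and the virtual-environment step is just one common choice:

```bash
# Editable install from a source checkout -- see the installation documentation
# for the full, authoritative steps and the optional CUDA/ROCm/Apple extras.
git clone https://github.com/bghira/SimpleTuner.git
cd SimpleTuner
python -m venv .venv && source .venv/bin/activate
pip install -e .
```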
Enable debug logs for more detailed insight by adding `export SIMPLETUNER_LOG_LEVEL=DEBUG` to your environment file (`config/config.env`).
For performance analysis of the training loop, setting `SIMPLETUNER_TRAINING_LOOP_LOG_LEVEL=DEBUG` will print timestamps that highlight any issues in your configuration.
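Both variables can simply be exported from the environment file, for example:

```bash
# config/config.env
export SIMPLETUNER_LOG_LEVEL=DEBUG                 # verbose logging for general debugging
export SIMPLETUNER_TRAINING_LOOP_LOG_LEVEL=DEBUG   # timestamped training-loop analysis
```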
For a comprehensive list of options available, consult this documentation.