What settings to use for 16GB VRAM LoRA (SD3.5-Large, Flux)?
The SD3.5 quickstart guide mentions a lowest-VRAM config (bnb-nf4 quantisation, bnb-lion8bit-paged optimizer) which seems to consume less than 10G of VRAM while training. The default settings (adamw_bf16 optimizer, int8-quanto), however, immediately OOM on 16GB.
Some kind of middle-ground setting for 16GB VRAM cards would be much appreciated. :)
there really isn't one. use nf4 too.
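for reference, the quickstart's lowest-VRAM combo looks roughly like this on the command line. a sketch only: the `--optimizer` flag name and the exact nf4 value spelling are assumptions here, so verify against OPTIONS.md / `train.py --help` for your version.

```bash
# lowest-VRAM combination from the SD3.5 quickstart (reportedly <10G):
# nf4 base model quantisation plus the paged 8-bit Lion optimizer.
# flag spellings are assumptions -- check train.py --help for your version,
# and add your usual dataset/LoRA arguments.
python train.py \
  --base_model_precision=bnb-nf4 \
  --optimizer=bnb-lion8bit-paged
```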
somewhat related, but maybe this will help someone else: int8-quanto works just fine on a 16G GPU, but you need to set `--quantize_via=cpu` manually. The option defaults to the cuda device, which causes the transformer to be loaded onto the poor graphics card in bf16 weights; naturally it doesn't fit and OOMs.
it would be great if this were mentioned explicitly in the Flux quickstart somewhere; I had to do some very mild source digging to figure it out.
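for reference, the invocation that worked for me was roughly this (a sketch: only `--base_model_precision` and `--quantize_via` come from this thread, everything else is elided):

```bash
# int8-quanto on a 16G card: quantise on the CPU so the full bf16 weights
# are never materialised on the GPU. takes up to ~60s instead of
# milliseconds, but avoids the OOM.
python train.py \
  --base_model_precision=int8-quanto \
  --quantize_via=cpu
```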
it is 🙈
> ### Crashing
> - If you get SIGKILL after the text encoders are unloaded, this means you do not have enough system memory to quantise Flux.
> - Try loading with `--base_model_precision=bf16`, but if that does not work, you might just need more memory.
> - Try `--quantize_via=accelerator` to use the GPU instead.
well, it's described more accurately in OPTIONS.md or `train.py --help`:
```python
parser.add_argument(
    "--quantize_via",
    type=str,
    choices=["cpu", "accelerator"],
    default="accelerator",
    help=(
        "When quantising the model, the quantisation process can be done on the CPU or the accelerator."
        " When done on the accelerator (default), slightly more VRAM is required, but the process completes in milliseconds."
        " When done on the CPU, the process may take upwards of 60 seconds, but can complete without OOM on 16G cards."
    ),
)
```
i know, i've found the paragraph you're quoting. it says to use the gpu when there's not enough ram, but it doesn't tell you to use the cpu, and cpu is not the default. if cpu were used by default, i wouldn't have posted this suggestion.
edit: anyway, thanks for SimpleTuner. so far i like it much more than kohya-ss. i'm glad there are smart people out there who understand all this stuff so we unwashed plebs can make funny pictures. :)
i think it actually used to detect GPUs <= 16G and would force the quantisation to happen on the CPU, but it wasn't very reliable
this makes much more sense; it's unfortunate that it didn't work out.
maybe this option could be added to the code block here - https://github.com/bghira/SimpleTuner/blob/main/documentation/quickstart/FLUX.md#quantised-model-training - since it's all about 16G training?
edit: oh, you've already updated the docs, thanks!