An error occurred while writing embeddings to disk. (Flux full fine tune)
Closed this issue · 18 comments
I am trying to do full fine tuning on Flux dev. I followed the quickstart guide. The exact same process works fine for SD 3.5 using the same dataset.
I've also tried different datasets and LoRA training, on both RunPod and a dedicated server.
But no matter what I do it errors while writing embeds to disk when using Flux.
I've tried ai-toolkit and kohya with the same dataset for flux lora training without issues, which I assume rules out corruption of any files in the dataset.
Write embeds to disk:  51%|████████████████████████████████▍ | 271/534 [00:33<00:30, 8.65it/s]
2024-12-10 14:14:18,422 [ERROR] An error occurred while writing embeddings to disk.
Traceback (most recent call last):
  File "/workspace/SimpleTuner/helpers/caching/text_embeds.py", line 241, in batch_write_embeddings
    self.process_write_batch(batch)
  File "/workspace/SimpleTuner/helpers/caching/text_embeds.py", line 273, in process_write_batch
    future.result()  # Wait for all writes to complete
    ^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/SimpleTuner/helpers/data_backend/local.py", line 261, in torch_save
    os.rename(temp_file_path, filepath)
FileNotFoundError: [Errno 2] No such file or directory: '/workspace/SimpleTuner/cache/text/.b43332dd16f7a874aca03f64290d7532-flux.pt.tmp0' -> '/workspace/SimpleTuner/cache/text/b43332dd16f7a874aca03f64290d7532-flux.pt'
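For anyone reading the traceback: the embed is written to a hidden temp file and then renamed into place, and the rename is what fails. A minimal sketch of that pattern (illustrative only, not SimpleTuner's actual `torch_save`; filenames here are placeholders):

```python
import os
import torch
from concurrent.futures import ThreadPoolExecutor

def torch_save_with_rename(obj, filepath: str) -> None:
    # Save to a hidden temp file next to the target, then rename into place.
    # If the temp file is gone by the time os.rename runs (e.g. two workers
    # racing on the same target path, or the cache dir being cleared
    # concurrently), this raises FileNotFoundError as in the traceback above.
    directory, name = os.path.split(filepath)
    os.makedirs(directory, exist_ok=True)
    temp_file_path = os.path.join(directory, f".{name}.tmp0")
    torch.save(obj, temp_file_path)
    os.rename(temp_file_path, filepath)

# The batch writer submits these saves to a thread pool and waits on the
# futures, so a worker's exception resurfaces via future.result().
with ThreadPoolExecutor(max_workers=4) as pool:
    future = pool.submit(torch_save_with_rename, torch.zeros(4), "cache/text/example-flux.pt")
    future.result()
```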
this is a duplicate issue. it is resolved on main branch.
I deployed a new runpod instance to double check with the main branch for the 5th time or so.
On the main branch the write-to-disk error no longer appears, but the end result is basically the same.
Now I get an error regarding caching instead, and the "vae-1024" folder is just empty.
Epoch 1/100 Steps: 0%| | 0/106800 [00:00<?, ?it/s](id=my-dataset-1024) Some images were not correctly cached during the VAE Cache operations. Ensure --skip_file_discovery=vae is not set.
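A quick way to confirm what actually landed in the VAE cache (directory per `cache_dir_vae` in the dataset config; adjust the path to your setup):

```python
from pathlib import Path

# Count cached VAE embeds on disk; prints 0 when the cache folder is empty.
cache = Path("cache/vae-1024")
count = sum(1 for _ in cache.rglob("*.pt")) if cache.is_dir() else 0
print(count)
```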
@bghira I assume you didn't get a notification from my response as you immediately closed this issue.
I did, but I've been busy.
I don't think the logic itself is the reason the VAE embeds are missing. however, to confirm this, you can try this pull request, which uses `atomicwrites` to handle temp files instead of my prev custom logic.
if the vae embeds are still missing after this, i would recommend checking the logs for `WARNING` and `ERROR` messages to see whether there is some hint as to why images are not encoding or getting ignored.
for example, using a `vae_batch_size` that is too large on a 24G card with high resolution will OOM.
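For reference, a rough sketch of what an `atomicwrites`-based save could look like (assuming the `atomicwrites` package; this is illustrative, not necessarily what the PR does):

```python
import io
import torch
from atomicwrites import atomic_write  # pip install atomicwrites

def torch_save_atomic(obj, filepath: str) -> None:
    # atomic_write stages the data in a temp file and only renames it into
    # place on successful close, so a failed write never leaves a partial
    # or missing .pt file behind.
    buffer = io.BytesIO()
    torch.save(obj, buffer)
    with atomic_write(filepath, mode="wb", overwrite=True) as f:
        f.write(buffer.getvalue())
```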
Unfortunately made no difference for me.
I don't think any of these warnings are relevant:
2024-12-11 19:28:50,786 [WARNING] Skipping false argument: --disable_benchmark
2024-12-11 19:28:50,786 [WARNING] Skipping false argument: --validation_torch_compile
2024-12-11 19:28:50,790 [WARNING] The VAE model madebyollin/sdxl-vae-fp16-fix is not compatible. Please use a compatible VAE to eliminate this warning. The baked-in VAE will be used, instead.
2024-12-11 19:28:50,790 [WARNING] Updating T5 XXL tokeniser max length to 512 for Flux.
2024-12-11 19:29:17,450 [WARNING] Not using caption dropout will potentially lead to overfitting on captions, eg. CFG will not work very well. Set --caption_dropout_probability=0.1 as a recommended value.
2024-12-11 19:29:17,454 [WARNING] No cache file found, creating new one.
Other than those there is no indication of anything going wrong. I am testing on an A100 with no added parameters outside of the configure script.
The same method works just fine when I test it on SD 3.5, for example, but Flux just refuses to work.
it's weird because my test dataset was used on Flux and I had it clear all of the embeds to generate them first. then training continued as expected; this was on an M3 Max notebook as well as a 4090
can you share other details about the setup (eg. config.json and multidatabackend.json)? or maybe set `SIMPLETUNER_LOG_LEVEL=DEBUG` in `config/config.env` (you may have to create this file) and then check `debug.log` in the simpletuner directory for any details in the `VAECache` log lines that indicate whether it even tried to cache any image embeds etc.
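i.e. something along these lines:

```
# config/config.env (create this file if it does not exist)
SIMPLETUNER_LOG_LEVEL=DEBUG
```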
I just use what the configure script spits out; I removed a few of the entries in the multidatabackend to speed things up when testing this. I have tried enabling `SIMPLETUNER_LOG_LEVEL=DEBUG` before and didn't see anything notable in the console, but maybe the log file has more info.
I could check again tomorrow.
[
{
"id": "my-dataset-1024",
"type": "local",
"instance_data_dir": "images/test",
"crop": false,
"crop_style": "random",
"minimum_image_size": 128,
"resolution": 1024,
"resolution_type": "pixel_area",
"repeats": 1,
"metadata_backend": "discovery",
"caption_strategy": "textfile",
"cache_dir_vae": "cache//vae-1024"
},
{
"id": "text-embed-cache",
"dataset_type": "text_embeds",
"default": true,
"type": "local",
"cache_dir": "cache//text"
}
]
{
"--resume_from_checkpoint": "latest",
"--data_backend_config": "config/multidatabackend.json",
"--aspect_bucket_rounding": 2,
"--seed": 42,
"--minimum_image_size": 0,
"--disable_benchmark": false,
"--output_dir": "output/models",
"--num_train_epochs": 100,
"--max_train_steps": 0,
"--checkpointing_steps": 500,
"--checkpoints_total_limit": 20,
"--attention_mechanism": "diffusers",
"--report_to": "none",
"--model_type": "full",
"--pretrained_model_name_or_path": "black-forest-labs/FLUX.1-dev",
"--model_family": "flux",
"--train_batch_size": 1,
"--gradient_checkpointing": "true",
"--caption_dropout_probability": 0.0,
"--resolution_type": "pixel_area",
"--resolution": 1024,
"--validation_seed": 42,
"--validation_steps": 500,
"--validation_resolution": "1024x1024",
"--validation_guidance": 3.0,
"--validation_guidance_rescale": "0.0",
"--validation_num_inference_steps": "20",
"--validation_prompt": "A photo-realistic image of a cat",
"--mixed_precision": "bf16",
"--optimizer": "adamw_bf16",
"--learning_rate": "5e-5",
"--lr_scheduler": "polynomial",
"--lr_warmup_steps": "0",
"--validation_torch_compile": "false"
}
the file is the only place any debug logs go, the console log will be unchanged
The issue might be related to subfolders: when I pointed the multidatabackend `instance_data_dir` directly at one of my subfolders with images in it, it created VAE cache files for those images.
This is very strange though, as it finds all the .txt files in the subfolders just fine. And subfolders are not an issue on SD 3.5.
@bghira If the problem is somehow related to folder structure, that could also explain why no one else has reported any issues, as 99% of people are training single-concept LoRAs.
other issues with subfolders have been reported before and the guidance is usually "don't use subfolders, or add each subfolder as its own dataset entry"
it actually looks like you are using relative paths to the image data dir, which can't be done when using subfolders. try using a full path.
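for example, something like this (a sketch; paths and folder names are placeholders, with one dataset entry per subfolder and an absolute `instance_data_dir`):

```json
[
  {
    "id": "my-dataset-1024-subfolder-a",
    "type": "local",
    "instance_data_dir": "/workspace/SimpleTuner/images/test/subfolder-a",
    "crop": false,
    "resolution": 1024,
    "resolution_type": "pixel_area",
    "caption_strategy": "textfile",
    "cache_dir_vae": "/workspace/SimpleTuner/cache/vae-1024-a"
  },
  {
    "id": "text-embed-cache",
    "dataset_type": "text_embeds",
    "default": true,
    "type": "local",
    "cache_dir": "/workspace/SimpleTuner/cache/text"
  }
]
```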
It "can't" be done for Flux specifically, only for VAE files?
it is a problem for every single type of model trainable by simpletuner when using subdirectory structures. it has trouble linking images to their cache dir since the strings are harder to work with. it doesn't force the directory to be an absolute path for other reasons. it's not an easy problem to solve and so it remains like this as there is not a whole lot of free time to sit and work on an edge case really.
I don't see how the logic for this would change depending on model.
it doesn't change, that is what i just said. anyway. it will not be worked on at this point, but pull requests to resolve the issue can be reviewed for inclusion. sorry.
Fair enough, I don't expect you to drop anything if the only issue is absolute vs. relative paths. But I am so confused as to why this would not be an issue for SD 3.5 then.
yeah i'm not sure it is working fully/correctly there if it's not working for flux, and it definitely breaks for SDXL where this was initially reported
the thing is that the embedding logic is identical up to the point where the actual embed computation happens. the file path handling is all abstracted away from the embed computation. which is why it is a problem across the board when it happens.
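To illustrate the kind of string handling that gets fragile (a purely hypothetical sketch, not SimpleTuner's actual code): deriving a cache filename from an image path only works if the data-dir prefix actually matches the discovered image paths, which is exactly where a relative `instance_data_dir` plus subfolders can go wrong.

```python
import os

# Hypothetical sketch: derive a cache path from an image path by stripping
# the dataset prefix.
def cache_path_for(image_path: str, data_dir: str, cache_dir: str) -> str:
    # removeprefix() silently does nothing when the prefix doesn't match,
    # so a relative data_dir combined with absolute discovered paths yields
    # a key that never lines up with what the cache writer produced.
    key = image_path.removeprefix(data_dir).lstrip(os.sep)
    return os.path.join(cache_dir, os.path.splitext(key)[0] + ".pt")

print(cache_path_for("images/test/cats/001.png", "images/test", "cache/vae-1024"))
# cache/vae-1024/cats/001.pt
print(cache_path_for("/workspace/images/test/cats/001.png", "images/test", "cache/vae-1024"))
# cache/vae-1024/workspace/images/test/cats/001.pt  <- mismatch, embed looks "missing"
```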