An error occurred while writing embeddings to disk. (Flux full fine tune)
Closed this issue · 18 comments
I am trying to do full fine tuning on Flux dev. I followed the quickstart guide. The exact same process works fine for SD 3.5 using the same dataset.
I've also tried different datasets and LoRA training, on both RunPod and a dedicated server.
But no matter what I do it errors while writing embeds to disk when using Flux.
I've tried ai-toolkit and kohya with the same dataset for flux lora training without issues, which I assume rules out corruption of any files in the dataset.
Write embeds to disk:  51%|████████████████████████████████▍ | 271/534 [00:33<00:30, 8.65it/s]
2024-12-10 14:14:18,422 [ERROR] An error occurred while writing embeddings to disk.
Traceback (most recent call last):
  File "/workspace/SimpleTuner/helpers/caching/text_embeds.py", line 241, in batch_write_embeddings
    self.process_write_batch(batch)
  File "/workspace/SimpleTuner/helpers/caching/text_embeds.py", line 273, in process_write_batch
    future.result()  # Wait for all writes to complete
    ^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/SimpleTuner/helpers/data_backend/local.py", line 261, in torch_save
    os.rename(temp_file_path, filepath)
FileNotFoundError: [Errno 2] No such file or directory: '/workspace/SimpleTuner/cache/text/.b43332dd16f7a874aca03f64290d7532-flux.pt.tmp0' -> '/workspace/SimpleTuner/cache/text/b43332dd16f7a874aca03f64290d7532-flux.pt'
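For anyone reading the traceback: the embed is written to a hidden temp file and then renamed into place, and the rename is what fails. A minimal sketch of that pattern (illustrative only, not SimpleTuner's actual `torch_save`; filenames here are placeholders):

```python
import os
import torch
from concurrent.futures import ThreadPoolExecutor

def torch_save_with_rename(obj, filepath: str) -> None:
    # Save to a hidden temp file next to the target, then rename into place.
    # If the temp file is gone by the time os.rename runs (e.g. two workers
    # racing on the same target path, or the cache dir being cleared
    # concurrently), this raises FileNotFoundError as in the traceback above.
    directory, name = os.path.split(filepath)
    os.makedirs(directory, exist_ok=True)
    temp_file_path = os.path.join(directory, f".{name}.tmp0")
    torch.save(obj, temp_file_path)
    os.rename(temp_file_path, filepath)

# The batch writer submits these saves to a thread pool and waits on the
# futures, so a worker's exception resurfaces via future.result().
with ThreadPoolExecutor(max_workers=4) as pool:
    future = pool.submit(torch_save_with_rename, torch.zeros(4), "cache/text/example-flux.pt")
    future.result()
```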
this is a duplicate issue. it is resolved on main branch.
I deployed a new runpod instance to double check with the main branch for the 5th time or so.
On the main branch the write-to-disk error no longer appears, but the end result is basically the same.
Now I get an error regarding caching instead, and the "vae-1024" folder is just empty.
Epoch 1/100 Steps: 0%| | 0/106800 [00:00<?, ?it/s](id=my-dataset-1024) Some images were not correctly cached during the VAE Cache operations. Ensure --skip_file_discovery=vae is not set.
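A quick way to confirm what actually landed in the VAE cache (directory per `cache_dir_vae` in the dataset config; adjust the path to your setup):

```python
from pathlib import Path

# Count cached VAE embeds on disk; prints 0 when the cache folder is empty.
cache = Path("cache/vae-1024")
count = sum(1 for _ in cache.rglob("*.pt")) if cache.is_dir() else 0
print(count)
```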
@bghira I assume you didn't get a notification from my response as you immediately closed this issue.
I did, but I've been busy.
I don't think the logic itself is the reason the VAE embeds are missing. however, to confirm this, you can try this pull request, which uses `atomicwrites` to handle temp files instead of my prev custom logic.
if the vae embeds are still missing after this, i would recommend checking the logs for `WARNING` and `ERROR` messages to see whether there is some hint as to why images are not encoding or getting ignored.
for example, using a `vae_batch_size` that is too large on a 24G card with high resolution will OOM.
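For reference, a rough sketch of what an `atomicwrites`-based save could look like (assuming the `atomicwrites` package; this is illustrative, not necessarily what the PR does):

```python
import io
import torch
from atomicwrites import atomic_write  # pip install atomicwrites

def torch_save_atomic(obj, filepath: str) -> None:
    # atomic_write stages the data in a temp file and only renames it into
    # place on successful close, so a failed write never leaves a partial
    # or missing .pt file behind.
    buffer = io.BytesIO()
    torch.save(obj, buffer)
    with atomic_write(filepath, mode="wb", overwrite=True) as f:
        f.write(buffer.getvalue())
```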
Unfortunately made no difference for me.
I don't think any of these warnings are relevant:
2024-12-11 19:28:50,786 [WARNING] Skipping false argument: --disable_benchmark
2024-12-11 19:28:50,786 [WARNING] Skipping false argument: --validation_torch_compile
2024-12-11 19:28:50,790 [WARNING] The VAE model madebyollin/sdxl-vae-fp16-fix is not compatible. Please use a compatible VAE to eliminate this warning. The baked-in VAE will be used, instead.
2024-12-11 19:28:50,790 [WARNING] Updating T5 XXL tokeniser max length to 512 for Flux.
2024-12-11 19:29:17,450 [WARNING] Not using caption dropout will potentially lead to overfitting on captions, eg. CFG will not work very well. Set --caption_dropout_probability=0.1 as a recommended value.
2024-12-11 19:29:17,454 [WARNING] No cache file found, creating new one.
Other than those there is no indication of anything going wrong. I am testing on an A100 with no added parameters outside of the configure script.
The same method works just fine when I test it on SD 3.5, for example, but Flux just refuses to work.
it's weird because my test dataset was used on Flux and I had it clear all of the embeds to generate them first. then training continued as expected; this was on an M3 Max notebook as well as a 4090
can you share other details about the setup (eg. config.json and multidatabackend.json)? or maybe set `SIMPLETUNER_LOG_LEVEL=DEBUG` in `config/config.env` (you may have to create this file) and then check `debug.log` in the simpletuner directory for any details in the `VAECache` log lines that indicate whether it even tried to cache any image embeds etc.
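i.e. something along these lines:

```
# config/config.env (create this file if it does not exist)
SIMPLETUNER_LOG_LEVEL=DEBUG
```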
I just use what the configure script spits out; I removed a few of the entries in the multidatabackend to speed things up when testing this. I have tried enabling `SIMPLETUNER_LOG_LEVEL=DEBUG` before and didn't see anything notable in the console, but maybe the log file has more info.
I could check again tomorrow.
[
{
"id": "my-dataset-1024",
"type": "local",
"instance_data_dir": "images/test",
"crop": false,
"crop_style": "random",
"minimum_image_size": 128,
"resolution": 1024,
"resolution_type": "pixel_area",
"repeats": 1,
"metadata_backend": "discovery",
"caption_strategy": "textfile",
"cache_dir_vae": "cache//vae-1024"
},
{
"id": "text-embed-cache",
"dataset_type": "text_embeds",
"default": true,
"type": "local",
"cache_dir": "cache//text"
}
]
{
"--resume_from_checkpoint": "latest",
"--data_backend_config": "config/multidatabackend.json",
"--aspect_bucket_rounding": 2,
"--seed": 42,
"--minimum_image_size": 0,
"--disable_benchmark": false,
"--output_dir": "output/models",
"--num_train_epochs": 100,
"--max_train_steps": 0,
"--checkpointing_steps": 500,
"--checkpoints_total_limit": 20,
"--attention_mechanism": "diffusers",
"--report_to": "none",
"--model_type": "full",
"--pretrained_model_name_or_path": "black-forest-labs/FLUX.1-dev",
"--model_family": "flux",
"--train_batch_size": 1,
"--gradient_checkpointing": "true",
"--caption_dropout_probability": 0.0,
"--resolution_type": "pixel_area",
"--resolution": 1024,
"--validation_seed": 42,
"--validation_steps": 500,
"--validation_resolution": "1024x1024",
"--validation_guidance": 3.0,
"--validation_guidance_rescale": "0.0",
"--validation_num_inference_steps": "20",
"--validation_prompt": "A photo-realistic image of a cat",
"--mixed_precision": "bf16",
"--optimizer": "adamw_bf16",
"--learning_rate": "5e-5",
"--lr_scheduler": "polynomial",
"--lr_warmup_steps": "0",
"--validation_torch_compile": "false"
}
the file is the only place any debug logs go, the console log will be unchanged
The issue might be related to subfolders: when I pointed the multidatabackend `instance_data_dir` directly at one of my subfolders with images in it, it created VAE cache files for those images.
This is very strange though, as it finds all the .txt files in the subfolders just fine. And subfolders are not an issue on SD 3.5.
@bghira If the problem is somehow related to folder structure, that could also explain why no one else has reported any issues, as 99% of people are training single-concept LoRAs.
other issues with subfolders have been reported before and the guidance is usually "don't use subfolders, or add each subfolder as its own dataset entry"
it actually looks like you are using relative paths to the image data dir, which can't be done when using subfolders. try using a full path.
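for example, something like this (a sketch; paths and folder names are placeholders, with one dataset entry per subfolder and an absolute `instance_data_dir`):

```json
[
  {
    "id": "my-dataset-1024-subfolder-a",
    "type": "local",
    "instance_data_dir": "/workspace/SimpleTuner/images/test/subfolder-a",
    "crop": false,
    "resolution": 1024,
    "resolution_type": "pixel_area",
    "caption_strategy": "textfile",
    "cache_dir_vae": "/workspace/SimpleTuner/cache/vae-1024-a"
  },
  {
    "id": "text-embed-cache",
    "dataset_type": "text_embeds",
    "default": true,
    "type": "local",
    "cache_dir": "/workspace/SimpleTuner/cache/text"
  }
]
```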
It "can't" be done for Flux specifically, only for VAE files?
it is a problem for every single type of model trainable by simpletuner when using subdirectory structures. it has trouble linking images to their cache dir since the strings are harder to work with. it doesn't force the directory to be an absolute path for other reasons. it's not an easy problem to solve and so it remains like this as there is not a whole lot of free time to sit and work on an edge case really.
I don't see how the logic for this would change depending on model.
it doesn't change, that is what i just said. anyway. it will not be worked on at this point, but pull requests to resolve the issue can be reviewed for inclusion. sorry.
Fair enough, I don't expect you to drop anything if the only issue is absolute vs. relative paths. But I am so confused as to why this would not be an issue for SD 3.5 then.
yeah i'm not sure it is working fully/correctly there if it's not working for flux, and it definitely breaks for SDXL where this was initially reported
the thing is that the embedding logic is identical up to the point where the actual embed computation happens. the file path handling is all abstracted away from the embed computation. which is why it is a problem across the board when it happens.
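To illustrate the kind of string handling that gets fragile (a purely hypothetical sketch, not SimpleTuner's actual code): deriving a cache filename from an image path only works if the data-dir prefix actually matches the discovered image paths, which is exactly where a relative `instance_data_dir` plus subfolders can go wrong.

```python
import os

# Hypothetical sketch: derive a cache path from an image path by stripping
# the dataset prefix.
def cache_path_for(image_path: str, data_dir: str, cache_dir: str) -> str:
    # removeprefix() silently does nothing when the prefix doesn't match,
    # so a relative data_dir combined with absolute discovered paths yields
    # a key that never lines up with what the cache writer produced.
    key = image_path.removeprefix(data_dir).lstrip(os.sep)
    return os.path.join(cache_dir, os.path.splitext(key)[0] + ".pt")

print(cache_path_for("images/test/cats/001.png", "images/test", "cache/vae-1024"))
# cache/vae-1024/cats/001.pt
print(cache_path_for("/workspace/images/test/cats/001.png", "images/test", "cache/vae-1024"))
# cache/vae-1024/workspace/images/test/cats/001.pt  <- mismatch, embed looks "missing"
```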